
Moore's Law

Started by February 02, 2013 07:59 AM
16 comments, last by Sirisian 11 years, 9 months ago

Moore only said that transistor counts double every 2-3 years; over the years it's become synonymous with "processing power doubles every three years", which is really what the consumer cares about anyhow. There's a relationship between performance and number of transistors, sure, but no one cares about how many transistors their CPU sports.

To continue the single-threaded performance curve, I suspect that chip manufacturers will find ways to get around the slowing of Moore's law. As we approach the limit of transistor shrinks on current technology, we'll just switch to new materials and processes, make bigger and bigger dies, bigger and bigger wafers, and move into '3D' chips -- stacking 2D silicon dies one on top of the other. There are still hurdles there today, but they're far from insurmountable.

I also think that the transition to predominantly multi-threaded, asynchronous programming can happen quite gradually. As those practices become more mainstream, a chip manufacturer can build future CPUs out of cores that are, for example, only half as fast, but consume only 25% of the area of previous cores -- thus, aggregate performance doubles without increasing area. As apps take advantage of this new way of doing things, they'll still seem twice as fast. That said, some problems are necessarily sequential, and so there will probably always be a place for high single-threaded performance. In the future, I can foresee a CPU that has 1-2 'big' cores and 8-16 (or more) smaller cores -- all running the same instruction set, but with the big ones chasing the highest possible IPC and the little ones chasing the smallest reasonable surface area. It would be best for OSes to know about these different kinds of cores, but you could even handle scheduling onto the appropriate kind of core at the silicon level, and dynamically move threads between kinds of cores.
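To make the scheduling idea a little more concrete, here's a minimal sketch (mine, not from the post above; it assumes a Linux system with pthreads and a hypothetical core numbering where core 0 is a 'big' core and core 1 a 'little' one) of an application pinning a latency-sensitive thread to a fast core while throughput work runs elsewhere:

#include <sched.h>
#include <pthread.h>
#include <thread>

// Pin a std::thread to a single core (Linux/pthreads assumed).
void PinToCore(std::thread& t, int core)
{
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(core, &set);
  pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main()
{
  std::thread latencyThread([]{ /* sequential, latency-sensitive work */ });
  std::thread throughputThread([]{ /* background work, fine on a slower core */ });

  PinToCore(latencyThread, 0);    // assumption: core 0 is a 'big' core
  PinToCore(throughputThread, 1); // assumption: core 1 is a 'little' core

  latencyThread.join();
  throughputThread.join();
}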

Another angle is stream processing (GPGPU, et al.), which accounts for a significant portion of our heaviest workloads. There, they've already figured out how to spread work across thousands of execution units. If your problem can be solved in that computing model, we're already at a point where we can just throw more area at the problem, or spread it across multiple chips trivially.
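As a small illustration of that model on a plain multi-core CPU (again mine, just a minimal sketch): a trivially data-parallel loop split across however many hardware threads are available, with no coordination needed because every element is independent.

#include <algorithm>
#include <thread>
#include <vector>

int main()
{
  std::vector<float> data(1000000, 1.0f);
  unsigned workers = std::max(1u, std::thread::hardware_concurrency());

  std::vector<std::thread> pool;
  std::size_t chunk = data.size() / workers;
  for (unsigned w = 0; w < workers; ++w)
  {
    std::size_t begin = w * chunk;
    std::size_t end = (w + 1 == workers) ? data.size() : begin + chunk;
    pool.emplace_back([&data, begin, end]
    {
      for (std::size_t i = begin; i != end; ++i)
        data[i] = data[i] * 2.0f + 1.0f; // each element is independent
    });
  }
  for (std::size_t t = 0; t != pool.size(); ++t)
    pool[t].join();
}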

The single-threaded performance wall is a hardware issue. The multi-threaded performance wall is a wetware (that's us) issue.

throw table_exception("(╯°□°)╯︵ ┻━┻");

> we'd just have to throw out all our software and start again -- this time teaching everyone multi-core practices from the get-go, so that we can put 100 90's-era CPU cores onto a chip and actually have them be utilized (which was the idea behind Larrabee)

The thing is, while the idea of throwing all the software away isn't workable, the fact is that "we" really should be rethinking the way software engineering is taught.

There needs to be a push away from the 'OOP for everything!' mindset and towards one which highlights OOP's time to shine, while also exposing people to functional programming styles and teaching them to think about data too, so that people have a better understanding of the various ways problems can be solved and we don't get people trying to fill GPUs with C++ virtual monsters which burn resources doing vtable lookups when they really don't need to.

I admit I've been out of educational circles for a while, but if they are still turning out 'OOP are the bestestest!' "programmers" every year, we have no chance of overcoming the problem.
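To make the vtable point a bit more concrete, here is a tiny illustration (mine, not the poster's) of the same per-object update written OOP-style versus as a flat array of data:

#include <memory>
#include <vector>

struct Particle { float x, vx; };

// OOP style: one heap object and one virtual call per particle.
struct Entity
{
  virtual ~Entity() {}
  virtual void Update(float dt) = 0;
};

struct ParticleEntity : public Entity
{
  Particle p;
  virtual void Update(float dt) { p.x += p.vx * dt; }
};

void UpdateOop(std::vector<std::unique_ptr<Entity> >& entities, float dt)
{
  for (std::size_t i = 0; i != entities.size(); ++i)
    entities[i]->Update(dt); // pointer chase + vtable lookup per element
}

// Data-oriented style: contiguous data, one tight loop, trivially
// vectorisable and easy to split across cores.
void UpdateFlat(std::vector<Particle>& particles, float dt)
{
  for (std::size_t i = 0; i != particles.size(); ++i)
    particles[i].x += particles[i].vx * dt;
}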

So should more computer science majors be focused on multi-core/multi-processor programming, if that is the primary way stuff will be heading in the next 20-40 years? Stuff such as distributed systems and concurrency control/parallel computation.

> The thing is, while the idea of throwing all the software away isn't workable, the fact is that "we" really should be rethinking the way software engineering is taught.

> There needs to be a push away from the 'OOP for everything!' mindset and towards one which highlights OOP's time to shine, while also exposing people to functional programming styles and teaching them to think about data too, so that people have a better understanding of the various ways problems can be solved and we don't get people trying to fill GPUs with C++ virtual monsters which burn resources doing vtable lookups when they really don't need to.

> I admit I've been out of educational circles for a while, but if they are still turning out 'OOP are the bestestest!' "programmers" every year, we have no chance of overcoming the problem.

I can back this. More and more frequently, OOP isn't the optimal way to go about things. Being stuck in an OOP mindset can be very damaging in that regard.

I think we should just throw some more information at the compilers so they can do the threading...

That means telling it which calls to 3rd-party stuff are dependent/thread-safe, and defining the different tools needed for threading (similar to how you can override new).

Of course the programmer will always have more information than the compiler, so doing the threading manually might get performance benefits. But if the compiler does it, it will thread everything without mistakes (if done correctly).

I believe there was an Intel compiler (experiment?) for C++ that does just that. Not sure how extensively it could do it.

But this probably works well only for our normal general-purpose CPUs with a bunch of cores, because if we go to more fine-grained parallelization, I would imagine that more changes would need to be made at the algorithmic level, which likely isn't an easy job for compilers.
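One existing way of "throwing more information at the compiler" (OpenMP, rather than the Intel experiment mentioned above) is a pragma that promises the compiler the loop's iterations are independent, and lets the compiler and runtime generate the threads. A minimal sketch (mine; compile with -fopenmp on GCC/Clang or /openmp on MSVC):

#include <vector>

// The pragma is the extra information: it asserts that iterations don't
// depend on each other, so the compiler is free to split them across threads.
void Scale(std::vector<float>& data, float k)
{
  #pragma omp parallel for
  for (int i = 0; i < (int)data.size(); ++i)
    data[i] *= k;
}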

o3o

> I think we should just throw some more information at the compilers so they can do the threading...

In theory, this is one of the benefits of functional programming. Since the order in which things happen doesn't matter, the compiler is free to rearrange the operations as it sees fit, including doing multithreading on its own if it feels like it. No idea how well functional languages cope with this, though (at least current ones).
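Even in C++ you can get a small taste of this with pure functions. A sketch (mine, assuming C++11's std::async): because the calls have no side effects, they can legally run in either order, or at the same time, without changing the result.

#include <future>

int f(int x) { return x * x + 1; } // pure: result depends only on the input

int main()
{
  // Order-independent: f has no side effects, so these may run concurrently.
  std::future<int> a = std::async(std::launch::async, f, 2);
  std::future<int> b = std::async(std::launch::async, f, 3);
  return a.get() + b.get(); // 5 + 10
}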

> Of course the programmer will always have more information than the compiler, so doing the threading manually might get performance benefits. But if the compiler does it, it will thread everything without mistakes (if done correctly).

This used to be the case with code optimization, where a human could generate better assembly than the compiler could. Then over time processors became a lot more complex and compilers became much smarter, and most of the time a compiler is going to beat a human when it comes to optimization since it can see a lot more. I wouldn't be surprised if the same could be applied to making compilers multithread the code.

Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

> I think we should just throw some more information at the compilers so they can do the threading...

> In theory, this is one of the benefits of functional programming. Since the order in which things happen doesn't matter, the compiler is free to rearrange the operations as it sees fit, including doing multithreading on its own if it feels like it. No idea how well functional languages cope with this, though (at least current ones).

> Of course the programmer will always have more information than the compiler, so doing the threading manually might get performance benefits. But if the compiler does it, it will thread everything without mistakes (if done correctly).

> This used to be the case with code optimization, where a human could generate better assembly than the compiler could. Then over time processors became a lot more complex and compilers became much smarter, and most of the time a compiler is going to beat a human when it comes to optimization since it can see a lot more. I wouldn't be surprised if the same could be applied to making compilers multithread the code.

I feel like, unless it was JITed, you would get situations where the compiler gives multiple actions the same weight despite them running much more (or less) frequently than it expects. The compiler would need more data about the expected use of the program, along with the code you are giving it.

> I think we should just throw some more information at the compilers so they can do the threading...

> Of course the programmer will always have more information than the compiler, so doing the threading manually might get performance benefits. But if the compiler does it, it will thread everything without mistakes (if done correctly).

> I believe there was an Intel compiler (experiment?) for C++ that does just that. Not sure how extensively it could do it.

Unfortunately, those two ideas -- "throw some more information at the compilers" and "it will thread everything without mistakes" -- are at odds with each other: if we're supplying the extra info, then we can still make mistakes ;)

Also, generic C++ isn't conducive to automatic concurrency -- not even on a single thread!
For example:


// A base class is assumed here just so the example is self-contained.
struct Command
{
  virtual ~Command() {}
  virtual void Execute() = 0;
};

struct ArrayCopyCommand : public Command
{
  int* from;
  int* to;
  int count;
  virtual void Execute()
  {
    // Copies 'count' ints; looks trivially parallel, but see below.
    for( int i=0; i!=count; ++i )
      to[i] = from[i];
  }
};

The compiler will already try to pull apart this code into a graph of input->process->output chunks (the same graph that's required to generate fine-grained parallel code), in order to generate optimal single-threaded code. Often, the compiler will re-order your code, because a single-threaded CPU may be concurrent internally -- e.g. one instruction may take several cycles before its result is ready, so the compiler wants to move that instruction to be several cycles ahead of the instructions that depend on its results.
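A tiny illustration of that reordering (mine, not from the post): summing an array as one long dependency chain versus two independent chains the compiler/CPU can keep in flight at the same time.

int SumChained(const int* v, int n)
{
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += v[i]; // every add must wait for the previous add's result
  return s;
}

int SumUnrolled(const int* v, int n) // assumes n is a multiple of 2, for brevity
{
  int s0 = 0, s1 = 0;
  for (int i = 0; i < n; i += 2)
  {
    s0 += v[i];     // these two adds are independent of each other,
    s1 += v[i + 1]; // so they can overlap in the pipeline
  }
  return s0 + s1;
}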

However, the C++ language makes this job very tough.
Take the ArrayCopyCommand code above, and use it like this:


ArrayCopyCommand test1, test2;
int data[] = { 1, 2, 3 };

// test1: 'to' overlaps 'from', shifted by one element.
test1.from = data;
test1.to = data+1;
test1.count = 2;
test1.Execute();

// test2: 'to' points at test2.count itself, so the loop body can rewrite
// its own loop bound.
test2.from = data;
test2.to = &test2.count;
test2.count = 42;
test2.Execute();

In the case of test1, every iteration of the loop may in fact be dependent on the iteration that came before (iteration 1 reads the element that iteration 0 just wrote). This means that the compiler can't run that loop in parallel. Even if you had a fantastic compiler that could produce multi-core code, it has to run that loop in sequence in order to be compliant with the language.

In the case of test2, we can see that the loop body may actually change the value of count! This means that the compiler has to assume that the value of count is dependent on the loop body, and might change after each iteration, meaning it can't cache that value and has to re-load it from memory every iteration, again forcing the code to be sequential.

As is, that ArrayCopyCommand class cannot be made parallel, no matter how smart your compiler is, and any large C++ OOP project is going to be absolutely full of these kinds of road-blocks that stop it from being able to fully take advantage of current/future hardware.

To address these issues, it's again up to us programmers to be extremely good at our jobs, and write good code without making simple mistakes...
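For example, here is a sketch (mine) of the kind of "good code" that removes the roadblock: take local copies of the pointers and the count, and use the non-standard but widely supported __restrict extension to promise the compiler that source and destination never overlap, so the iterations become provably independent.

struct ArrayCopyCommandFast : public Command
{
  int* from;
  int* to;
  int count;
  virtual void Execute()
  {
    int* __restrict dst = to;   // promise: dst and src never alias
    int* __restrict src = from;
    int n = count;              // local copy: the loop body can't change it
    for( int i=0; i!=n; ++i )
      dst[i] = src[i];          // iterations are now independent
  }
};

Of course, the promise is now the programmer's responsibility -- feed it overlapping pointers like test1/test2 above and you get garbage instead of a compile error, which is exactly the "being extremely good at our jobs" part.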

Or, if we don't feel like being hardcore C++ experts, we can instead use a language that is conducive to parallelism, like a functional language or a stream-processing language. For example, HLSL shaders look very much like C/C++, but they're designed in such a way that they can be run on thousands of cores, with very little room for programmers to create threading errors like race conditions...

Moore's law will probably continue for a while longer. There have been demonstrations of 7-atom transistors and even a single-atom transistor. Quantum tunneling did offer a hurdle, but it's seeming like less of a hurdle nowadays from what I've read. I remember when Wikipedia stopped their fabrication pages at 11 nm, since that was believed to be the smallest they could go. Then, as new research came out, they changed the page to 10 nm. Then, when more research was done, they added 7 nm and then 5 nm. Humans are notoriously bad at predicting the future.

As Hodgman said, layering is the newer idea: stacking transistors and building 3D networks. I think a core part of this will be fast interconnects at that scale. Processors are already extremely tiny. Xeon Phi, for instance, increases transistor counts by just moving away from the single-socket approach. We can easily double the number of transistors in a system by putting two chips on a board; keep doing that and you end up with Xeon Phi's 62 cores, which packs 5 billion transistors per card at 22 nm. Intel's already designing for 10 nm. Once there, it'll only be a couple more steps before they're working with a handful of atoms.

One thing I think is going to happen, though, in relation to processing speed is a huge drop in the number of simple instructions: a switch to a purely SIMD architecture with highly CISC instructions for specialized operations (stuff like multiplying 1 number costing the same as multiplying 16 at the same time). We're already seeing that with encryption and compression algorithms, which have special instructions so that they run way faster than with the basic RISC set of instructions. I think at one point we'll start to see stuff like hundreds of matrix/quaternion operation instructions, and we'll continue to see speed improvements, possibly doubling for a while, as pipelines are redesigned to speed things up. That, or specialized FPGA blocks, as the transistor count would allow one to program their own cores on a processor for running calculations or whole algorithms. I digress. I don't think the limit, if/when we reach it, will be a big deal. There are so many avenues to research that we probably won't see it in our lifetime.
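The "multiply many for the cost of one" idea is already visible in today's SIMD extensions. A minimal SSE sketch (mine, assuming an x86 compiler with xmmintrin.h and a count that's a multiple of 4): one mulps instruction multiplies four floats at a time.

#include <xmmintrin.h>

void Scale4(float* data, int count, float k) // assumes count % 4 == 0
{
  __m128 factor = _mm_set1_ps(k);
  for (int i = 0; i < count; i += 4)
  {
    __m128 v = _mm_loadu_ps(data + i); // load 4 floats
    v = _mm_mul_ps(v, factor);         // one instruction, 4 multiplies
    _mm_storeu_ps(data + i, v);        // store 4 floats
  }
}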

