
The Next Huge Leap in Computing Power?

Started by April 09, 2014 04:21 AM
43 comments, last by Prefect 10 years, 4 months ago

Is there a standard emerging that indicates what parallel / concurrency programming will look like in the future?

From the Parallella discussions I got the impression that a lot of effort has to go into very tailored concepts / solutions.

Will programmers need to get good at designing for concurrency ... or are there concepts that might put the parallel execution under a hood of some kind ... like automatic shared-memory multiprocessor systems for example? Or are those inherently inefficient?

Given enough eyeballs, all mysteries are shallow.

MeAndVR

I'd also like to point out that they are designed to solve completely different problems. A CPU is optimised for solving a variety of sequential problems, while the GPU is optimised for solving an enormous number of identical, parallelisable problems.


This.

GPUs are fast at what they do because they can run 32 or 64 threads in lock step, AND many such groups at once to hide latency. A high-end GPU today keeps thousands of threads in flight to keep the beast fed, which makes them very good at embarrassingly parallel tasks and latency hiding, but... erm... that's about it.

With 'branchy' code, or code below a certain threshold of parallelism, CPUs become better suited to the problem, as they can run a single branchy thread fast. Not to say we couldn't do better here; out-of-order execution and intelligent prefetching hideth a multitude of programmer sins, after all. But a CPU is still better than launching a single wavefront on a single CU of a GPU just to use a single thread of that wavefront to do some work.

Really, what we also need is a third class of processor in the same vein as the SPU; something which can chew through data quickly but can branch reasonably well and doesn't require loads of threads to keep it busy. Certain workloads would suit this kind of processor nicely, as they did in the PS3 days, without all the overhead of the CPU 'guessing' or the GPU launching more threads than required.

But even without that third class, 'the future', as it were, is already here with CPUs that have iGPUs attached; really what is needed is for them to become usable as plain ALU arrays without a display attached, so software can begin making better use of them.

The final problem, however, is what it has been for some years now: memory.
Memory is too slow by clock-cycle standards, with L1 and L2 taking anywhere from 3 to the mid-to-high teens of cycles to fetch from, and god forbid you miss the last level of cache and outfox the prefetcher, because hundreds of cycles then go missing while you stall for data.

The next leap, or at least improvement, really needs to come from the memory side of things, because that is where the biggest bottlenecks are forming these days: we are drowning in ALU power, we just can't get the stuff to work on into the right place at the right time.
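To put a number on those hundreds of cycles, here's a minimal, self-contained C++ sketch (my own illustration, nothing from this thread) that measures dependent-load latency by chasing pointers through a permutation far bigger than the cache; because every address depends on the previous load, the prefetcher can't help and each hop costs roughly a full memory round trip:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main()
{
    // A table much larger than any last-level cache (~128 MB of 64-bit indices).
    const std::size_t N = std::size_t(1) << 24;
    std::vector<std::size_t> next(N);
    std::iota(next.begin(), next.end(), std::size_t(0));

    // Sattolo's algorithm: build one big random cycle so the walk visits every
    // element exactly once and the hardware prefetcher can't guess the next address.
    std::mt19937_64 rng(42);
    for (std::size_t k = N - 1; k > 0; --k)
    {
        std::uniform_int_distribution<std::size_t> pick(0, k - 1);
        std::swap(next[k], next[pick(rng)]);
    }

    // Each iteration is a load whose address depends on the previous load,
    // so the loop runs at roughly one memory round trip per iteration.
    std::size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t hop = 0; hop < N; ++hop)
        i = next[i];
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / double(N);
    std::printf("~%.1f ns per dependent load (final index %zu)\n", ns, i);
    return 0;
}

On a typical desktop that prints something in the ballpark of 70-100 ns per hop, i.e. a few hundred cycles, versus a handful of cycles when the same table fits in L1/L2.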

The GPU will never be used for I/O operations

Depends on the kind of I/O you're talking about ;) Modern GPUs have a dedicated asynchronous DMA controller which, in a hUMA system, could be used by the CPU to implement an asynchronous memcpy operation. Some of them can even do higher-level processing as they move data around -- such as read a blob of bytes from this mapped address, run JPEG decompression on those bytes, and write the resulting pixels to this other mapped address. Streaming linearly-indexed pixel arrays into a destination that uses tiled indexing is another complex I/O op that the DMA controller might be able to do for free.
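As a rough illustration of the calling pattern being described (not a real driver API; std::async here is just a stand-in for the GPU's DMA engine on a hUMA system), an asynchronous memcpy from the CPU's point of view might look something like this:

#include <cstddef>
#include <cstring>
#include <future>

// Hypothetical sketch: on real hardware the copy would be queued to the GPU's
// DMA engine; a worker thread is used here only to show the calling pattern.
std::future<void> async_memcpy(void* dst, const void* src, std::size_t bytes)
{
    return std::async(std::launch::async,
                      [=] { std::memcpy(dst, src, bytes); });
}

void stream_example(char* dst, const char* src, std::size_t bytes)
{
    auto done = async_memcpy(dst, src, bytes);

    // ... the CPU is free to do other useful work while the copy is in flight ...

    done.wait();   // block only at the point where the copied data is actually needed
}

The interesting part is that on a DMA engine the copy (or the JPEG decode, or the tiling swizzle) costs the CPU and the GPU's compute units essentially nothing.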

With 'branchy' code, or code below a certain threshold of parallelism, CPUs become better suited to the problem, as they can run a single branchy thread fast.

Regarding branchy code - modern GPUs can almost branch for free a lot of the time now (the branch setup happens in parallel with your ALU work, so usually you'd be bottlenecked by ALU and get the branch setup at no extra cost) -- with the obvious caveat that if half of a SIMD vector takes one path and the other half takes the other path, then you've got to execute both. This is the same as with SSE or AVX code on the CPU though (e.g. in ispc, which is a really cool language BTW), except that the GPU probably handles these cases better, and that your CPU is probably 4, 8 or 16 lanes wide, while your GPU is probably 32 or 64 lanes wide.
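To make the "both halves execute" point concrete with CPU-side SIMD (my own sketch, plain AVX intrinsics rather than ispc), a per-lane branch compiles down to evaluating both paths and blending by a mask, which is essentially what a GPU does when a wavefront diverges:

#include <immintrin.h>

// Per-lane "if (x > 0) y = 2*x; else y = -x;" across 8 floats.
// Both paths are evaluated for every lane and a mask selects the result,
// just as a diverged GPU wavefront has to execute both sides of the branch.
__m256 branch_both_ways(__m256 x)
{
    __m256 zero      = _mm256_setzero_ps();
    __m256 take_then = _mm256_cmp_ps(x, zero, _CMP_GT_OQ);       // lanes where x > 0
    __m256 then_path = _mm256_mul_ps(x, _mm256_set1_ps(2.0f));   // computed for all lanes
    __m256 else_path = _mm256_sub_ps(zero, x);                   // also computed for all lanes
    return _mm256_blendv_ps(else_path, then_path, take_then);    // per-lane select
}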

Really, what we also need is a third class of processor in the same vein as the SPU; something which can chew through data quickly but can branch reasonably well and doesn't require loads of threads to keep it busy.

No consumer PC will have one... but does the Xeon Phi (Larrabee) fit the bill? It sounds like an x86 version of the SPEs, with way more cores, more memory per core, 16-wide AVX and 4 HW-threads per core.

Is there a standard emerging that indicates what parallel / concurrency programming will look like in the future?
From the Parallella discussions I got the impression that a lot of effort has to go into very tailored concepts / solutions.

Will programmers need to get good at designing for concurrency ... or are there concepts that might put the parallel execution under a hood of some kind ...

A skill-set that is relevant now and into the future is writing functional-style code -- this doesn't mean you have to run off and learn Haskell (I sure haven't!); you could keep using C, or pretty much any language, as long as you can get into a situation where you always know exactly what ranges of data are being used as inputs, and what ranges of data are being used as outputs, at any one time.
Pure functions are implicitly thread-safe (assuming their input and output buffers aren't also being written by another thread, of course).
Structuring your code as a big directed acyclic graph of [Inputs -> Process -> Output] nodes means that it will be portable to pretty much any kind of parallel CPU. If your game frame is made up of thousands (or tens of thousands) of these little jobs, then you can run your game on anything from a single core up to a 32-core machine (about the best a consumer can find at the moment) and it will keep scaling.
Pretty much every game engine I've used in the PS3/360 generation has at least started to transition to supporting this style of processing, as it will run well on old desktops, new desktops, the PS3's SPUs (assuming sizeof(inputs)+sizeof(outputs) is < ~128KB...), the 360's tri-core, and next-gen consoles.
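A minimal sketch of what one of those jobs can look like in C++ (names and the toy kernel are purely illustrative, not from any particular engine): because each job declares exactly which range it reads and which disjoint range it writes, a scheduler can hand non-overlapping jobs to however many cores or SPUs happen to exist.

#include <cstddef>

// Illustrative job in the [Inputs -> Process -> Output] style described above.
struct Job
{
    const float* input;    // read-only range
    float*       output;   // write-only range, disjoint from every other job's output
    std::size_t  count;
};

// A "pure" kernel: the result depends only on the declared inputs, so the job
// is implicitly thread-safe as long as nothing else writes those buffers.
void scale_job(const Job& job, float factor)
{
    for (std::size_t i = 0; i < job.count; ++i)
        job.output[i] = job.input[i] * factor;
}

The scheduler's only duty is to respect the edges of the DAG (a node runs once its inputs are ready) and then fan ready nodes out to worker threads; the kernels themselves don't care whether there is one worker or thirty-two.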

This is also how pretty much all graphics and compute programming works on GPUs - with the extra requirement that your processes should map well to a SIMD processor... but to get the most out of a modern CPU your workloads would ideally all map well to SIMD as well - it's just not quite as important as with a GPU.

Quantum computing would of course be useful, in a generic, all-embracing sort of way, but I don't think we'll be seeing quantum coprocessors anytime soon. Most likely the tech will start out slow, expensive and flimsy, just like computers were in the 1950s (quantum computing isn't even properly born yet; it's trying to be, but there are still colossal engineering challenges to overcome). If one day it becomes as accessible and user-friendly as, say, a modern GPU, then there would definitely be applications in many different areas of software development and computer science in general, and once the tech is mature some problems would naturally be very well suited to quantum computing and would see a huge speedup. But of course quantum computing in and of itself is not a substitute for classical computing.

I'm thinking the future is probably going to tend towards a unified software/hardware interface for high-speed serial processors (CPUs), massively parallel compute units (GPUs), and any other hardware devices, which will be able to interoperate with one another without having to go through a centralized control unit like we have now. Or something. The point is that these hardware devices are not equivalent: they solve relatively different problems, and they work well as a hybrid solution but are no good individually, so it makes sense to capitalize on this and try to bring them closer together rather than try to turn one into the other.

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”

Depends on the kind of I/O you're talking about ;) Modern GPUs have a dedicated asynchronous DMA controller which, in a hUMA system, could be used by the CPU to implement an asynchronous memcpy operation. Some of them can even do higher-level processing as they move data around -- such as read a blob of bytes from this mapped address, run JPEG decompression on those bytes, and write the resulting pixels to this other mapped address. Streaming linearly-indexed pixel arrays into a destination that uses tiled indexing is another complex I/O op that the DMA controller might be able to do for free.


Your example with the JPEG decompression is interesting, but I see one flaw. The JPEG file on the storage medium is accessed sequentially (assuming you wish to decompress it while streaming it from disk), so there is no benefit in doing parallel decompression. That is of course assuming that access speed to the storage medium is far slower than the GPU, which is currently the case.

If the compressed JPEG file were already loaded in memory, then yes, a hUMA system would greatly benefit from GPU-accelerated decompression. One might even go so far as to say that decompressing it on the fly during every refresh of the screen is more efficient than keeping the decompressed form in memory, i.e. saving memory at the cost of some speed.

My original quote was directed more towards I/O operations such as access to storage mediums, keyboard/mouse/joypad input, etc.
"I would try to find halo source code by bungie best fps engine ever created, u see why call of duty loses speed due to its detail." -- GettingNifty

My original quote was directed more towards I/O operations such as access to storage mediums, keyboard/mouse/joypad input, etc.

And yet they perform I/O quite well. I/O is not limited to those few devices. Many devices are primarily output: printers, sound cards, radios, haptic devices, and yes, graphics. You probably have a half dozen output-only I/O systems on your computer right now, probably with at least 3 in use.

The newest video cards supporting 4K video or DisplayPort protocols can handle about 6 GB/s of output (roughly 50 Gbit/s, if you prefer). As far as I/O devices go, video cards are usually the fastest I/O devices on a computer.


Your example with the JPEG decompression is interesting, but I see one flaw. The JPEG file on the storage medium is accessed sequentially (assuming you wish to decompress it while streaming it from disk), so there is no benefit in doing parallel decompression. That is of course assuming that access speed to the storage medium is far slower than the GPU, which is currently the case.

It doesn't use the GPU's parallel compute units (multi-core/SIMD); the work is done by the DMA unit.
It's parallel in the sense that the CPU can request a JPEG to be loaded into a pixel array, and the DMA unit does the work asynchronously. The CPU can go off and do other useful work in the meantime while the data is being streamed from disk into the pixel array (with the actual JPEG decompression logic incurring zero cost on either the CPU or the GPU's compute/graphics units).


phantom, on 09 Apr 2014 - 2:05 PM, said:

Really, what we also need is a third class of processor in the same vein as the SPU; something which can chew through data quickly but can branch reasonably well and doesn't require loads of threads to keep it busy.

No consumer PC will have one... but does the Xeon Phi (Larrabee) fit the bill? It sounds like an x86 version of the SPEs, with way more cores, more memory per core, 16-wide AVX and 4 HW-threads per core.

Couldn't the Xena coprocessor on the Amiga X1000 also be used in this fashion, although it's nowhere near as powerful?

I know Euclideon isn't a very popular company here, but I personally believe they are most likely the "Next Huge Leap". They have demonstrated that their tech is real for geospatial applications, and more than 15 companies have signed on to use their technology, so it isn't fake. When it comes to the gaming world, or more appropriately IF it comes to the gaming world, I think that will be the biggest leap we'll see.

The second thing I believe could be the "Next Huge Leap" would be stacked chips. The huge divide mobile created between what phones/tablets need and what PCs need is being blurred by SoCs more every day. Tegra, A-Series APUs, Atom, etc. are all being designed for a wider range of applications now. If stacking chips truly becomes the route everyone takes, then every device sees improvement, without such a divide anymore.

The third thing I believe could be the "Next Huge Leap" would be the cloud. If stacking chips becomes the route to take, which I believe almost everyone agrees with, then the switch to more CPU-bound processing would benefit greatly from the cloud.

The fact that Microsoft built DX12 and apparently focused more on CPU utilization than on the GPU kind of implies that may be the route a lot of people take in the future. If they believe the cloud will work hand in hand with everyday use, that could be a huge leap.

Personally, I think that if Euclideon's tech is real (they've stated it runs entirely on the CPU), then combining it with stacked chips for CPUs, and throwing in cloud computing for good measure to offload CPU tasks to, would be not just a huge leap, but the "Ultimate Leap".


I know Euclideon isn't a very popular company here, but I personally believe they are most likely the "Next Huge Leap".

Meh, they still haven't actually released anything. Their "unlimited detail" as described could easily be implemented using CLOD techniques developed decades ago. Preprocess the voxels into view-dependent bricks and load them up when needed. Their first demos were in 2003, back when preprocessed CLOD and geo-morphing were still the best solution, as programmable 3D cards had only just been released. These days the preprocessed tree generates 3D textures with a fancy shader and accomplishes the same thing.
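For the curious, the view-dependent selection being described boils down to picking the precomputed brick/LOD level whose voxels project to roughly one pixel. A hedged sketch (the names and error metric are my own illustration, not Euclideon's algorithm):

#include <cmath>
#include <cstdint>

// Pick the coarsest precomputed level whose voxels still cover about one pixel.
// Level 0 is the finest; each coarser level doubles the voxel size.
std::uint32_t select_brick_level(float finestVoxelSize,      // world units
                                 float distanceToCamera,     // world units
                                 float verticalFovRadians,
                                 float screenHeightPixels)
{
    // World-space extent that maps to a single pixel at this distance.
    float worldPerPixel = 2.0f * distanceToCamera *
                          std::tan(0.5f * verticalFovRadians) / screenHeightPixels;

    float ratio = worldPerPixel / finestVoxelSize;
    if (ratio <= 1.0f)
        return 0;                                    // close up: stream the finest bricks
    return static_cast<std::uint32_t>(std::floor(std::log2(ratio)));
}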

Maybe once they actually release a product we can see how awesome it is. Until then it remains vaporware ... 11 years after being announced.

This topic is closed to new replies.
