The GPU will never be used for I/O operations
Depends on the kind of I/O you're talking about. Modern GPUs have a dedicated asynchronous DMA controller which, in a HUMA system, could be used by the CPU to implement an asynchronous memcpy operation. Some of them can even do higher-level processing as they move data around -- such as read a blob of bytes from this mapped address, run JPEG decompression on those bytes, and write the resulting pixels to this other mapped address. Streaming linearly-indexed pixel arrays into a destination that uses tiled indexing is another complex I/O op that the DMA controller might be able to do for free.
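To make that tiled-indexing example concrete, here's a rough CPU-side sketch of the address remap such a copy engine could do in hardware as it streams the data. The 8x8 tile size and tile layout here are purely illustrative assumptions, not any particular GPU's scheme:

```cpp
#include <cstdint>

// Copy a linearly-indexed (row-major) pixel array into a destination that
// stores pixels in 8x8 tiles. A DMA engine with swizzle-on-write support
// could perform this remap while copying, at no ALU cost to the CPU or GPU.
// Tile size and layout are illustrative assumptions only.
static const int kTileDim = 8;

void copy_linear_to_tiled(const uint32_t* src, uint32_t* dst,
                          int width, int height)
{
    const int tilesPerRow = width / kTileDim;   // assumes width % 8 == 0
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            const int tileX = x / kTileDim, tileY = y / kTileDim;
            const int inTileX = x % kTileDim, inTileY = y % kTileDim;
            // Destination index: whole tiles laid out linearly,
            // pixels inside each tile stored row-major.
            const int tileIndex = tileY * tilesPerRow + tileX;
            const int dstIndex  = tileIndex * kTileDim * kTileDim
                                + inTileY * kTileDim + inTileX;
            dst[dstIndex] = src[y * width + x];
        }
    }
}
```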
With 'branchy' code, or code below a certain threshold of parallelism, CPUs become better suited to the problem, as they can run a single branchy thread fast.
Regarding branchy code -- modern GPUs can branch almost for free a lot of the time now (the branch setup happens in parallel with your ALU work, so usually you'll be bottlenecked by ALU and get the branch setup at no extra cost) -- with the obvious caveat that if half of a SIMD vector takes one path and the other half takes the other path, then you've got to execute both. This is the same as with SSE or AVX code on the CPU though (e.g. in ispc, which is a really cool language BTW), except that the GPU probably handles these cases better, and your CPU is probably 4, 8 or 16 lanes wide while your GPU is probably 32 or 64 wide.
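If it helps to picture the divergence cost, here's a scalar C++ emulation of how a SIMD machine handles a varying branch: when lanes disagree, both sides run and a mask selects which lanes keep each result. The 8-wide lane count is just an illustrative choice:

```cpp
#include <array>

// Scalar emulation of masked SIMD execution across one 8-wide vector.
// When the per-lane condition isn't uniform, both branch bodies execute,
// which is what a GPU wavefront or an ispc varying branch does under the hood.
constexpr int kLanes = 8; // illustrative; GPUs are typically 32 or 64 wide

void run_branch(const std::array<float, kLanes>& x,
                std::array<float, kLanes>& out)
{
    std::array<bool, kLanes> mask;
    for (int i = 0; i < kLanes; ++i) mask[i] = (x[i] > 0.0f);

    // "Then" side: runs for the whole vector, results kept where mask is true.
    for (int i = 0; i < kLanes; ++i)
        if (mask[i]) out[i] = x[i] * 2.0f;

    // "Else" side: also runs (unless the mask was all-true), results kept
    // where mask is false. A divergent vector pays for both loops.
    for (int i = 0; i < kLanes; ++i)
        if (!mask[i]) out[i] = -x[i];
}
```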
Really, what we also need is a third class of processor along the same lines as the SPU: something which can chew through data quickly, can branch reasonably well, and doesn't require loads of threads to keep it busy.
No consumer PC will have one... but does the Xeon Phi (Larrabee) fit the bill? It sounds like an x86 version of the SPEs, with way more cores, more memory per core, 16-wide AVX and 4 HW-threads per core.
Is there a standard emerging that indicates what parallel/concurrent programming will look like in the future?
From the Parallella discussions I got the impression that a lot of effort has to go into very tailored concepts and solutions.
Will programmers need to get good at designing for concurrency... or are there concepts that might hide the parallel execution under the hood of some abstraction?
A skill-set that is relevant now and into the future is writing functional-style code -- this doesn't mean you have to run off and learn Haskell (I sure haven't!). You can keep using C, or pretty much any language, as long as you get into a situation where you always know exactly which ranges of data are being used as inputs, and which ranges of data are being used as outputs, at any one time.
Pure functions are implicitly thread-safe (assuming their input and output buffers aren't also being written by another thread, of course).
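In C/C++ terms that just means functions which read only their explicit input range, write only their explicit output range, and touch nothing else, so different threads can work on disjoint ranges with no locks. A minimal sketch (the function name and data are made up for illustration):

```cpp
#include <cstddef>

// A "pure" job in the sense above: reads only [in, in+count), writes only
// [out, out+count), no globals, no hidden state. Two threads can run this
// over disjoint ranges at the same time with no locking.
void scale_values(const float* in, float* out, size_t count, float scale)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = in[i] * scale;
}

// e.g. thread A: scale_values(in,         out,         n / 2,     2.0f);
//      thread B: scale_values(in + n / 2, out + n / 2, n - n / 2, 2.0f);
```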
Structuring your code as a big directed acyclic graph of [Inputs -> Process -> Output] nodes means that it will be portable to pretty much any kind of parallel CPU. If your game frame is made up of thousands (or tens of thousands) of these little jobs, then you can run your game on anything from a single core up to 32 cores (about the best a consumer can find atm) and it will keep scaling.
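As a rough sketch of what one of those nodes looks like in code -- the struct and field names are hypothetical, every job system lays this out a little differently:

```cpp
#include <cstddef>
#include <vector>

// One node in the frame's job graph: explicit input/output ranges plus a
// list of jobs that must finish first. A scheduler (not shown) can run any
// job whose dependencies are complete, on however many cores exist --
// 1 or 32, the graph stays the same.
struct Job
{
    const void* inputs;        // read-only data this job consumes
    void*       outputs;       // buffer only this job writes
    size_t      inputSize;
    size_t      outputSize;
    void      (*process)(const void* in, void* out); // the pure function
    std::vector<Job*> dependencies;                   // edges of the DAG
};
```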
Pretty much every game engine I've used in the PS3/360 generation has at least started to transition to supporting this style of processing, as it will run well on old desktops, new desktops, the PS3's SPUs (assuming sizeof(inputs)+sizeof(outputs) is < ~128KB...), the 360's tri-core, and next-gen consoles.
This is also how pretty much all graphics and compute programming works on GPUs -- with the extra requirement that your processes should map well to a SIMD processor... but to get the most out of a modern CPU, your workloads would ideally all map well to SIMD as well -- it's just not quite as important as it is on a GPU.
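The usual trick for making those process nodes map well to SIMD on either kind of processor is a structure-of-arrays layout, so the inner loop is a straight run over each field. A small sketch under that assumption -- the same shape you'd write in ispc or a compute shader, and most compilers will happily auto-vectorise it:

```cpp
#include <cstddef>

// Structure-of-arrays layout: each field is contiguous, so the inner loop
// maps directly onto SIMD lanes (SSE/AVX on the CPU, wavefronts on the GPU).
struct Particles
{
    float* px; float* py; float* pz;   // positions
    float* vx; float* vy; float* vz;   // velocities
    size_t count;
};

void integrate(Particles& p, float dt)
{
    for (size_t i = 0; i < p.count; ++i)
    {
        // Same arithmetic on every element, no per-element branching:
        // friendly to both CPU auto-vectorisation and GPU execution.
        p.px[i] += p.vx[i] * dt;
        p.py[i] += p.vy[i] * dt;
        p.pz[i] += p.vz[i] * dt;
    }
}
```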