Quote:
Original post by Anonymous Poster
Well, I don't know what Kylotan was referring to exactly, but one obvious example is that on modern GPUs the ideal ratio of arithmetic ops to texture ops is 3:1 or greater (and I read the average for shader programs is now around 5:1, so this is only going to get more extreme, and may have already), while many physics algorithms (solvers, relaxation techniques) approach 1:1.
Good remark. But for GPUs that ratio refers to texture accesses, not generic memory accesses. Texture sampling units are expensive in terms of chip area, so for example the Radeon X1900 has 48 shader units and 'only' 16 texture samplers. But nothing prevents GPUs from adding more channels for reading directly from memory. In fact, that's what Direct3D 10's 'load' instruction is for. Also look at Direct3D 9's constant registers: they could be considered an early form of memory access, and I'm sure they're accessed at a 1:1 ratio on average.
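To make that quoted ratio a bit more concrete, here's a minimal C++ sketch of my own (not taken from any real physics middleware) of a Jacobi-style relaxation sweep. Almost every arithmetic operation is paired with a memory read, which is the roughly 1:1 behaviour physics solvers tend to show, so low-latency direct memory reads matter far more here than in a typical pixel shader:

#include <vector>
#include <cstddef>

// Illustration only: one Jacobi-style relaxation sweep over a 1D grid of
// values (think pressure or constraint error). Each output element needs
// two neighbour loads and only a couple of arithmetic ops, so the
// arithmetic-to-memory ratio sits near 1:1, unlike a typical pixel
// shader, which does far more math per texture fetch.
void relaxation_sweep(const std::vector<float>& in, std::vector<float>& out)
{
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
    {
        float left  = in[i - 1];            // memory read
        float right = in[i + 1];            // memory read
        out[i] = 0.5f * (left + right);     // two arithmetic ops, one write
    }
}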
Quote:
In addition, the memory access patterns are very different... to get optimal performance out of a GPU you need fairly coherent memory access patterns (texture coords don't usually jump around very much), whereas in physics you have no guarantee of such a pattern; in fact you often need very incoherent memory access.
As far as I know, latencies for texture sampling are already much higher than for arithmetic instructions, and GPUs know how to deal with that. In fact, the Radeon X1900 uses a technique called Ultra-Threading: while one batch of pixels is waiting on a long-latency instruction, other pixels can execute their arithmetic instructions.
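That switching idea is easy to sketch in plain C++. The little toy model below is my own analogy, not ATI's actual scheduler, but it shows how a single ALU can stay busy by picking whichever batch already has its data back:

#include <array>
#include <cstdio>

// Toy model of Ultra-Threading-style latency hiding (my own analogy, not
// ATI's real scheduler). Four batches of pixels each issue a long-latency
// fetch and then have some arithmetic to do. A single ALU round-robins over
// the batches and only runs arithmetic for a batch whose fetch has already
// returned, so it rarely sits idle waiting for memory.
struct Batch
{
    int fetchReadyAt = 0;   // cycle at which the outstanding fetch returns
    int aluWorkLeft  = 4;   // arithmetic instructions still to execute
};

int main()
{
    const int kFetchLatency = 8;
    std::array<Batch, 4> batches;

    // Every batch issues its fetch up front, staggered by one cycle.
    for (int i = 0; i < 4; ++i)
        batches[i].fetchReadyAt = i + kFetchLatency;

    for (int cycle = 0; cycle < 32; ++cycle)
    {
        for (int i = 0; i < 4; ++i)
        {
            Batch& b = batches[i];
            if (cycle >= b.fetchReadyAt && b.aluWorkLeft > 0)
            {
                --b.aluWorkLeft;
                std::printf("cycle %2d: ALU executes for batch %d\n", cycle, i);
                break;  // one ALU issue per cycle in this toy model
            }
        }
    }
    return 0;
}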
There are few details about PhysX's approach to hiding memory access latency. What's known is that it doesn't use a cache, so in the best case it uses something similar to ATI's Ultra-Threading (this would explain PhysX's 'internal memory').
Also, let's not forget that for direct memory accesses, Direct3D 10 cards will in the worst case only have to deal with memory latencies (1.2 ns nowadays), not sampler latencies. So they'll definitely be able to handle physics processing efficiently.
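To tie that back to the quote about access patterns, here's a made-up C++ sketch of the kind of gather a constraint solver does. The body indices can point anywhere in the array, so the reads jump around memory, and that's exactly the case where a plain low-latency 'load' path helps more than a texture sampler would:

#include <vector>

// Hypothetical sketch of why solver memory access tends to be incoherent.
// Each constraint stores the indices of the two bodies it connects, and
// those indices can point anywhere in the body array, so consecutive
// iterations rarely touch adjacent memory the way consecutive texels do.
struct Body       { float px, py, pz, invMass; };
struct Constraint { int a, b; };   // indices of the two bodies involved

void apply_constraints(std::vector<Body>& bodies,
                       const std::vector<Constraint>& constraints)
{
    for (const Constraint& c : constraints)
    {
        Body& a = bodies[c.a];     // scattered read/write
        Body& b = bodies[c.b];     // scattered read/write

        // Placeholder 'solve': pull both bodies toward their midpoint.
        float mid = 0.5f * (a.px + b.px);
        a.px += (mid - a.px) * a.invMass;
        b.px += (mid - b.px) * b.invMass;
    }
}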
Quote:
There's a big difference between whether a program can run on a platform and whether it can run efficiently. All the theoretical FLOPS in the world won't help you if your processor is waiting for data.
I couldn't agree more. GPUs definitely have powerful SIMD units well suited for physics, but they need to be kept busy by feeding them data efficiently. Now, nobody knows exactly what next-generation GPUs will look like, but I think we would gravely underestimate them if we assumed they won't have at least the same capabilities as a PPU. Direct3D 10 has been in development for many years now, and there are strong indications that R600 and G80 will have radically new architectures. The Radeon X1900 and Xenos probably give the best view of where the future is headed.
By the way, memory access is one of the reasons why I wouldn't rule out multi-core CPUs for reasonably efficient physics processing. They don't have as many SIMD units (although Core 2 is extremely impressive compared to Pentium 4), but they have a high clock frequency and, most of all, very low-latency memory access thanks to large caches.
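As a rough illustration of what I mean (just a sketch of my own, with made-up function names), an SSE loop like the one below integrates four particle positions per instruction while streaming linearly through arrays that fit nicely in cache:

#include <xmmintrin.h>   // SSE intrinsics

// Rough sketch of CPU-side SIMD physics: explicit Euler integration of
// particle positions, four floats per instruction. The arrays are walked
// linearly, so large caches and the hardware prefetcher keep the SIMD
// units fed. Assumes 'count' is a multiple of 4 and 16-byte aligned data.
void integrate(float* pos, const float* vel, int count, float dt)
{
    const __m128 vdt = _mm_set1_ps(dt);
    for (int i = 0; i < count; i += 4)
    {
        __m128 p = _mm_load_ps(pos + i);
        __m128 v = _mm_load_ps(vel + i);
        p = _mm_add_ps(p, _mm_mul_ps(v, vdt));   // p += v * dt
        _mm_store_ps(pos + i, p);
    }
}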