Note: this is going to be a high-level argument because I believe the technical details aren't necessary to prove this point.
I personally don't think it's fair to assume that going from a dual core to a quad core will simply double the processing power. If cache takes up one of the biggest percentages of the die, fitting twice as much cache and two more cores onto the chip... isn't going to happen. In the end you're going to have 4 cores causing more memory misses. Plus this has a limit too. Once they reach about 8 cores, adding more than that will only hinder the processor due to scheduling access to shared hardware, synchronization and idle cores.
Also, you keep asking "was the code fully optimized?" about every demo that is presented here. At the same time, the hardware for the PPU is in its first years of development. When that is 'fully optimized' and improvements are made based on developer feedback and analysis, it is going to outperform the CPU by an even larger margin, without the need to spend hours trying to make the code as optimized as it can get.
Saying a CPU can be as fast as a dedicated piece of hardware makes no sense. A simple example is the DES cracking algorithms. You can run those in software... sure. But you won't get anywhere near as fast as you would using the specifically designed chips that do it. I know this is a far-fetched example, but it's a simple matter of designing hardware to take advantage of the specifics of what you want it to do. You simply can't optimize a CPU to do that because it needs to be general. You agree that the GPU is better than the CPU because of certain aspects of its architecture, but that exact same argument can be made for why the PPU is better than the CPU for doing physics. I feel like the CPU loses hands down based on your own argument for why the GPU is necessary in the first place.
As far as using a graphics card to do physics... yes, you can probably do it. But you're just hindering both aspects of the game. The graphics will become worse because the card will have to a) take on the extra load of the physics calculations (which will only increase as games get better) and b) keep the hardware even MORE general than what would be necessary for the unified shaders (which, by the CPU argument, we have seen is a bad thing). The physics will become worse because it will have to share the graphics hardware (graphics being the primary goal and function of the device) and will not be able to be optimized. As these fields progress, if they can remain separate, the architectures and ways of solving both graphics and physics problems will probably change. BUT they won't be changing in the same direction. The PPU looks an awful lot like a GPU because that is the starting point and the safest place to begin development of a PPU. Over the years, if given the chance, I bet the PPU will end up looking different from the GPU... and both will be much better and MUCH more optimized because of it.
The price IS a good point, and I agree that at the moment it's not the most cost-effective solution. But that has nothing to do with the fact that the PPU is better than the CPU, and that putting physics on a GPU would only hinder the development of both graphics and physics hardware in the future.
I really like the PPU idea... I just wish it were more cost-effective :'(
Quote: Original post by Anonymous Poster
Can you break down those stats?
How can a Pentium duo perform at 48 GFLOPS when all the benchmarks I've ever seen say that at best a dual Pentium can do 5 to 8?
I wasn't talking about a dual-core Pentium 4 but a Core 2 Duo: 3 GHz × 2 cores × 2 SSE units × 4 floats per instruction gives a theoretical maximum of 48 GFLOPS.
Just to illustrate that this is revolutionary and totally breaks with the past: a Pentium 4 has only 1 SSE unit that can process 2 floats per clock cycle. And if you consider that Core 2 Duo aims to bring dual cores to the mainstream market at very affordable prices, I think it's reasonable to say that peak floating-point processing power has become eight times higher from one generation to the next. OK, in practice it's hard to reach that peak performance, but even then, this is nothing short of a revolution. Forget Pentium; in a few months Intel will deny that it ever existed. ;-)
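If you want to check that arithmetic yourself, here's a tiny back-of-the-envelope sketch; the per-core figures are the assumptions stated above (plus simplified Pentium 4 numbers), not official Intel specifications.

#include <cstdio>

// Theoretical peak single-precision FLOPS = clock * cores * SSE units per core
// * floats processed per SSE instruction. Purely a paper number; sustained
// performance is far harder to reach, as noted above.
static double peak_gflops(double clock_ghz, int cores, int sse_units, int floats_per_op)
{
    return clock_ghz * cores * sse_units * floats_per_op;
}

int main()
{
    std::printf("Core 2 Duo: %.0f GFLOPS\n", peak_gflops(3.0, 2, 2, 4)); // 48
    std::printf("Pentium 4 : %.0f GFLOPS\n", peak_gflops(3.0, 1, 1, 2)); // 6, hence the eightfold jump
    return 0;
}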
Quote: Even a Xeon 2.4 can only do 300 megaflops, peaking at around 2.2 GFLOPS with extremely optimized, tight SSE loops and 100% of the data in cache.
Yes, it very much depends on the application and how it was optimized. In my experience a highly parallel task like 3D rendering, optimized with hand-tuned SSE and using specialized data structures, yields a big percentage of that theoretical sustainable performance. So it takes considerable effort, but I've seen applications become several times faster by optimizing just the main hotspots with SSE and revising the data layout for cache efficiency.
Quote: Also, how can you deduce that the PPU is 20 GFLOPS? As far as I know no specifications have been given publicly.
I read that an NVIDIA 6800 can do about 100 GFLOPS.
AGEIA specifies 20 giga-instructions.
Quote: Original post by KulSeran
C0D1F1ED, yes, that was with full SSE instructions for the Vector math.
The code I was playing with
Very nice demo!
Quote: It could be an issue with it not being set up properly, but I don't consider 8 fps all that "real time".
I took the liberty of profiling it with CodeAnalyst, and found practically only scalar SSE code. Also, in the source code I only found vec3_add to use inline assembly (in a very suboptimal way).
So I assume you just used the compiler switch to enable SSE? Unfortunately that doesn't use SSE effectively at all. Automatic vectorization is still in its early development, so you end up with mostly scalar instructions (in fact you have to use the __m128 data type before the compiler can emit vector SSE). Scalar SSE is hardly any better than x87. Also, I notice you use 3-component vectors. This isn't optimal either, considering SSE operates on 4 components. The trick is to reorder your data from 'Array of Structures' to 'Structure of Arrays' (or a hybrid). Here's a tutorial explaining some of the concepts: SSE/SSE2 Toolbox - Solutions for Real-Life SIMD Problems.
There's also good news though. There are hardly any L2 cache misses. This is logical for just 2000 particles but it means that you have great opportunity to make this demo much faster without bumping into cache limitations!
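To make the 'Structure of Arrays' point concrete, here's a minimal sketch of a SoA particle integration step using SSE intrinsics. The layout and names are hypothetical (not taken from the demo's source), and the arrays are assumed to be 16-byte aligned with a particle count that's a multiple of 4.

#include <xmmintrin.h> // SSE intrinsics

// Structure of Arrays: all x components contiguous, all y components contiguous, etc.
// One SSE instruction then updates the same component of four particles at once,
// instead of shuffling a single particle's x/y/z around inside one register.
struct ParticlesSoA
{
    float* x;  float* y;  float* z;   // positions, 16-byte aligned
    float* vx; float* vy; float* vz;  // velocities, 16-byte aligned
};

void integrate(ParticlesSoA& p, int count, float dt) // count assumed to be a multiple of 4
{
    const __m128 vdt = _mm_set1_ps(dt);
    for (int i = 0; i < count; i += 4)
    {
        // position += velocity * dt, four particles per iteration
        __m128 x = _mm_add_ps(_mm_load_ps(p.x + i), _mm_mul_ps(_mm_load_ps(p.vx + i), vdt));
        __m128 y = _mm_add_ps(_mm_load_ps(p.y + i), _mm_mul_ps(_mm_load_ps(p.vy + i), vdt));
        __m128 z = _mm_add_ps(_mm_load_ps(p.z + i), _mm_mul_ps(_mm_load_ps(p.vz + i), vdt));
        _mm_store_ps(p.x + i, x);
        _mm_store_ps(p.y + i, y);
        _mm_store_ps(p.z + i, z);
    }
}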
Quote: As for the GPU memory access issues, I have yet to see a GPU with good response to something like glReadPixels.
It appears to me that the GPU is meant to take data and process it, not take data, process it and return it.
I would appreciate information to the contrary, though, since it would be nice to know things have changed.
The problem with glReadPixels is that it forces the GPU to finish all pending rendering operations, send the data to system RAM, and then wait idle for new data to arrive and fill the pipelines. So what you can do to improve this is supply new data before reading back the results of the previous batch. Getting the most out of it is not an easy task, but there's absolutely no hardware limitation on the actual data transfer here, just a typical synchronization problem. The exact same thing happens with physics hardware (except that it also has limited transfer bandwidth) and with multi-threaded software. Anyway, looking at the Direct3D 10 SDK documentation, there are a lot of interesting features to make this easier, like stream output buffers and virtualized memory.
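As a rough illustration of that "keep the pipeline fed" idea, here's a sketch using two pixel buffer objects so the readback of one frame overlaps with the rendering of the next. It assumes a context with GL 2.1-level entry points (loaded through GLEW, for example); error handling is omitted, and the function names are my own.

#include <GL/glew.h>

static GLuint pbo[2];

void initReadback(int width, int height)
{
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; ++i)
    {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, 0, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

void endOfFrame(int width, int height, int frame)
{
    const int write = frame % 2;       // PBO that receives this frame's pixels
    const int read  = (frame + 1) % 2; // PBO holding the previous frame's pixels

    // Start an asynchronous transfer into the 'write' PBO. With a pack buffer
    // bound, glReadPixels returns immediately instead of stalling the CPU.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[write]);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0);

    // Read back the *previous* frame's data; that transfer has had a whole
    // frame to complete, so mapping it shouldn't drain the pipeline.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[read]);
    if (void* data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY))
    {
        // ... consume 'data' (e.g. hand it to the CPU/physics side) ...
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}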
The days of getting higher performance for free are over. ;-) Well, that's not exactly true, but getting the maximum out of the hardware really takes some effort from the application programmer. We almost take it for granted that drivers and embedded software use highly optimized, hand-tuned assembly code, yet we accept high-level C++ with little care for synchronization as the performance standard for applications.
Just want to throw in something that Richard Huddy said last week at an "Introduction to Direct3D 10" meeting. From his point of view, the G in GPU will no longer stand for graphics only. The new meaning will be "general".
The only advantage I can see in PhysX is the 64-bit MIPS CPU core that is used to run the whole physics engine independently of the main CPU. If you run physics on the GPU, the main CPU still has to generate and send the instructions down to it. But a MIPS core is not that large at all, and it could offload main-CPU work for graphics operations too. Maybe we will see something like this in future GPUs.
Anyway, freeing the main CPU core from large batchable jobs like collision detection is a good direction, but adding another coprocessor like PhysX only makes everything more complex. It would be much better to stay with a limited number of chips and memory blocks.
Quote: Original post by Dabacle
Note: this is going to be a high-level argument because I believe the technical details aren't necessary to prove this point.
Right...
Quote: I personally don't think it's fair to assume that going from a dual core to a quad core will simply double the processing power. If cache takes up one of the biggest percentages of the die, fitting twice as much cache and two more cores onto the chip... isn't going to happen.
Actually, that's exactly what's going to happen. When going from 130 nm technology to 90 nm, the number of transistors that fit into the same die area doubled. When going from 90 nm to 65 nm, again twice as many transistors fit in the same space (each full node shrink scales linear dimensions by roughly 0.7, so the area per transistor roughly halves). So we can expect quad cores on this process without cutting anything out. And Intel is readying 45 nm production as we speak, so compared to 130 nm they'll be able to fit eight times as many transistors on a die, making it suited for octa-cores. That's already on the roadmap.
And actually they're going to increase the number of execution units and cache density as well. AMD's quad-core K8L architecture will (just like Core 2) double the number of floating-point execution units. And a new technology called Z-RAM offers six times the density of SRAM cache memory and lowers latency.
Quote: Saying a CPU can be as fast as a dedicated piece of hardware makes no sense. A simple example is the DES cracking algorithms. You can run those in software... sure. But you won't get anywhere near as fast as you would using the specifically designed chips that do it.
DES is a fixed algorithm, so yes, you can create hardware for it that is way faster than any programmable CPU. But you're forgetting that PhysX is a programmable processor as well. There isn't a fixed physics algorithm to optimize for. If there's any 'algorithm' it is specifically designed for, it's IEEE 754: plain 32-bit floating-point arithmetic, just like a CPU or GPU.
Quote: You agree that the GPU is better than the CPU because of certain aspects of its architecture, but that exact same argument can be made for why the PPU is better than the CPU for doing physics. I feel like the CPU loses hands down based on your own argument for why the GPU is necessary in the first place.
No, because a GPU has dedicated units for texture sampling and rasterization tasks. The PPU has nothing comparable to speed up physics. If a GPU's most powerful instruction is 'tex', then what would be the most powerful instruction of a PPU?
The only real advantage a PPU has over a CPU is its parallelism. But this won't be an advantage for long now the wide issue multi-core revolution has started.
Quote: The price IS a good point, and I agree that at the moment it's not the most cost-effective solution. But that has nothing to do with the fact that the PPU is better than the CPU...
True, it has nothing to do with that. But it does have everything to do with the success of PhysX. If it doesn't become cost-effective soon, then only a minor fraction of gamers will have one, game developers will have wasted effort supporting it and will look for other solutions, and CPU/GPU manufacturers will deliver vastly increased floating-point processing power ready for physics.
Quote: Original post by Dabacle
I personally don't think it's fair to assume that going from a dual core to a quad core will simply double the processing power. If cache takes up one of the biggest percentages of the die, fitting twice as much cache and two more cores onto the chip... isn't going to happen. In the end you're going to have 4 cores causing more memory misses. Plus this has a limit too. Once they reach about 8 cores, adding more than that will only hinder the processor due to scheduling access to shared hardware, synchronization and idle cores.
That's nonsense. For years, companies like HP, Compaq, Cray and many others have been building supercomputers using Pentium 4, AMD and G5 processors by putting several thousand CPUs together. And these computers are measured in teraflops.
What I don't get is why this doomsday fatalism about parallel processing will only affect Intel (the CPU that has cache, the most advanced silicon technology known to man, branch prediction, speculative execution, the fastest clock speeds), while the same idea will benefit a PPU (with none of the above) so much.
Isn't that what IBM did with the Cell?
I think that since Intel invented the CPU, maybe we should cut them some slack and see what they come up with.
Quote:
That's nonsense. For years, companies like HP, Compaq, Cray and many others have been building supercomputers using Pentium 4, AMD and G5 processors by putting several thousand CPUs together. And these computers are measured in teraflops.
Good point, I forgot about those. But those supercomputers aren't the same as a processor with, say, 16 cores. And they only function well on fully parallel problems like partial differentiation/integration, analyzing different outcomes of chess moves, brute-force cracking of encryption, or some other problem that can be broken up into equal tasks that don't rely on each other. Try writing normal software on one of those and it would be a waste of all but a few cores anyway... Plus, when you add multiple CPUs together you get an equal amount of added cache. That linear growth wouldn't happen when adding more cores to a single processor.
Quote:
What I don't get is why this doomsday fatalism about parallel processing will only affect Intel (the CPU that has cache, the most advanced silicon technology known to man, branch prediction, speculative execution, the fastest clock speeds), while the same idea will benefit a PPU (with none of the above) so much.
Isn't that what IBM did with the Cell?
I think that since Intel invented the CPU, maybe we should cut them some slack and see what they come up with.
If you build the hardware from the ground up to do one thing in particular, it will be faster than using general hardware to solve the problem. I agree completely that AMD and Intel are amazing and their ways of improving performance are (in most cases) elegant and beautiful. I'm sure they'll come up with many more advancements, but gaming and physics aren't their priorities. Most likely running Windows and Microsoft Office are. If the CPU companies dedicated a core to physics processing they'd probably make an awesome physics core. But a general core won't be as good. It's the same reason why the GPU isn't on the CPU.
----------------------------Yes, I know Dabacle is spelled wrong! :P
Quote: Original post by Dabacle
Good point, I forgot about those. But those supercomputers aren't the same as a processor with, say, 16 cores.
Octa-cores will still have 8 equivalent cores. And actually they're better than supercomputers because there's lower latency for data transfer between the cores, and main memory is centralized. What's going to happen after that is still undetermined, but what's certain is that transistor density keeps doubling (32 nm, 22 nm, 17 nm, ...), so they'll either double the number of cores again without a problem, or try to use them more effectively with a different architecture (e.g. reverse Hyper-Threading).
Quote: And they only function well on fully parallel problems like partial differentiation/integration, analyzing different outcomes of chess moves, brute-force cracking of encryption, or some other problem that can be broken up into equal tasks that don't rely on each other. Try writing normal software on one of those and it would be a waste of all but a few cores anyway... Plus, when you add multiple CPUs together you get an equal amount of added cache. That linear growth wouldn't happen when adding more cores to a single processor.
It will, thanks to Moore's Law being perfectly safe for at least another decade. Furthermore, PhysX relies on "fully parallel problems" just as much. And writing efficient software (drivers) for it isn't any easier than writing multi-threaded software for the CPU.
And to get back to reverse Hyper-Threading: the idea is to just run X threads on Y execution units (blurring the lines between cores). If there's only one thread, it has all the execution units to itself (it won't be able to use them all effectively, but it will still run significantly faster than on a single core because it's never short of execution units); with multiple threads, any mix of instruction types runs as fast as theoretically possible.
So I already hear you wondering: can't PhysX use the same technology? Yes, it probably can, but not in any reasonable amount of time. Multi-billion-dollar companies like Intel/AMD/NVIDIA/ATI no doubt already have experiments running in their labs to get more out of TLP. A company like AGEIA would need at least five years to create something truly technically advanced, if they can find the R&D capital. Plus they'd need to find a way to reduce the price to about $100 to enter the mainstream market long before that.
I'm sorry for them but it's just not going to happen.
Quote: Octa-cores will still have 8 equivalent cores. And actually they're better than supercomputers because there's lower latency for data transfer between the cores, and main memory is centralized. What's going to happen after that is still undetermined, but what's certain is that transistor density keeps doubling (32 nm, 22 nm, 17 nm, ...), so they'll either double the number of cores again without a problem, or try to use them more effectively with a different architecture (e.g. reverse Hyper-Threading).
Oh, I agree completely. I was just saying that doubling the number of cores doesn't necessarily double the processing power/execution speed the way the supercomputer numbers would suggest, and that the supercomputer examples aren't applicable as real solutions (due to the high latency and their purely parallel nature). I'd also love to see them hit 32 nm; that would be impressive!
Quote: Furthermore, PhysX relies on "fully parallel problems" just as much. And writing efficient software (drivers) for it isn't any easier than writing multi-threaded software for the CPU.
Hmm, yeah, it probably does. But the hardware is designed for it, like the GPU. I'm not well versed in the specifics of the architectures but, again, I was bringing up the parallel nature to explain why supercomputers are extremely fast and also why they wouldn't necessarily speed up a normal system (Windows, Linux, whatever) by those 'teraflops' measurements.
Quote: So I already hear you wondering: can't PhysX use the same technology? Yes, it probably can, but not in any reasonable amount of time. Multi-billion-dollar companies like Intel/AMD/NVIDIA/ATI no doubt already have experiments running in their labs to get more out of TLP. A company like AGEIA would need at least five years to create something truly technically advanced, if they can find the R&D capital. Plus they'd need to find a way to reduce the price to about $100 to enter the mainstream market long before that.
Also makes sense. But I feel like the PhysX chip wouldn't necessarily have to take advantage of the advancements used in a CPU because of its specific nature. The CPU needs to do everything fast... the PPU only needs to do physics calculations fast. I'm sure if I ran a random OS on the PPU it would die... so those optimizations would help the hardware there. But we're talking about a piece of hardware designed to run an API written by the very same people that made the board. I feel like they can optimize the hell out of their code and their board (for physics at least) better than Intel/AMD can, considering Intel/AMD have no clue what is running on their CPU cores. I feel like AGEIA has it easier as far as optimizing hardware/software goes.
Rant really begins now...
I guess I'm defending the PhysX chip because I feel like offloading some specialized functionality is not necessarily a bad idea. When it comes down to it, the people who need the high-end graphics cards (aside from graphic designers/animators) usually need high-end physics too, because they are playing games. If the graphics companies got together with AGEIA and combined the two, that would be great. I wouldn't want the product to be called a graphics card anymore... call it a Gaming Card or something. And I would rather the two pieces of hardware stay separate (even if on the same card, which could be interesting...) because, as time goes on, I would hope they would grow and optimize in their own way and not have to keep conforming to each other. But who knows. Hell, some of those graphics cards take up two slots anyway!
And as for the whole "soon they're going to have an AIPU!" thing... well, maybe. I mean gaming requires a lot more processing power than most typical applications (and usually this processing power can be pinpointed to a few specific areas: graphics (offloaded), physics (maybe offloaded), AI, etc.). I feel like eventually there will be a whole different level of machine required to play most games that come out on the market (this is already here as far as high-end settings go). I would love to see dedicated hardware making games even more realistic and amazing than they are today... If they could reduce the cost and maybe combine some of the processing units into the "Gaming Card", I would see it as an inevitable step in the future of gaming hardware.
Ok rant over. Sorry :(
----------------------------Yes, I know Dabacle is spelled wrong! :P
Quote: Original post by Dabacle
But I feel like the PhysX chip wouldn't necessarily have to take advantage of the advancements used in a CPU because of its specific nature. The CPU needs to do everything fast... the PPU only needs to do physics calculations fast. I'm sure if I ran a random OS on the PPU it would die... so those optimizations would help the hardware there. But we're talking about a piece of hardware designed to run an API written by the very same people that made the board. I feel like they can optimize the hell out of their code and their board (for physics at least) better than Intel/AMD can, considering Intel/AMD have no clue what is running on their CPU cores. I feel like AGEIA has it easier as far as optimizing hardware/software goes.
Excellent points. And it's quite likely that even against Core 2 Duo the PPU has a performance advantage.
But I hope we can all agree that Core 2 Duo won't be a weak CPU for physics processing. If there's only a small factor of difference between the two, then game developers will just aim for the performance level of a Core 2 Duo to cover a wide market, without risking disappointing gamers with weak physics. After all, they don't want to be framerate-limited by the physics processing either (PPU or no PPU), as was first reported for GRAW and likely holds down CellFactor as well. If gamers upgrade their graphics card, they expect higher framerates in every game. Physics is not important enough for them to sacrifice that, and only a few of them would consider paying another $250 to solve the 'problem'. Therefore game developers will play it safe and make sure they can run advanced physics on the CPU as well.
And from some of the analysis I've done, it's clear that there's still a lot of headroom for CPU optimizations, with SSE in particular. Even Intel itself could use all its inside knowledge to help e.g. Havok optimize for Core 2 Duo specifically. Intel realizes the importance of the gamer market; it's what made AMD popular as well.
Quote: I guess I'm defending the PhysX chip because I feel like offloading some specialized functionality is not necessarily a bad idea.
It's not a bad idea at all. After all, it's what made graphics cards popular. But unfortunately physics is not nearly as specialized as graphics. My whole point is that CPUs and/or GPUs can run physics ranging from "adequately" to "excellently". There's too little room left to justify spending $250 on something that specializes only in physics. It's quite ironic: PhysX specializes in physics, but physics is not as specialized as graphics, and graphics cards can do physics, but physics cards can't do anything but physics... Still with me? ;-) This is because a GPU combines highly general shader units with highly specific texture and rasterization units. A PPU only has general floating-point processing.
Quote: When it comes down to it, the people who need the high-end graphics cards (aside from graphic designers/animators) usually need high-end physics too, because they are playing games. If the graphics companies got together with AGEIA and combined the two, that would be great. I wouldn't want the product to be called a graphics card anymore... call it a Gaming Card or something. And I would rather the two pieces of hardware stay separate (even if on the same card, which could be interesting...) because, as time goes on, I would hope they would grow and optimize in their own way and not have to keep conforming to each other. But who knows. Hell, some of those graphics cards take up two slots anyway!
Like Demirug said that's already where GPUs are heading. Direct3D 10 cards will have everything needed to do physics processing efficiently. For the same money you'll get more physics and graphics processing than a GPU + PPU. Even for those not shy on money: one card, two cards, four cards, it's always more bang for the buck.