
physx chip

Started by April 20, 2006 07:42 PM
223 comments, last by GameDev.net 18 years, 8 months ago
Quote:
Original post by Saruman
Not at all, I ran the software version on a very powerful CPU, and AGEIA is very well known for having one of the fastest physics engines on the market in software as well... if not the fastest now that dual-core is becoming commonplace.

Well I checked something and got shocking results...

I used AMD CodeAnalyst to profile NovodeX Rocket during the 'Seige 2' demo. It's a cool demo in my opinion and runs at 60 FPS on a GeForce 6600 (non-GT) and an Athlon 64 X2 4400+ @ 2.4 GHz. So I expected it to be physics limited, but it turned out that 60% of the time was spent in the graphics driver and kernel, and only 30% in the NovodeX DLL. Of this 30%, half was spent in one function: NxBuildSmoothNormals. But the most shocking thing of all was that it didn't use a single SSE instruction, only x87!

Now, I don't know exactly how old the Rocket demo is and how things have evolved in the last few months, but clearly I haven't seen a fraction of what is actually possible on a CPU.
Quote:
I compiled it myself and ran it; nothing at all was crippled or changed, it was exactly the same.

So did you get a chance to see if it was fully SSE-optimized? If you did, do they use the uber-fast approximation instructions for division and square root, or the full-precision ones?
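To make that question concrete, here is a minimal C++ sketch (purely illustrative, not AGEIA's code) of the difference for something like vector normalization: the full-precision path uses sqrtps and divps, while the approximate path uses rsqrtps plus one Newton-Raphson step and is typically several times faster at slightly reduced precision.

#include <xmmintrin.h>  // SSE intrinsics

// Normalize four 3D vectors stored as structure-of-arrays (x[4], y[4], z[4]).
// Full-precision path: sqrtps + divps (exact, but expensive).
void normalize_full(__m128& x, __m128& y, __m128& z)
{
    __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                             _mm_mul_ps(z, z));
    __m128 invLen = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(len2));
    x = _mm_mul_ps(x, invLen);
    y = _mm_mul_ps(y, invLen);
    z = _mm_mul_ps(z, invLen);
}

// Approximate path: rsqrtps (about 12 bits of precision) refined with one
// Newton-Raphson iteration, which gets close to full single precision.
void normalize_fast(__m128& x, __m128& y, __m128& z)
{
    __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                             _mm_mul_ps(z, z));
    __m128 r = _mm_rsqrt_ps(len2);
    // Newton-Raphson step: r = 0.5 * r * (3 - len2 * r * r)
    __m128 half  = _mm_set1_ps(0.5f);
    __m128 three = _mm_set1_ps(3.0f);
    r = _mm_mul_ps(_mm_mul_ps(half, r),
                   _mm_sub_ps(three, _mm_mul_ps(len2, _mm_mul_ps(r, r))));
    x = _mm_mul_ps(x, r);
    y = _mm_mul_ps(y, r);
    z = _mm_mul_ps(z, r);
}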
Quote:
Why can't you render a game completely in software on the CPU instead of using a midrange GPU? Dedicated hardware is much faster at processing than a general purpose processor as has been echoed throughout the thread.

Yes, we've been through this, and I believe you missed my answer or I wasn't clear enough. GPUs are much faster at graphics mainly because they have rasterization units (computing coverage masks and interpolating coordinates) and texture samplers that work fully pipelined and in parallel. But when I wrote swShader it quickly became clear that CPUs are actually surprisingly fast at the arithmetic shader instructions. Rasterization and texture sampling took the bulk of execution time, while making the shader longer (without extra textures) didn't have that much of an impact. And this is logical: for arithmetic shader instructions, a CPU's GFLOPS are almost directly comparable to a GPU's GFLOPS. Obviously the current generation of GPUs has significantly more GFLOPS than any CPU, but it's still very roughly comparable, while other operations like sampling take ages on the CPU.
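As a rough back-of-envelope illustration of what 'comparable GFLOPS' means (the core count, clock and flops-per-cycle figures below are assumptions chosen for illustration, not measurements of any particular chip):

#include <cstdio>

int main()
{
    // Peak arithmetic throughput: cores * clock * flops per cycle.
    // Illustrative assumption: a dual-core 2.4 GHz CPU whose SSE units
    // retire roughly 4 single-precision flops per cycle per core.
    const double cores       = 2.0;
    const double clockGHz    = 2.4;
    const double flopsPerClk = 4.0;   // assumed; varies per microarchitecture
    std::printf("CPU peak: ~%.1f GFLOPS\n", cores * clockGHz * flopsPerClk);
    return 0;
}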

So, to repeat my original question: why is it that in AGEIA's PhysX demos the software version doesn't even get close to the PhysX version? And to clarify that with the above: it should be mostly SIMD operations, for which modern CPUs are not to be underestimated.
Quote:
What I saw was a game in development and not a demo at all, it was a full networked multiplayer FPS in development using PhysX software and hardware.

Fair enough. Can I ask what kind of revolutionary physics this game used? I'm mostly interested to know whether it was 'eye-candy' physics we could do without or 'gameplay' physics that is crucial and absolutely requires high processing power. Also, was this physics workload present all of the time or only periodically? Thanks.
Quote:
Original post by C0D1F1ED
I used AMD CodeAnalyst to profile NovodeX Rocket during the 'Seige 2' demo.

Quite honestly I've never run the Rocket demo and due to not seeing anything first hand I am unable to comment on that.

Quote:
So did you get a chance to see if it was fully SSE-optimized? If you did, do they use the uber-fast approximation instructions for division and square root, or the full-precision ones?

I'm not sure what you mean here? I meant that I compiled the actual game source that uses the actual PhysX SDK, I didn't have source to the API or anything like that.

Quote:
Fair enough. Can I ask what kind of revolutionary physics this game used? I'm mostly interested to know whether it was 'eye-candy' physics we could do without or 'gameplay' physics that is crucial and absolutely requires high processing power. Also, was this physics workload present all of the time or only periodically? Thanks.

I can't really say much other than that the game had both gameplay and eye-candy physics.
Everyone, to be technically honest, geometry and linear algebra are not physics.

Yet both the GPU and PPU use these subjects in the process of physical simulation.

What is the difference? Considering the ratio of complexity, the PPU is essentially an overclocked, headless GPU.

Anti-Wow.

Is Ageia public? If not, then no-one gets to know who really has a stake in its future.
Quote:
Original post by Saruman
Quite honestly I've never run the Rocket demo and due to not seeing anything first hand I am unable to comment on that.

It's a free download. CodeAnalyst is also free, and if you have an Intel CPU you can use VTune for 14 days.
Quote:
I'm not sure what you mean here? I meant that I compiled the actual game source that uses the actual PhysX SDK, I didn't have source to the API or anything like that.

I assumed you had the chance to profile the latest SDK with an actual game. You don't need the source to have a look at the code. In CodeAnalyst I simply double-clicked on the hotspots. I expected to see SSE instructions, but it was x87 code with expensive divisions and such. I also set an event trigger to see if it has any SSE code at all. Nope.
Quote:
I can't really say much other than that the game had both gameplay and eye-candy physics.

I understand. Thanks anyway!
Quote:
Original post by Anonymous Poster
huh? just because they have generalized buffers does not mean memory access mechanisms or patterns are going to change.

They will definitely optimize unsampled memory access. If it used the texture units, that would be a big waste. The R580 is already greatly hampered by its low number of texture units. Lots of big surfaces like walls still use shaders with a high ratio of texture samples. Also, with a unified architecture they have to be able to execute vertex, geometry and pixel shaders. So even though pixel shaders might evolve toward a 5:1 ratio, they clearly need massive bandwidth without sampling for vertex/geometry processing.

So I'm sorry but memory access mechanisms and patterns are most definitely going to change. And GPGPU processing, including physics, will benefit.
Quote:
The only real changes that are going to happen are adding the geometry/primitive shader stage and making it easier to shuffle memory around between stages. You're still going to be accessing memory coherently (the nice thing about textures and vertex buffers) and not that often (in the case of shaders).

Even for coherent accesses, the latency of texture sampling is high. Just think of anisotropic filtering, now considered standard. It uses several bilinear samples, so even if all accessed texels are in the texture cache it's going to take many clock cycles. Still, we observe that the performance hit of anisotropic filtering is very modest on modern hardware. So GPUs must be excellent at hiding long latencies. Ultra-Threading, as ATI calls it, is one effective approach.
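The same principle can be sketched in software: keep enough independent work in flight that the ALU never sits waiting on a single long-latency access. Here is a toy C++ analogy (the function name and the lookahead distance are made up for the sketch; this illustrates the principle, not how the GPU hardware actually does it):

#include <xmmintrin.h>  // _mm_prefetch
#include <cstddef>

// Toy latency hiding: while element i is being processed, the fetch for an
// element further ahead is already in flight, so arithmetic work overlaps
// the outstanding memory access instead of stalling on it.
float sum_with_prefetch(const float* data, std::size_t count)
{
    const std::size_t lookahead = 16;   // arbitrary distance, tune per platform
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (i + lookahead < count)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + lookahead),
                         _MM_HINT_T0);  // start the long-latency access early
        sum += data[i] * data[i];       // arithmetic overlaps the fetch
    }
    return sum;
}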

And again the essential part of it all for this discussion is that this is very beneficial for physics processing. They just have to combine it all into the next-generation Direct3D 10 graphics hardware.
Quote:
What do you think you'll be doing with those bound buffers? Accessing them in shaders, at an average of 5:1 arithmetic ops to texture ops, or you'll just have a stalling ALU...

They are not texture operations. They don't have to use the expensive sampling units at all. They are memory operations, and yes, they can read from textures, but it's totally different from sampling. If they do things right they could easily have a 1:1 ratio without stalling any ALU. The latencies are likely higher than those of arithmetic operations, but that's where Ultra-Threading ensures the ALUs are not idle.
Quote:
First, I'll try to find the interview with one of the head ATI engineers where he says they are moving toward higher ratios, as that is the real-world performance desired.

I already said that I believe this, and I think I even recall reading the actual interview when R580 was introduced.
Quote:
Second, binding buffers and accessing them in a shader unit are different things. Just because memory is available to you doesn't mean it is free to access. This (theoretically, because there is little literature on the PhysX card for us peons) would be the advantage of faster, more incoherent memory access.

The essential thing is that there's no technical reason to assume that next-generation GPUs would be significantly less efficient at accessing memory than a PPU. Note again my remark about unified architectures above. They just can't afford to have slow memory access. Texture access in pixel shaders, yes, that ratio is likely to change significantly, but unsampled accesses in vertex shaders, which run on the same unified shader units, need all the efficient memory access they can get.
Quote:
Don't know if English is your first language, but this comes off a bit as being an ass.

It's my fourth language, but I think I wrote exactly what I intended. My sincere apologies if it was offensive; that certainly was not the intention. GPU architectures are quite obviously going to change drastically for Direct3D 10. That was my only point. Just look at Xenos for a peek into the future. It in no way resembles the 'classical' Direct3D 9 architecture introduced by R300.
While reading some of the details in the DirectX 10 SDK documentation I bumped into this:
Quote:
A shader constant buffer is a specialized buffer resource. A single constant buffer can hold up to 4096 4-channel 32-bit elements. As many as 16 constant buffers can be bound to a shader stage simultaneously.

A buffer resource is just a memory array. Now, 4096 × 16 bytes × 16 = 1 MB. That's the size of a powerful CPU's cache! Given that caches take up so much die space (remember that the low-end Direct3D 10 chips have to support the same features) and we're talking about just constants here, it very much looks like they'll place it in RAM. So this means that every constant being used would be a memory access. They might place some small cache in between, but either way that doesn't change the fact that it's a memory access. The ratio of arithmetic instructions to constants being read must be close to 1:1, so they need the proper hardware to sustain this memory access rate. Also note that there are actually three shader stages, and balancing them on a unified architecture (possibly even running different shaders at the same time with threading) lowers access pattern coherency.
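For completeness, a tiny C++ check of that arithmetic; the 4096-element and 16-buffer figures come straight from the quoted Direct3D 10 documentation:

#include <cstdio>

int main()
{
    const long long elements        = 4096;   // 4-channel 32-bit elements per constant buffer
    const long long bytesPerElement = 4 * 4;  // 4 channels x 32 bits = 16 bytes
    const long long buffersPerStage = 16;     // buffers bindable to one shader stage
    const long long total = elements * bytesPerElement * buffersPerStage;
    std::printf("Max bound constant data per stage: %lld bytes (%lld KB)\n",
                total, total / 1024);         // 1,048,576 bytes = 1 MB
    return 0;
}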

If they access constants from memory then it only makes sense to implement the 'load' instruction using the same hardware elements. So you should be able to read from buffer resources at 1:1 ratio (or better) as well.

QED ;-)
Quote:
Original post by C0D1F1ED
I assumed you had the chance to profile the latest SDK with an actual game. You don't need the source to have a look at the code. In CodeAnalyst I simply double-clicked on the hotspots.

Ah yes, OK, that is what I meant. I was using the PhysX SDK, compiling/profiling an actual game, which is where I got most of my first-hand information.
Quote:
Original post by Saruman
Ah yes, OK, that is what I meant. I was using the PhysX SDK, compiling/profiling an actual game, which is where I got most of my first-hand information.

Cool. But did you get a chance to see how well it is optimized with SSE?
Quote:
Original post by Anonymous Poster
I followed this discussion for a few pages and then I lost track of it.
One question to C0D1F1ED though. Rendering HDR in HL2 had quite a hit on my 6800 GT. So if HL2 did its physics calculations on the GPU as well, what would happen?
Do you really expect next-gen games, which need to look next-gen to get sales, to be willing to downgrade how they look in order to process the physics on the GPU, so that their next-gen game is actually playable on next-gen graphics cards?

"Next-gen graphics cards" is such a blanket title. Empirically, the market calls for a wide range of capabilities per generation. The PPU only adds more flux to this. Kind of annoying.

SLI does this too. Surprise... They're the same thing (for the 3rd time?).

What happened to single slot multi-core graphics boards? Too hot? I bet Number 9 would have something to say about this if they were kickin... not to sound old, or as an electronic engineer. :)

That would be true progress IMHO.

I personally would use the SSE capabilities of the machine for my own CPU use.

[Edited by - taby on May 26, 2006 3:24:43 AM]
If the demo box had 2+ GPUs installed, I wonder how it would compare to a GPU / 1+ PPU box, in terms of performance vs cost?

NVIDIA published a general-purpose GPU math toolkit for the Quadro line of cards. Where is the one for the 6800 GT? It magically disappeared out of possibility as soon as Ageia came along. Hrm.

