
PhysX chip

Started by April 20, 2006 07:42 PM
223 comments, last by GameDev.net 18 years, 5 months ago
Quote: Original post by Saruman
Hey, I totally understand; I actually used to be sceptical of it in the beginning. Once I saw an engine and benchmarks running the same application both in software and on the PhysX chip, I realized how uber it is :)

I'm sure it was impressive. But can I ask, did you never have even the slightest feeling that a powerful CPU and fully optimized software might be able to do almost the same thing? Was there no doubt at all that maybe they crippled the software version one way or another to make PhysX look "uber" in comparison?

I mean, let's look at the raw numbers once more. With a PhysX chip at 500 MHz with 8 SIMD units, each capable of 4+1 floating-point operations per clock, we end up with 20 GFLOPS (AGEIA specifies 20 giga-instructions). With a Core 2 Duo at 3.0 GHz you get 48 GFLOPS. Even if you consider only one core to be available for physics, that's still 24 GFLOPS. So why is it that in AGEIA's PhysX demos the software version doesn't even get close to the PhysX version?
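
Just to spell out the arithmetic behind those numbers (the 4+1 operations per SIMD unit per clock comes from AGEIA's own figures; the 8 single-precision FLOPs per core per clock for the Core 2 is my assumption behind the 48 GFLOPS figure):

  PhysX PPU:  500 MHz x 8 SIMD units x 5 FLOPs/clock = 20 GFLOPS
  Core 2 Duo: 3.0 GHz x 2 cores x 8 FLOPs/clock      = 48 GFLOPS
  One core:   3.0 GHz x 1 core  x 8 FLOPs/clock      = 24 GFLOPS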

And let's face it, what you saw was a demo specifically for showcasing PhysX by stressing it to the maximum, not an actual game. Looking at GRAW, I still find it hard to believe that those explosions, with a few times more particles, couldn't be computed on today's CPUs. And AGEIA's own NovodeX Rocket demo shows hundreds to thousands of objects at 300+ FPS without maxing out my dual-core Athlon.
Quote: Original post by C0D1F1ED
I'm sure it was impressive. But can I ask, did you never have even the slightest feeling that a powerful CPU and fully optimized software might be able to do almost the same thing?

Not at all. I ran the software version on a very powerful CPU, and AGEIA is well known for having one of the fastest physics engines on the market in software as well... if not the fastest now that dual-core is becoming commonplace.

Quote: Was there no doubt at all that maybe they crippled the software version one way or another to make PhysX look "uber" in comparison?

I compiled it myself and ran it; nothing at all was crippled or changed, it was exactly the same.

Quote:
So why is it that in AGEIA's PhysX demos the software version doesn't even get close to the PhysX version?

Why can't you render a game completely in software on the CPU instead of using a midrange GPU? Dedicated hardware is much faster at processing than a general-purpose processor, as has been echoed throughout the thread.

Quote: And let's face it, what you saw was a demo specifically for showcasing PhysX by stressing it to the maximum, not an actual game.

What I saw was not a demo at all but a game in development: a full networked multiplayer FPS using both PhysX software and hardware.
Quote: Original post by Anonymous Poster
no offense, but this comment seems rather silly. a new API isn't going to fundamentally change the architecture of the chip.

No offense, but did you pay attention during the R300 (Radeon 9700) versus NV30 (Geforce FX 5800) era? NV30 was, despite its heat production and unbearable fan noise, very competitive for Direct3D 8 but totally crap for Direct3D 9. R300 had a revolutionary architecture while NV30 was almost like an NV25 with floating-point operations. NVIDIA had to create NV40 from the ground up (with great success, by the way) to join the Direct3D 9 party.

Direct3D 10 demands another major change in architecture. It would be foolish for NVIDIA/ATI not to have invested a great deal of R&D into it since Microsoft started to draft up the specifications. Heck, it's probably NVIDIA/ATI who tell Microsoft what they want.
Quote: furthermore, card designers are going to move toward real life shader usage, ie that 5:1 ratio, not away from it. its just cost/benefit, transistor/benefit, whatever you want to call it.

Well I can certainly believe that the arithmetic:texture ratio will go to 5:1. But like I said before, Direct3D 10 allows you to access memory directly without sampling. So instead of calling SetPixelShaderConstantF all the time you can just bind arbitrary memory buffers, allowing many new possibilities. This will clearly lower the ratio so they'll have to make sure the architecture is capable of feeding the data efficiently. And this greatly benefits GPGPU processing including physics as well.
Quote: witness floating point buffers. 32bit/component textures are unfiltered; any ops you want to do with them you have to do manually through shaders... but it's still through samplers.

I hope it's clear now that's all going to change?
Quote: Original post by Saruman
Not at all. I ran the software version on a very powerful CPU, and AGEIA is well known for having one of the fastest physics engines on the market in software as well... if not the fastest now that dual-core is becoming commonplace.

Well I checked something and got shocking results...

I used AMD CodeAnalyst to profile NovodeX Rocket during the 'Seige 2' demo. It's a cool demo in my opinion and runs at 60 FPS on a Geforce 6600 (non-GT) and Athlon 64 X2 4400+ @ 2.4 GHz. So I expected it to be physics limited, but it turned out that 60% of the time was spent in the graphics driver and kernel, and only 30% in the NovodeX DLL. Of this 30%, half was spent in one function: NxBuildSmoothNormals. But the most shocking thing of all was that it didn't use a single SSE instruction, only x87!

Now, I don't know exactly how old the Rocket demo is or how things have evolved in the last months, but clearly I haven't seen a fraction of what is actually possible on a CPU.
Quote: I compiled it myself and ran it; nothing at all was crippled or changed, it was exactly the same.

So did you get a chance to see if it was fully SSE optimized? If you did, do they use the uber-fast approximation instructions for division and square root or the full-precision ones?
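
To make that question concrete, here's the kind of difference I mean (just a sketch with made-up function names, not AGEIA's actual code): normalizing four vectors at once with the full-precision SSE instructions versus the fast approximation instruction plus one refinement step.

#include <xmmintrin.h>

// Normalize four 3D vectors at once (SoA layout: four x, four y, four z).
// Full-precision path: sqrtps + divps, accurate but comparatively slow.
void normalize4_full(__m128& x, __m128& y, __m128& z)
{
    __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)), _mm_mul_ps(z, z));
    __m128 len = _mm_sqrt_ps(len2);
    x = _mm_div_ps(x, len);
    y = _mm_div_ps(y, len);
    z = _mm_div_ps(z, len);
}

// Approximate path: rsqrtps gives roughly 12 bits of precision in a few cycles;
// one Newton-Raphson step (r = r * (1.5 - 0.5 * len2 * r * r)) is usually
// plenty for physics and avoids the expensive square root and divisions.
void normalize4_approx(__m128& x, __m128& y, __m128& z)
{
    __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)), _mm_mul_ps(z, z));
    __m128 r = _mm_rsqrt_ps(len2);
    r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(1.5f),
        _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), len2), _mm_mul_ps(r, r))));
    x = _mm_mul_ps(x, r);
    y = _mm_mul_ps(y, r);
    z = _mm_mul_ps(z, r);
}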
Quote: Why can't you render a game completely in software on the CPU instead of using a midrange GPU? Dedicated hardware is much faster at processing than a general-purpose processor, as has been echoed throughout the thread.

Yes, we've been through this, and I believe you missed my answer or I wasn't clear enough. GPUs are much faster at graphics mainly because they have rasterization units (computing coverage masks and interpolating coordinates) and texture samplers that work fully pipelined and in parallel. But when I wrote swShader it quickly became clear that CPUs are actually surprisingly fast at the arithmetic shader instructions. Rasterization and texture sampling took the bulk of execution time, while making the shader longer (without extra textures) didn't have that much of an impact. And this is logical: a CPU's GFLOPS are almost directly comparable to a GPU's GFLOPS for arithmetic shader instructions. Obviously the current generation of GPUs has significantly more GFLOPS than any CPU, but they're still roughly comparable, while other operations like sampling take ages on a CPU.

So, to repeat my original question: Why is it that in AGEIA's PhysX demos the software version doesn't even get close to the PhysX version? And to clarify that with the above: It should be mostly SIMD operations, for which modern CPUs are not to be underestimated.
Quote: What I saw was not a demo at all but a game in development: a full networked multiplayer FPS using both PhysX software and hardware.

Fair enough. Can I ask what kind of revolutionary physics this game used? I'm mostly interested to know whether it was 'eye-candy' physics we could do without or 'gameplay' physics that is crucial and absolutely requires high processing power. Also, was this physics workload present all of the time or only periodically? Thanks.
Quote: Original post by C0D1F1ED
I used AMD CodeAnalyst to profile NovodeX Rocket during the 'Seige 2' demo.

Quite honestly I've never run the Rocket demo, and since I haven't seen it first hand I can't comment on it.

Quote: So did you get a chance to see if it was fully SSE optimized? If you did, do they use the uber-fast approximation instructions for division and square root or the full-precision ones?

I'm not sure what you mean here. I meant that I compiled the actual game source that uses the actual PhysX SDK; I didn't have source to the API or anything like that.

Quote: Fair enough. Can I ask what kind of revolutionary physics this game used? I'm mostly interested to know whether it was 'eye-candy' physics we could do without or 'gameplay' physics that is crucial and absolutely requires high processing power. Also, was this physics workload present all of the time or only periodically? Thanks.

I can't really say much other than that the game had both gameplay and eye-candy physics.
Everyone, to be technically honest, geometry and linear algebra are not physics.

Yet both the GPU and PPU use these subjects in the process of physical simulation.

What is the difference? The PPU is an overclocked headless GPU when the ratio of complexity is taken into consideration.

Anti-Wow.

Is Ageia public? If not, then no-one gets to know who really has a stake in its future.
Quote: Original post by C0D1F1ED
Direct3D 10 demands another major change in architecture. It would be foolish for NVIDIA/ATI not to have invested a great deal of R&D into it since Microsoft started to draft up the specifications. Heck, it's probably NVIDIA/ATI who tell Microsoft what they want.

huh? just because they have generalized buffers does not mean memory access mechanisms or patterns are going to change. the only real changes that are going to happen are adding the geometry/primitive shader stage and making it easier to shuffle memory around between stages. you're still going to be accessing memory coherently (nice thing about textures and vertex buffers) and not that often (in the case of shaders)

Quote: Well I can certainly believe that the arithmetic:texture ratio will go to 5:1. But like I said before, Direct3D 10 allows you to access memory directly without sampling. So instead of calling SetPixelShaderConstantF all the time you can just bind arbitrary memory buffers, allowing many new possibilities.

what do you think you'll be doing with those bound buffers? accessing them in shaders, at an average of 5:1 arithmetic ops to texture ops, or you'll just have a stalling alu...

Quote: This will clearly lower the ratio so they'll have to make sure the architecture is capable of feeding the data efficiently. And this greatly benefits GPGPU processing including physics as well.

first, i'll try and find the interview with one of the head ati engineers where he says they are moving toward higher ratios, as that is the real world performance desired. second, binding buffers and accessing them in a shader unit are different things. just because memory is available to you doesn't mean it is free to access. this (theoretically, because there is little literature on the physx card for us peons) would be the advantage of faster, more incoherent memory access.

that's about the limit of the stuff i know on the physx card though...like i said, kylotan may have meant something different.

Quote: I hope it's clear now that's all going to change?

don't know if english is your first language, but this comes off a bit as being an ass.
Quote: Original post by Saruman
Quite honestly I've never run the Rocket demo, and since I haven't seen it first hand I can't comment on it.

It's a free download. CodeAnalyst is free as well, and if you have an Intel CPU you can use VTune for 14 days.
Quote: I'm not sure what you mean here. I meant that I compiled the actual game source that uses the actual PhysX SDK; I didn't have source to the API or anything like that.

I assumed you had the chance to profile the latest SDK with an actual game. You don't need the source to have a look at the code. In CodeAnalyst I simply double-clicked on the hotspots. I expected to see SSE instructions, but it was x87 code with expensive divisions and such. I also set an event trigger to see if it executed any SSE code at all. Nope.
Quote: I can't really say much other than that the game had both gameplay and eye-candy physics.

I understand. Thanks anyway!
Quote: Original post by Anonymous Poster
huh? just because they have generalized buffers does not mean memory access mechanisms or patterns are going to change.

They will definitely optimize unsampled memory access. If it went through the texture units it would be a big waste. The R580 is already greatly hampered by its low number of texture units. Lots of big surfaces like walls still use shaders with a high ratio of texture samples. Also, with a unified architecture they have to be able to execute vertex, geometry and pixel shaders. So even though pixel shaders might evolve to a 5:1 ratio, they clearly need massive bandwidth without sampling for vertex/geometry processing.

So I'm sorry but memory access mechanisms and patterns are most definitely going to change. And GPGPU processing, including physics, will benefit.
Quote: the only real changes that are going to happen are adding the geometry/primitive shader stage and making it easier to shuffle memory around between stages. you're still going to be accessing memory coherently (nice thing about textures and vertex buffers) and not that often (in the case of shaders)

Even for coherent accesses the latency of texture sampling is high. Just think of anisotropic filtering, now considered standard. It uses several bilinear samples, so even if all accessed texels are in the texture cache it's going to take many clock cycles. Still, we observe that the performance hit of anisotropic filtering is very modest on modern hardware. So GPUs must be excellent at hiding long latencies. Ultra-Threading, as ATI calls it, is one effective approach.

And again the essential part of it all for this discussion is that this is very beneficial for physics processing. They just have to combine it all into the next-generation Direct3D 10 graphics hardware.
Quote: what do you think you'll be doing with those bound buffers? accessing them in shaders, at an average of 5:1 arithmetic ops to texture ops, or you'll just have a stalling alu...

They are not texture operations. They don't have to use the expensive sampling units at all. They are memory operations, and yes, they can read from textures, but it's totally different from sampling. If they do things right they could easily have a 1:1 ratio without stalling any ALU. The latencies are likely higher than for arithmetic operations, but that's where Ultra-Threading ensures the ALUs are not idle.
Quote: first, i'll try and find the interview with one of the head ati engineers where he says they are moving toward higher ratios, as that is the real world performance desired.

I already said that I believe this, and I think I even recall reading the actual interview when R580 was introduced.
Quote: second, binding buffers and accessing them in a shader unit are different things. just because memory is available to you doesn't mean it is free to access. this (theoretically, because there is little literature on the physx card for us peons) would be the advantage of faster, more incoherent memory access.

The essential thing is that there's no technical reason to assume that next-generation GPUs would be significantly less efficient at accessing memory than a PPU. Note again my remark about unified architectures above. They just can't afford slow memory access. Texture access in pixel shaders, yes, that ratio is likely to change significantly, but unsampled access in vertex shaders running on the same unified shader units needs all the efficient memory access it can get.
Quote: don't know if english is your first language, but this comes off a bit as being an ass.

It's my fourth language, but I think I wrote exactly what I intended. My sincere apologies if it came across as offensive; that certainly was not the intention. GPU architectures are quite obviously going to change drastically for Direct3D 10. That was my only point. Just look at Xenos for a peek into the future. It in no way resembles the 'classical' Direct3D 9 architecture introduced by R300.
While reading some of the details in the DirectX 10 SDK documentation I bumped into this:
Quote: A shader constant buffer is a specialized buffer resource. A single constant buffer can hold up to 4096 4-channel 32-bit elements. As many as 16 constant buffers can be bound to a shader stage simultaneously.

A buffer resource is just a memory array. Now, 4096 x 16B x 16 = 1 MB. That's the size of a powerful CPU's cache! Given that caches take so much die space (remember that the low-end Direct3D 10 chips have to support the same features) and we're talking about just constants here, it very much looks like they'll place it in RAM. So this means that every constant being used would be a memory access. They might place some small cache in between, but either way that doesn't change the fact that it's a memory access. The ratio of arithmetic instructions to constant reads must be close to 1:1, so they need the proper hardware to sustain this memory access rate. Also note that there are actually three shader stages, and balancing them on a unified architecture (possibly even running different shaders at the same time with threading) lowers access pattern coherency.

If they access constants from memory, then it only makes sense to implement the 'load' instruction using the same hardware elements. So you should be able to read from buffer resources at a 1:1 ratio (or better) as well.
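
For reference, this is roughly what creating and binding one of those constant buffers looks like through the Direct3D 10 API as I read the SDK docs (a sketch I haven't compiled; the function name and the slot choice are just for illustration):

#include <d3d10.h>

// Create a constant buffer of 'float4Count' 4-channel 32-bit elements
// (at most 4096, i.e. 64 KB) and bind it to vertex shader slot 0.
ID3D10Buffer* CreateAndBindConstants(ID3D10Device* device, const void* data, UINT float4Count)
{
    D3D10_BUFFER_DESC desc = {0};
    desc.ByteWidth = float4Count * 16;           // 16 bytes per 4-channel 32-bit element
    desc.Usage = D3D10_USAGE_DEFAULT;
    desc.BindFlags = D3D10_BIND_CONSTANT_BUFFER;

    D3D10_SUBRESOURCE_DATA init = {0};
    init.pSysMem = data;

    ID3D10Buffer* buffer = 0;
    if (FAILED(device->CreateBuffer(&desc, &init, &buffer)))
        return 0;

    device->VSSetConstantBuffers(0, 1, &buffer); // one of the 16 slots available per shader stage
    return buffer;
}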

QED ;-)

This topic is closed to new replies.
