
Vertex Buffer vs Structured Buffer

Started by January 04, 2018 02:50 PM
10 comments, last by galop1n 7 years, 1 month ago

So I was contemplating how to instance skinned geometry and came across this DICE paper (http://www.dice.se/wp-content/uploads/2014/12/GDC11_DX11inBF3_Public.pdf).
I took some time to compare the performance of instancing via vertex buffer binding versus StructuredBuffer+SV_InstanceID.

To be a bit more specific, here's what I compared:
Version A: bind a buffer containing a float4x4 per instance as a vertex buffer, passed to the vertex shader via the input layout.
Version B: set a structured buffer containing a float4x4 per instance as a shader resource view; look it up in the vertex shader by SV_InstanceID.
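For illustration, here is a minimal sketch of the two vertex shader variants. The semantic names, register slots, and `gViewProj` constant are my own assumptions, not taken from the original post:

```hlsl
cbuffer Camera : register(b0)
{
    float4x4 gViewProj;
};

// Version A: per-instance float4x4 delivered through the input layout
// (four float4 elements declared with D3D11_INPUT_PER_INSTANCE_DATA).
struct VSInputA
{
    float3 position : POSITION;
    float4 world0   : WORLD0;   // rows of the per-instance world matrix
    float4 world1   : WORLD1;
    float4 world2   : WORLD2;
    float4 world3   : WORLD3;
};

float4 VSMainA(VSInputA input) : SV_Position
{
    float4x4 world = float4x4(input.world0, input.world1, input.world2, input.world3);
    return mul(float4(input.position, 1.0f), mul(world, gViewProj));
}

// Version B: per-instance float4x4 fetched from a StructuredBuffer by SV_InstanceID.
StructuredBuffer<float4x4> gInstanceTransforms : register(t0);

float4 VSMainB(float3 position : POSITION, uint instanceID : SV_InstanceID) : SV_Position
{
    float4x4 world = gInstanceTransforms[instanceID];
    return mul(float4(position, 1.0f), mul(world, gViewProj));
}
```

Version A needs the extra input layout elements and a second vertex buffer binding; Version B only needs the SRV bound and the system-generated SV_InstanceID.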

To my surprise, the StructuredBuffer version comes out ~30% slower on my GTX 1080.

Does anyone have a similar experience?
I'm wondering if there's any way to optimize this, else DICE's recommendation leaves me puzzled :|

I suppose I'm still going to use the StructuredBuffer to get around the constant buffer size limitation for bone matrices,
but it'd be so nice to have fewer shader permutations...
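The bone matrix use case mentioned above could look like the following sketch — a StructuredBuffer has no 64 KB constant buffer ceiling, so one buffer can hold every instance's bone palette. The names (`gBoneMatrices`, `gBoneOffset`) and register slots are hypothetical:

```hlsl
// Bone palette in a StructuredBuffer instead of a cbuffer: not limited to
// 4096 float4s, so all instances' bones can live in a single buffer.
StructuredBuffer<float4x4> gBoneMatrices : register(t1);

cbuffer PerDraw : register(b1)
{
    uint gBoneOffset;   // start of this instance's palette in the buffer
};

float4x4 SkinMatrix(uint4 indices, float4 weights)
{
    return gBoneMatrices[gBoneOffset + indices.x] * weights.x
         + gBoneMatrices[gBoneOffset + indices.y] * weights.y
         + gBoneMatrices[gBoneOffset + indices.z] * weights.z
         + gBoneMatrices[gBoneOffset + indices.w] * weights.w;
}
```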

What is your geometry? How many triangles per instance, how many instances? Are they optimized for vertex cache? Is the 30% only from that specific draw or for the full frame? How did you measure (GPU marker, RenderDoc, frame delta time?)

 

On AMD hardware, a vertex shader reading a structured buffer or a vertex buffer from instance id would look at best identical and at worst extremely similar, but it is hard to tell on nVidia though. They still have a fast path for constant buffers that can outperform regular buffers too, but it is usually not worth the effort of a double implementation and maintenance, plus the size limitation.


I had a similar experience as you with Nvidia, I wrote a blog about it: https://turanszkij.wordpress.com/2017/06/05/should-we-get-rid-of-vertex-buffers/

AMD was quite the same as using the input assembler. :)

1 hour ago, turanszkij said:

AMD was quite the same as using the input assembler.

There is no input assembler on AMD hardware; a shader is patched to branch to a fetch shader at the beginning, reading vertex buffers as regular buffers and using conversion intrinsics to fill registers. On DX12, the PSO approach allows inlining the fetch shader into your shaders, possibly improving latency hiding and register pressure.

 

For nVidia, we have less knowledge of the internals (I still have the side task of documenting their assembly for myself; it is visible in PIX). But you can usually assume that if you run fast enough on AMD, then nVidia is not a concern :) It is sad, but it is the best you can do without more insight into what to optimize on their GPUs.

1 hour ago, galop1n said:

There is no input assembler on AMD hardware; a shader is patched to branch to a fetch shader at the beginning, reading vertex buffers as regular buffers and using conversion intrinsics to fill registers.

Yes, I know, but DX11 has the notion of the IA; that's why I wrote it that way.

1 hour ago, galop1n said:

On DX12, the PSO approach allows inlining the fetch shader into your shaders

Are there any steps that need to be done for that, or is it done by default? Is it true for PC as well / is it documented for the public? I know the consoles let you do it anyway..

1 hour ago, galop1n said:

For nVidia, we have less knowledge of the internals (I still have the side task of documenting their assembly for myself; it is visible in PIX). But you can usually assume that if you run fast enough on AMD, then nVidia is not a concern. It is sad, but it is the best you can do without more insight into what to optimize on their GPUs.

PIX only has the intermediate assembly though, no? Or does the new PIX have it? I haven't had the chance to try that yet.

In my experience it's not true that if it runs OK on AMD then it will on Nvidia, because Nvidia performed much worse for me when I inlined the vertex fetches by hand (in DX11). And it was only a simple Sponza test scene...

18 minutes ago, turanszkij said:

Yes, I know, but DX11 has the notion of the IA; that's why I wrote it that way.

Are there any steps that need to be done for that, or is it done by default? Is it true for PC as well / is it documented for the public? I know the consoles let you do it anyway..

PIX only has the intermediate assembly though, no? Or does the new PIX have it? I haven't had the chance to try that yet.

In my experience it's not true that if it runs OK on AMD then it will on Nvidia, because Nvidia performed much worse for me when I inlined the vertex fetches by hand (in DX11). And it was only a simple Sponza test scene...

To get vendor disassembly in PIX (the DX12-only one), I believe that for AMD the driver is all you need, and for nVidia you can request the disassembly DLL if you are a registered developer: https://developer.nvidia.com/shader-disasm

The fetch shader inlining is always on from what I have seen so far with AMD DX12, because the PSO is statically bound to a unique input layout and has a guarantee that the compile happens at creation.

When I said, if it runs fine on AMD then don't worry about nVidia, it is more like: if you achieve your performance target on AMD, even if you do something counterproductive on nVidia, you probably still outperform over the full frame, so no big deal. I would never bind a per-instance vertex buffer ever again (unless it is for a very specialized technique) because it is cumbersome, less flexible (try to add extra instance params?), slower on the CPU, and it is notorious that AMD is way worse on vertex waves than nVidia in the first place anyway…

18 hours ago, galop1n said:

What is your geometry? How many triangles per instance, how many instances? Are they optimized for vertex cache? Is the 30% only from that specific draw or for the full frame? How did you measure (GPU marker, RenderDoc, frame delta time?)

 

On AMD hardware, a vertex shader reading a structured buffer or a vertex buffer from instance id would look at best identical and at worst extremely similar, but it is hard to tell on nVidia though. They still have a fast path for constant buffers that can outperform regular buffers too, but it is usually not worth the effort of a double implementation and maintenance, plus the size limitation.

Geometry: cube with 24 vertices, 36 indices, TriangleList
Number of instances: 512
Measured with: QueryPerformanceCounter
Test A: measure time spent binding vertex/index/instance buffers + DrawIndexedInstanced
Test B: measure time spent binding vertex/index buffers and the structured buffer shader resource view + DrawIndexedInstanced
 

18 hours ago, turanszkij said:

I had a similar experience as you with Nvidia, I wrote a blog about it: https://turanszkij.wordpress.com/2017/06/05/should-we-get-rid-of-vertex-buffers/

AMD was quite the same as using the input assembler.

Very interesting... Your results match my own findings.
And thanks for the reply; it feels good not to be alone with such an iffy problem.

30 minutes ago, chlerub said:

Geometry: cube with 24 vertices, 36 indices, TriangleList
Number of instances: 512
Measured with: QueryPerformanceCounter
Test A: measure time spent binding vertex/index/instance buffers + DrawIndexedInstanced
Test B: measure time spent binding vertex/index buffers and the structured buffer shader resource view + DrawIndexedInstanced
 

You can't measure GPU time using QueryPerformanceCounter. All you've done is measure how long it takes to issue the API calls, no?

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group
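To time the GPU side of the draw, D3D11 timestamp queries can be used instead of QueryPerformanceCounter. The sketch below assumes an existing `device`/`context` and the cube test from above (36 indices, 512 instances); error handling and Release calls are omitted:

```cpp
// GPU-side timing with D3D11 timestamp queries.
D3D11_QUERY_DESC desc = {};
desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
ID3D11Query *disjoint, *tsBegin, *tsEnd;
device->CreateQuery(&desc, &disjoint);
desc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&desc, &tsBegin);
device->CreateQuery(&desc, &tsEnd);

context->Begin(disjoint);
context->End(tsBegin);                        // timestamp before the draw
context->DrawIndexedInstanced(36, 512, 0, 0, 0);
context->End(tsEnd);                          // timestamp after the draw
context->End(disjoint);

// Results arrive a frame or two later; poll until ready.
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
UINT64 begin = 0, end = 0;
while (context->GetData(disjoint, &dj, sizeof(dj), 0) != S_OK) {}
context->GetData(tsBegin, &begin, sizeof(begin), 0);
context->GetData(tsEnd, &end, sizeof(end), 0);
if (!dj.Disjoint)
{
    double gpuMs = double(end - begin) / double(dj.Frequency) * 1000.0;
}
```

In practice you would buffer the queries per frame rather than spin-waiting, so the readback does not stall the pipeline.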

Hm, yeah, I guess so. But when the API calls alone make such a huge difference... oh well.

This topic is closed to new replies.
