
Vertex Buffer vs Structured Buffer

Started by January 04, 2018 02:50 PM
10 comments, last by galop1n 7 years, 1 month ago

So I was contemplating how to instance skinned geometry and came across this DICE paper (http://www.dice.se/wp-content/uploads/2014/12/GDC11_DX11inBF3_Public.pdf).
I took some time to compare the performance of instancing via vertex buffer binding versus StructuredBuffer+SV_InstanceID.

To be a bit more specific, here's what I compared:
Version A: bind a buffer containing a float4x4 per instance as a vertex buffer, passed to the vertex shader via the input layout.
Version B: set a structured buffer containing a float4x4 per instance as a shader resource view; look it up in the vertex shader by SV_InstanceID.
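For illustration, here is a minimal sketch of the two vertex shader variants. The semantic names, register slots, and `gViewProj` constant are my own assumptions, not taken from the original post:

```hlsl
cbuffer Camera : register(b0)
{
    float4x4 gViewProj;
};

// Version A: per-instance float4x4 delivered through the input layout
// (four float4 elements declared with D3D11_INPUT_PER_INSTANCE_DATA).
struct VSInputA
{
    float3 position : POSITION;
    float4 world0   : WORLD0;   // rows of the per-instance world matrix
    float4 world1   : WORLD1;
    float4 world2   : WORLD2;
    float4 world3   : WORLD3;
};

float4 VSMainA(VSInputA input) : SV_Position
{
    float4x4 world = float4x4(input.world0, input.world1, input.world2, input.world3);
    return mul(float4(input.position, 1.0f), mul(world, gViewProj));
}

// Version B: per-instance float4x4 fetched from a StructuredBuffer by SV_InstanceID.
StructuredBuffer<float4x4> gInstanceTransforms : register(t0);

float4 VSMainB(float3 position : POSITION, uint instanceID : SV_InstanceID) : SV_Position
{
    float4x4 world = gInstanceTransforms[instanceID];
    return mul(float4(position, 1.0f), mul(world, gViewProj));
}
```

Version A needs the extra input layout elements and a second vertex buffer binding; Version B only needs the SRV bound and the system-generated SV_InstanceID.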

To my surprise, the StructuredBuffer version comes out ~30% slower on my GTX 1080.

Does anyone have a similar experience?
I'm wondering if there's any way to optimize this, else DICE's recommendation leaves me puzzled :|

I suppose I'm still going to use the StructuredBuffer to get around the constant buffer size limitation for bone matrices,
but it'd be so nice to have fewer shader permutations...
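The bone matrix use case mentioned above could look like the following sketch — a StructuredBuffer has no 64 KB constant buffer ceiling, so one buffer can hold every instance's bone palette. The names (`gBoneMatrices`, `gBoneOffset`) and register slots are hypothetical:

```hlsl
// Bone palette in a StructuredBuffer instead of a cbuffer: not limited to
// 4096 float4s, so all instances' bones can live in a single buffer.
StructuredBuffer<float4x4> gBoneMatrices : register(t1);

cbuffer PerDraw : register(b1)
{
    uint gBoneOffset;   // start of this instance's palette in the buffer
};

float4x4 SkinMatrix(uint4 indices, float4 weights)
{
    return gBoneMatrices[gBoneOffset + indices.x] * weights.x
         + gBoneMatrices[gBoneOffset + indices.y] * weights.y
         + gBoneMatrices[gBoneOffset + indices.z] * weights.z
         + gBoneMatrices[gBoneOffset + indices.w] * weights.w;
}
```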

What is your geometry? How many triangles per instance, how many instances? Are they optimized for vertex cache? Is the 30% only from that specific draw or for the full frame? How did you measure (GPU marker, RenderDoc, frame delta time?)

 

On AMD hardware, a vertex shader reading a structured buffer or a vertex buffer from instance id would look at best identical and at worst extremely similar, but it is hard to tell on nVidia though. They still have a fast path for constant buffers that can outperform regular buffers too, but it is usually not worth the effort of a double implementation and maintenance, plus the size limitation.


I had a similar experience as you with Nvidia, I wrote a blog about it: https://turanszkij.wordpress.com/2017/06/05/should-we-get-rid-of-vertex-buffers/

AMD was quite the same as using the input assembler. :)

1 hour ago, turanszkij said:

AMD was quite the same as using the input assembler.

There is no input assembler on AMD hardware; a shader is patched to branch to a fetch shader at the beginning, reading vertex buffers as regular buffers and using conversion intrinsics to fill registers. On DX12, the PSO approach allows inlining the fetch shader into your shaders, possibly improving latency hiding and register pressure.

 

For nVidia, we have less knowledge of the internals (I still have the side task of documenting their assembly for myself; it is visible in PIX). But you can usually assume that if you run fast enough on AMD, then nVidia is not a concern :) It is sad, but it is the best you can do without more insight into what to optimize on their GPUs.

1 hour ago, galop1n said:

There is no input assembler on AMD hardware; a shader is patched to branch to a fetch shader at the beginning, reading vertex buffers as regular buffers and using conversion intrinsics to fill registers.

Yes, I know, but DX11 has the notion of the IA; that's why I wrote it that way.

1 hour ago, galop1n said:

On DX12, the PSO approach allows inlining the fetch shader into your shaders

Are there any steps that need to be done for that, or is it done by default? Is it true for PC as well / is it documented for the public? I know the consoles let you do it anyway..

1 hour ago, galop1n said:

For nVidia, we have less knowledge of the internals (I still have the side task of documenting their assembly for myself; it is visible in PIX). But you can usually assume that if you run fast enough on AMD, then nVidia is not a concern. It is sad, but it is the best you can do without more insight into what to optimize on their GPUs.

PIX only has the intermediate assembly though, no? Or does the new PIX have it? I haven't had the chance to try that yet.

In my experience it's not true that if it runs OK on AMD then it will on Nvidia, because Nvidia performed much worse for me when I inlined the vertex fetches by hand (in DX11). And it was only a simple Sponza test scene...

18 minutes ago, turanszkij said:

Yes, I know, but DX11 has the notion of the IA; that's why I wrote it that way.

Are there any steps that need to be done for that, or is it done by default? Is it true for PC as well / is it documented for the public? I know the consoles let you do it anyway..

PIX only has the intermediate assembly though, no? Or does the new PIX have it? I haven't had the chance to try that yet.

In my experience it's not true that if it runs OK on AMD then it will on Nvidia, because Nvidia performed much worse for me when I inlined the vertex fetches by hand (in DX11). And it was only a simple Sponza test scene...

To get vendor disassembly in PIX (the DX12-only one), I believe that for AMD the driver is all you need, and for nVidia you can request the disassembly DLL if you are a registered developer: https://developer.nvidia.com/shader-disasm

The fetch shader inlining is always on from what I have seen so far with AMD DX12, because the PSO is statically bound to a unique input layout and has a guarantee that the compile happens at creation.

When I said, if it runs fine on AMD then don't worry about nVidia, it is more like: if you achieve your performance target on AMD, even if you do something counterproductive on nVidia, you probably still outperform over the full frame, so no big deal. I would never bind a per-instance vertex buffer ever again (unless it is for a very specialized technique) because it is cumbersome, less flexible (try to add extra instance params?), slower on the CPU, and it is notorious that AMD is way worse on vertex waves than nVidia in the first place anyway…

18 hours ago, galop1n said:

What is your geometry? How many triangles per instance, how many instances? Are they optimized for vertex cache? Is the 30% only from that specific draw or for the full frame? How did you measure (GPU marker, RenderDoc, frame delta time?)

 

On AMD hardware, a vertex shader reading a structured buffer or a vertex buffer from instance id would look at best identical and at worst extremely similar, but it is hard to tell on nVidia though. They still have a fast path for constant buffers that can outperform regular buffers too, but it is usually not worth the effort of a double implementation and maintenance, plus the size limitation.

Geometry: cube with 24 vertices, 36 indices, TriangleList
Number of instances: 512
Measured with: QueryPerformanceCounter
Test A: measure time spent binding vertex/index/instance buffers + DrawIndexedInstanced
Test B: measure time spent binding vertex/index buffers and the structured buffer shader resource view + DrawIndexedInstanced
 

18 hours ago, turanszkij said:

I had a similar experience as you with Nvidia, I wrote a blog about it: https://turanszkij.wordpress.com/2017/06/05/should-we-get-rid-of-vertex-buffers/

AMD was quite the same as using the input assembler.

Very interesting... Your results match my own findings.
And thanks for the reply; it feels good not to be alone with such an iffy problem.

30 minutes ago, chlerub said:

Geometry: cube with 24 vertices, 36 indices, TriangleList
Number of instances: 512
Measured with: QueryPerformanceCounter
Test A: measure time spent binding vertex/index/instance buffers + DrawIndexedInstanced
Test B: measure time spent binding vertex/index buffers and the structured buffer shader resource view + DrawIndexedInstanced
 

You can't measure GPU time using QueryPerformanceCounter. All you've done is measure how long it takes to issue the API calls, no?

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group
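To time the GPU side of the draw, D3D11 timestamp queries can be used instead of QueryPerformanceCounter. The sketch below assumes an existing `device`/`context` and the cube test from above (36 indices, 512 instances); error handling and Release calls are omitted:

```cpp
// GPU-side timing with D3D11 timestamp queries.
D3D11_QUERY_DESC desc = {};
desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
ID3D11Query *disjoint, *tsBegin, *tsEnd;
device->CreateQuery(&desc, &disjoint);
desc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&desc, &tsBegin);
device->CreateQuery(&desc, &tsEnd);

context->Begin(disjoint);
context->End(tsBegin);                        // timestamp before the draw
context->DrawIndexedInstanced(36, 512, 0, 0, 0);
context->End(tsEnd);                          // timestamp after the draw
context->End(disjoint);

// Results arrive a frame or two later; poll until ready.
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
UINT64 begin = 0, end = 0;
while (context->GetData(disjoint, &dj, sizeof(dj), 0) != S_OK) {}
context->GetData(tsBegin, &begin, sizeof(begin), 0);
context->GetData(tsEnd, &end, sizeof(end), 0);
if (!dj.Disjoint)
{
    double gpuMs = double(end - begin) / double(dj.Frequency) * 1000.0;
}
```

In practice you would buffer the queries per frame rather than spin-waiting, so the readback does not stall the pipeline.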

Hm, yeah, I guess so. But when the API calls alone make such a huge difference... oh well.

This topic is closed to new replies.
