I have run the tests on an AMD and an NVIDIA GPU. The Snapdragon 808 will have to wait, as setting up the scene for it will take some more time. I will also post the results for the GTX 1070 later.
Here you go:
Program: Wicked Engine Editor
API: DX11
Test scene: Sponza
- 3 shadow cascades (2D) - 3 scene render passes
- 1 spotlight shadow (2D) - 1 scene render pass
- 4 pointlight shadows (Cubemap) - 4 scene render passes
- Z prepass - 1 scene render pass
- Opaque pass - 1 scene render pass
Timing method: DX11 timestamp queries
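For anyone who wants to reproduce the numbers: a DX11 timestamp query returns raw GPU ticks, and the disjoint query supplies the tick frequency plus a flag saying whether the interval can be trusted. Below is a minimal sketch of the conversion to milliseconds. The struct DisjointData and the function ElapsedMs are hypothetical names for illustration; the struct only mirrors the Frequency and Disjoint fields of D3D11_QUERY_DATA_TIMESTAMP_DISJOINT so that it compiles without the Windows SDK. This is not the engine's actual code.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for D3D11_QUERY_DATA_TIMESTAMP_DISJOINT
// (same two pieces of information: Frequency and Disjoint).
struct DisjointData {
    std::uint64_t frequency; // GPU timestamp ticks per second
    bool disjoint;           // true if the timestamps in this interval are unreliable
};

// Converts a begin/end timestamp pair into milliseconds.
// Returns std::nullopt when the measurement should be thrown away.
std::optional<double> ElapsedMs(std::uint64_t beginTicks,
                                std::uint64_t endTicks,
                                const DisjointData& dj)
{
    if (dj.disjoint || dj.frequency == 0 || endTicks < beginTicks) {
        return std::nullopt;
    }
    const double ticks = static_cast<double>(endTicks - beginTicks);
    return ticks * 1000.0 / static_cast<double>(dj.frequency);
}
```

In a real frame the two tick values would come from GetData on a pair of D3D11_QUERY_TIMESTAMP queries placed around a pass, and a frame whose disjoint flag is set would simply be skipped.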
Methods:
- InputLayout : The default hardware vertex buffer usage with CPU side input layout declarations. The instance buffers are bound as vertex buffers with each render call.
- CustomFetch (typed buffer): Vertex buffers are bound as shader resource views with DXGI_FORMAT_R32G32B32A32_FLOAT format. Instance buffers are bound as Structured Buffers holding a 4x4 matrix each.
- CustomFetch (RAW buffer 1): Vertex buffers are created with the D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS misc flag and bound as shader resource views. In the shader, the buffers are addressed by byte offset from the beginning of the buffer. Instance buffers are bound as Structured Buffers holding a 4x4 matrix each.
- CustomFetch (RAW buffer 2): Like RAW buffer 1, but the instancing data is also fetched from raw buffers instead of structured buffers.
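To make the byte-offset addressing of the raw buffer variants concrete, here is a small CPU-side sketch of what the vertex shader fetch amounts to: the offset is the vertex index times the stride, and 16 bytes are read from there as a float4. All names (Float4, LoadFloat4, FetchPosition, PackVertices) are hypothetical. Returning zeros for an out-of-range or misaligned offset is an assumption of this sketch, not a statement about what every driver does.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Float4 { float x, y, z, w; };

// Copies an array of float4 vertices into a plain byte buffer,
// standing in for the contents of a raw vertex buffer.
std::vector<std::uint8_t> PackVertices(const std::vector<Float4>& verts)
{
    std::vector<std::uint8_t> raw(verts.size() * sizeof(Float4));
    if (!raw.empty()) {
        std::memcpy(raw.data(), verts.data(), raw.size());
    }
    return raw;
}

// CPU stand-in for a 16-byte raw load reinterpreted as float4.
// Assumption: out-of-range or non-4-byte-aligned offsets yield zeros.
Float4 LoadFloat4(const std::vector<std::uint8_t>& raw, std::size_t byteOffset)
{
    Float4 v{};
    if (byteOffset % 4 != 0 || byteOffset + sizeof(Float4) > raw.size()) {
        return v;
    }
    std::memcpy(&v, raw.data() + byteOffset, sizeof(Float4));
    return v;
}

// Custom fetch of a position: byte offset = vertex index * stride.
Float4 FetchPosition(const std::vector<std::uint8_t>& raw, std::uint32_t vertexID)
{
    constexpr std::size_t stride = sizeof(Float4); // 16 bytes per float4
    return LoadFloat4(raw, static_cast<std::size_t>(vertexID) * stride);
}
```

The typed buffer variant differs only in that the hardware does the index-to-address step and the format conversion, so the shader indexes by element instead of by byte.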
ShadowPass and ZPrepass: These use at most 3 buffers:
- position (float4)
- UV (float4) // only for alpha-tested geometry
- instance buffer
OpaquePass: This uses 6 buffers:
- position (float4)
- normal (float4)
- UV (float4)
- previous frame position VB (float4)
- instance buffer (float4x4)
- previous frame instance buffer (float4x3)
RESULTS:
GPU             Method                      ShadowPass  ZPrepass  OpaquePass  All GPU
NVIDIA GTX 960  InputLayout                 4.52 ms     0.37 ms   6.12 ms     15.68 ms
NVIDIA GTX 960  CustomFetch (typed buffer)  18.89 ms    1.31 ms   8.68 ms     33.58 ms
NVIDIA GTX 960  CustomFetch (RAW buffer 1)  18.29 ms    1.35 ms   8.62 ms     33.03 ms
NVIDIA GTX 960  CustomFetch (RAW buffer 2)  18.42 ms    1.32 ms   8.61 ms     33.18 ms
AMD RX 470      InputLayout                 7.43 ms     0.29 ms   3.06 ms     14.01 ms
AMD RX 470      CustomFetch (typed buffer)  7.41 ms     0.31 ms   3.12 ms     14.08 ms
AMD RX 470      CustomFetch (RAW buffer 1)  7.50 ms     0.29 ms   3.07 ms     14.09 ms
AMD RX 470      CustomFetch (RAW buffer 2)  7.56 ms     0.28 ms   3.09 ms     14.15 ms
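To put the gap into relative terms, the timings above can be divided by the InputLayout baseline for the same GPU. A minimal sketch, with the numbers copied from the results above; the Timing struct, the Slowdown helper and the constant names are hypothetical and only exist for this illustration.

```cpp
// One row of the results: per-pass GPU times in milliseconds.
struct Timing {
    double shadowMs;
    double zPrepassMs;
    double opaqueMs;
    double allGpuMs;
};

// Ratio of a custom fetch time to the InputLayout baseline (1.0 = no change).
constexpr double Slowdown(double customMs, double baselineMs)
{
    return customMs / baselineMs;
}

// Values copied from the results above.
constexpr Timing kGtx960InputLayout{4.52, 0.37, 6.12, 15.68};
constexpr Timing kGtx960TypedBuffer{18.89, 1.31, 8.68, 33.58};
constexpr Timing kRx470InputLayout{7.43, 0.29, 3.06, 14.01};
constexpr Timing kRx470TypedBuffer{7.41, 0.31, 3.12, 14.08};
```

With these values, Slowdown(kGtx960TypedBuffer.shadowMs, kGtx960InputLayout.shadowMs) comes out at about 4.18 and the All GPU ratio at about 2.14 on the GTX 960, while the same All GPU ratio on the RX 470 is about 1.005, which is within what I would consider measurement noise.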
I have attached a txt file that is easier to read.
This is quite painful for me, because I wanted to implement some features that require custom fetching, but seeing how slowly it runs on NVIDIA (roughly 4x slower in the shadow pass on the GTX 960), it seems like wasted effort.
By the way, to implement this quickly, I bound my vertex buffers to texture slots 30 and up. Could that affect performance?
Side note: It seems that the CPU time is also higher this way, because VSSetShaderResources takes longer than IASetVertexBuffers. :(