Dynamic indexing into constant buffer for instancing

Started by October 31, 2020 09:05 AM
10 comments, last by DaCDR 4 years, 3 months ago

Hi,

I have implemented instancing in my D3D11 engine by using constant buffers:

struct CInstance
{
	float4x3 WorldMatrix;
};
		
cbuffer InstanceBuffer: register(b0)
{
	CInstance Instance[2];
};

The maximum number of instances per draw call is not known at compile time, because it depends on the primitives currently visible on screen (the engine dynamically fills the constant buffer with Map/Unmap). To force an indexed access in the HLSL shader, I've declared a minimum of 2 instances.

Until now, this has worked very well on multiple computers (AMD, Nvidia and Intel graphics cards), but now I am receiving error reports from some of my customers where objects are missing on screen.

It seems that the "indexing hack" does not work on all graphics card/driver combinations (DX debug does not complain). I thought that indexing always works as long as the underlying buffer is big enough (which it is), but then I've found the following statement in the DirectX specs: https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm#7.5%20Constant%20Buffers

If the constant buffer bound to a slot is larger than the size declared in the shader for that slot, implementations are allowed to return incorrect data (not necessarily 0) for indices that are larger than the declared size but smaller than the buffer size.

So my question is, how can I implement dynamic instancing, where the number of instances is determined at runtime? Should I always declare the maximum possible instance count in the shader? This seems like a performance issue to me if I upload a full 64k constant buffer for only 2 or 3 instances.
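
For a sense of scale, the arithmetic behind the 64k figure can be sketched in C++. This assumes the default column-major packing, where a float4x3 occupies three 16-byte registers (48 bytes), and the D3D11 limit of 4096 float4 constants per constant buffer. The constant and function names are made up for this illustration.

```cpp
#include <cstddef>

// D3D11 allows at most 4096 float4 constants (16 bytes each) per constant buffer.
constexpr std::size_t kMaxCBufferBytes = 4096 * 16; // 65536 bytes

// Assumption: default column-major packing, so one float4x3 takes
// three float4 registers inside a cbuffer.
constexpr std::size_t kInstanceBytes = 3 * 16; // 48 bytes

// Number of whole CInstance entries that fit into a buffer of the given size.
constexpr std::size_t MaxInstances(std::size_t bufferBytes)
{
    return bufferBytes / kInstanceBytes;
}

// A full-size constant buffer holds 1365 instances under these assumptions.
static_assert(MaxInstances(kMaxCBufferBytes) == 1365, "unexpected capacity");
```

Under these assumptions, 2 or 3 instances use 96 to 144 bytes of the 65536 bytes that a maximum-size declaration would cover.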

Kind regards

For this, you want to use a buffer whose size is not known at compile time. Your choices are:

  • StructuredBuffer<CInstance>
  • Buffer<float4>
  • ByteAddressBuffer
  • You can also provide instance data in the input layout by using D3D11_INPUT_PER_INSTANCE_DATA instead of D3D11_INPUT_PER_VERTEX_DATA

For example using a structured buffer:

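StructuredBuffer<CInstance> instanceBuffer : register(t0);

// ViewProjectionMatrix is assumed to be a float4x4 declared elsewhere, e.g. in a per-frame cbuffer.
float4 main(float3 pos : POSITION, uint instanceID : SV_InstanceID) : SV_Position
{
	float4x3 worldMatrix = instanceBuffer[instanceID].WorldMatrix;
	// float4 * float4x3 yields the float3 world-space position.
	float3 worldPos = mul(float4(pos, 1), worldMatrix);
	return mul(float4(worldPos, 1), ViewProjectionMatrix);
}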

It doesn't get any clearer than this.

Hi,

thank you for your response. I will try using StructuredBuffer.

Just be aware that there unfortunately can be performance differences between constant buffers and the other buffer types depending on the hardware and your usage patterns. Pre-Turing Nvidia hardware has a special path for constant buffers that's optimized for coherent access (all threads within a warp have the same index), whereas the other buffers are better for incoherent/random access. On AMD it usually doesn't matter between cbuffer/StructuredBuffer/ByteAddressBuffer, those will mostly go through the same HW path. The “formatted” Buffer type is always special: that usually goes through the texture unit since it can do format conversion, so you generally don't want to use that for full fp32 data like a transform.

I've already tested StructuredBuffers, and my engine needs a little more refactoring to fully support them. I'm using instanced skinning, where each instance can have up to 256 world transforms, so a single instance may need more than the 2048 bytes allowed for a single structure. I think I have to split the instance data into multiple instance buffers.
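
One possible alternative to splitting, sketched here under assumptions: keep each structured buffer element a single transform (48 bytes, well under the 2048-byte stride limit) and address it with a flat index. The names kBonesPerInstance and FlatTransformIndex are hypothetical, and a fixed 256 slots per instance is assumed for simplicity. The same index arithmetic would be done in the shader from SV_InstanceID and the bone index.

```cpp
#include <cstddef>
#include <cstdint>

// D3D11 limits the stride of a structured buffer element to 2048 bytes.
constexpr std::size_t kMaxStructureStride = 2048;

// One float4x3 transform is 12 floats.
constexpr std::size_t kTransformBytes = 12 * sizeof(float); // 48 bytes

// Assumption: every instance reserves the full 256 transform slots.
constexpr std::uint32_t kBonesPerInstance = 256;

// All transforms of one instance in a single element would exceed the limit...
static_assert(kBonesPerInstance * kTransformBytes > kMaxStructureStride,
              "one element per instance does not fit");
// ...while one transform per element fits easily.
static_assert(kTransformBytes <= kMaxStructureStride,
              "one element per transform fits");

// Index of a bone transform in a flat array with one element per transform.
constexpr std::uint32_t FlatTransformIndex(std::uint32_t instanceID,
                                           std::uint32_t boneIndex)
{
    return instanceID * kBonesPerInstance + boneIndex;
}
```

The fixed slot count wastes memory for instances with fewer bones. A per-instance offset table would avoid that at the cost of one extra read.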

In the meantime, I'm thinking about going the following way with constant buffers:

  1. Declare the maximum array size in the shader
  2. Bind a constant buffer which might be smaller than the declared size (and only contains the data of the actual visible primitives)

Using this approach gives the following DX debug warning:

D3D11 WARNING: ID3D11DeviceContext::DrawIndexedInstanced: The size of the Constant Buffer at slot 3 of the Vertex Shader unit is too small (256 bytes provided, 768 bytes, at least, expected). This is OK, as out-of-bounds reads are defined to return 0. It is also possible the developer knows the missing data will not be used anyway. This is only a problem if the developer actually intended to bind a sufficiently large Constant Buffer for what the shader expects. [ EXECUTION WARNING #351: DEVICE_DRAW_CONSTANT_BUFFER_TOO_SMALL]

Is it guaranteed that this will always be OK on every driver/graphics card combination? This would be a feasible hotfix for me until I have implemented and tested StructuredBuffers.

You definitely cannot bind a smaller cbuffer than the shader expects, that's not guaranteed to be ok at all and could result in crashes.

@dacdr Hi,

  1. you can create a constant buffer with the maximum supported size == 65536 bytes
  2. during drawing, check the range, like this:

assert(InstanceCount <= ConstantBufferSize / sizeof(CInstance) && "Out Of Range");

ID3D11DeviceContext::DrawIndexedInstanced(…, InstanceCount,…);
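
The range check can also be turned into batching, so that a frame with more visible instances than the buffer holds is drawn in several calls instead of tripping the assert. This is a minimal sketch, assuming the shader declares the array at the full capacity. SplitIntoBatches and Batch are hypothetical names, and each batch would be uploaded and drawn with its own DrawIndexedInstanced call.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Batch
{
    std::uint32_t firstInstance; // index of the first instance in the visible list
    std::uint32_t instanceCount; // number of instances drawn by this batch
};

// Splits totalInstances into consecutive batches of at most `capacity` instances.
std::vector<Batch> SplitIntoBatches(std::uint32_t totalInstances, std::uint32_t capacity)
{
    std::vector<Batch> batches;
    if (capacity == 0)
        return batches; // nothing can be drawn with an empty buffer

    std::uint32_t first = 0;
    while (first < totalInstances)
    {
        // The last batch may be smaller than the capacity.
        const std::uint32_t count = std::min(capacity, totalInstances - first);
        batches.push_back({first, count});
        first += count;
    }
    return batches;
}
```

For example, 3000 visible instances with a capacity of 1365 would be split into three batches.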

MJP said:

Just be aware that there unfortunately can be performance differences between constant buffers and the other buffer types depending on the hardware and your usage patterns. Pre-Turing Nvidia hardware has a special path for constant buffers that's optimized for coherent access (all threads within a warp have the same index), whereas the other buffers are better for incoherent/random access. On AMD it usually doesn't matter between cbuffer/StructuredBuffer/ByteAddressBuffer, those will mostly go through the same HW path.

What about the difference in access performance between constant buffer instance data and D3D11_INPUT_PER_INSTANCE_DATA?

AndreyVK_D3D said:
you can create a constant buffer with the maximum supported size == 65536 bytes

That does not work, please see my first post. Some drivers do not return correct data for indices beyond the array size declared in the shader, even though the underlying constant buffer is large enough.

I'm using StructuredBuffers now, which don't need any hacks.

@AndreyVK_D3D

If you're pulling your instance data from a vertex buffer using the Input Assembler, then it depends on how that particular GPU implements the IA. For a while now AMD hasn't had any fixed-function hardware for vertex fetch, and simply converts your IA configuration to a vertex shader prologue (which means it will have similar performance to pulling the data manually through a StructuredBuffer or cbuffer). For hardware that does still have dedicated vertex fetch, it can have different performance characteristics from the other methods. However, I don't really recommend using the IA for instance data, since it's very inflexible. If you fetch the instance data yourself, you have room to do more complex things, which can allow you to structure your instance data to be more efficient for updates or culling.

This topic is closed to new replies.