Constant buffer and names?

Started by September 23, 2017 12:12 AM
9 comments, last by noodleBowl 7 years, 4 months ago

I've gotten to the part in my DirectX 11 project where I need to pass the MVP matrices to my vertex shader, and I'm a little lost when it comes to using a constant buffer with the vertex shader.

I understand I need to set up the constant buffer just like any other buffer:


1. Create a buffer description with the D3D11_BIND_CONSTANT_BUFFER flag
2. Map my matrix data into the constant buffer
3. Use VSSetConstantBuffers to actually use the buffer

But I get lost at the vertex shader part: how does my vertex shader know to use this constant buffer when we get to the shader side of things?

In the example I'm following, this is their vertex shader, but I don't understand how the shader knows to use the MatrixBuffer cbuffer. They just use the members directly. What if there were multiple cbuffer declarations, like the Microsoft documentation says you can have?


//Inside vertex shader
cbuffer MatrixBuffer
{
    matrix worldMatrix;
    matrix viewMatrix;
    matrix projectionMatrix;
};

struct VertexInputType
{
    float4 position : POSITION;
    float4 color : COLOR;
};

struct PixelInputType
{
    float4 position : SV_POSITION;
    float4 color : COLOR;
};

PixelInputType ColorVertexShader(VertexInputType input)
{
    PixelInputType output;
    

    // Change the position vector to be 4 units for proper matrix calculations.
    input.position.w = 1.0f;

    // Calculate the position of the vertex against the world, view, and projection matrices.
    output.position = mul(input.position, worldMatrix);
    output.position = mul(output.position, viewMatrix);
    output.position = mul(output.position, projectionMatrix);
    
    // Store the input color for the pixel shader to use.
    output.color = input.color;
    
    return output;
}

 

You'll notice that the VS/GS/PS SetConstantBuffers functions take a start slot argument and an input array of constant buffer pointers.

When a shader is compiled, the compiler assigns a constant buffer slot to each of the cbuffers in the shader. The example shader you posted makes it easy because there is only one constant buffer, which means that it's assigned to slot/register 0. D3D11 exposes 14 constant buffer slots per shader stage (b0 through b13), but you shouldn't need that many.

I personally do not trust automatic register assignment at all! I believe it assigns them from top-to-bottom, but if you are paranoid like me you can set the constant buffer slot like so:


cbuffer MatrixBuffer : register(b0)
{
    matrix worldMatrix;
    matrix viewMatrix;
    matrix projectionMatrix;
};

where MatrixBuffer is now assigned to constant buffer slot/register 0.

Otherwise, the mapping to variables in the constant buffer itself is based on the byte data that you map into the constant buffer. I suggest that you read this article about constant buffer packing rules. You have to make sure that the byte data you map in matches the corresponding data types in your shader, and that your types follow the 4-byte alignment and 16-byte boundary rules.

Feel free to ask more questions about this because it can cause wacky behavior if you aren't aware of it. For instance if you have a constant buffer like this:


cbuffer MatrixBuffer
{
    float2 SomeData;
    float4 SomeOtherData;
};

You will need to map in 8 floats of data in total: two for the opening float2, two garbage floats to pad out the rest of the first 16-byte (4-float) block, and then four floats for the float4. The two padding floats are required so that the float4 data ends up where the shader expects it.
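As a rough illustration of that 8-float layout, the C++ side could mirror the cbuffer with a struct that spells out the padding. The struct name, member names and the FloatsToUpload helper are made up for this sketch, and it assumes a 4-byte float:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical CPU-side mirror of:
//   cbuffer MatrixBuffer { float2 SomeData; float4 SomeOtherData; };
struct MatrixBufferCpu
{
    float someData[2];       // float2, bytes 0..7
    float padding[2];        // bytes 8..15, never read by the shader
    float someOtherData[4];  // float4, bytes 16..31
};

static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");
static_assert(offsetof(MatrixBufferCpu, someOtherData) == 16,
              "the float4 must start on a 16-byte boundary");
static_assert(sizeof(MatrixBufferCpu) == 32, "8 floats in total");

// Number of floats that get copied into the mapped constant buffer.
constexpr std::size_t FloatsToUpload()
{
    return sizeof(MatrixBufferCpu) / sizeof(float);
}
```

Copying a struct like this into the mapped buffer in one go keeps the padding in the right place without having to count floats by hand.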

Normally we explicitly define the register slots.

So for const buffers you would do:


 

cbuffer MyBuffer0 : register(b0)
{
// Declarations..
};

cbuffer MyBuffer1 : register(b1)
{
// Declarations..
};

cbuffer MyBuffer2 : register(b2)
{
// Declarations..
};

If you do not explicitly specify the register slots, the compiler will assign them for you and you have to retrieve them via HLSL reflection (which is cumbersome and error-prone).

When you call VSSetConstantBuffers( 0, ... ) the 0 will correspond to MyBuffer0, and VSSetConstantBuffers( 1, .. ) will correspond to MyBuffer1, etc.
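To make the slot mapping concrete, here is a small mock in plain C++. It is not the D3D11 API: FakeBuffer and FakeVertexStage are invented for this sketch, and the slot count is arbitrary. It only models how the start slot plus the index in the array picks the bN register:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-in for a real buffer object; only its identity matters in this mock.
struct FakeBuffer { const char* debugName; };

// Arbitrary size for the mock; the real API has its own slot limit.
constexpr std::size_t kMockSlotCount = 4;

// Mock of the vertex shader stage: slots[n] is what register(bn) would read.
struct FakeVertexStage
{
    std::array<const FakeBuffer*, kMockSlotCount> slots{};

    // Same argument order as VSSetConstantBuffers: start slot, count, array.
    void SetConstantBuffers(std::size_t startSlot, std::size_t numBuffers,
                            const FakeBuffer* const* buffers)
    {
        assert(startSlot + numBuffers <= kMockSlotCount);
        for (std::size_t i = 0; i < numBuffers; ++i)
            slots[startSlot + i] = buffers[i];  // buffers[i] lands in b(startSlot + i)
    }
};
```

Binding one buffer at start slot 0 fills b0, and binding an array of two buffers at start slot 1 fills b1 and b2 in a single call.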

 

In the case of your buffer:


cbuffer MatrixBuffer
{
    matrix worldMatrix;
    matrix viewMatrix;
    matrix projectionMatrix;
};

If the buffer you bind via VSSetConstantBuffers is smaller than the 192 bytes required for this structure (4x4 floats x 4 bytes per float x 3 matrices), the debug layer will complain, but you are guaranteed that reading const buffers out of bounds will return 0.
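The 192-byte figure can be checked on the C++ side with a struct that mirrors the cbuffer. Matrix4x4, MatrixBufferData and CoversMatrixBuffer are hypothetical names for this sketch (a real project would use its math library's matrix type), and it assumes 4-byte floats:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 4x4 float matrix, 64 bytes.
struct Matrix4x4 { float m[4][4]; };

// CPU-side mirror of cbuffer MatrixBuffer (world, view, projection).
struct MatrixBufferData
{
    Matrix4x4 world;
    Matrix4x4 view;
    Matrix4x4 projection;
};

static_assert(sizeof(Matrix4x4) == 64, "16 floats x 4 bytes");
static_assert(sizeof(MatrixBufferData) == 192, "3 matrices x 64 bytes");

// True if a bound buffer of this size covers everything the shader declares.
constexpr bool CoversMatrixBuffer(std::size_t boundSizeInBytes)
{
    return boundSizeInBytes >= sizeof(MatrixBufferData);
}
```

Using sizeof(MatrixBufferData) for the ByteWidth of the buffer description keeps the buffer size and the shader declaration in step, as long as the struct is updated whenever the cbuffer changes.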

22 hours ago, BrentMorris said:

I believe it assigns them from top-to-bottom, but if you are paranoid like me you can set the constant buffer slot like so:

Indeed, the compiler will assign resource slots in increasing order based on where the resource was declared in the file. However, the catch here is that it will only do this for resources that are actually used by the shader program. So if you have 3 textures in a row, but you only use the first and third, the first texture will be assigned to t0 and the third will be assigned to t1! In those cases the only reliable way to bind things correctly is to use the reflection APIs to query the slot for each resource.
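A toy model of that behaviour, in plain C++, might look like this. It only reproduces the rule as described in this post (used resources get increasing slots in declaration order, unused ones get none); it is not the compiler's actual algorithm, and ResourceDecl and AssignSlotsInDeclarationOrder are names invented for the sketch:

```cpp
#include <cassert>
#include <string>
#include <vector>

// One resource declaration in the shader file, in source order.
struct ResourceDecl
{
    std::string name;
    bool usedByShader;
};

// Returns the slot for each declaration, or -1 if it is unused and gets no slot.
std::vector<int> AssignSlotsInDeclarationOrder(const std::vector<ResourceDecl>& decls)
{
    std::vector<int> slots;
    slots.reserve(decls.size());
    int nextSlot = 0;
    for (const ResourceDecl& decl : decls)
        slots.push_back(decl.usedByShader ? nextSlot++ : -1);
    return slots;
}
```

With three textures declared in a row and only the first and third used, this gives slot 0 to the first and slot 1 to the third, which is exactly the surprise described above.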

Awesome! When I was reading those docs I wasn't really sure what those register declarations were for. This definitely made it clearer. Thanks!

When it comes to the packing, can you explain what is meant by the 16-byte boundary rules? Are you saying that in addition to my data being 4-byte aligned, it must be divisible by 16 too?

Example:


//Example cbuffer
cbuffer test : register(b0)
{
   float a;
   float b;
}

//Above cbuffer is 4 byte aligned. 4 bytes per float * 2 floats = 8 bytes total; 8 mod 4 = 0 so meets being 4 byte aligned
//BUT does not meet 16-byte boundary rules
//4 bytes per float * 2 floats = 8 bytes total; 8 mod 16 = 8 so does not meet 16-byte boundary rules

So to have the above meet the 16-byte boundary rules, do I need to add 2 "garbage" floats, which would make the total bytes of the cbuffer divisible by 16?

No. That declaration is just fine.


What the alignment means is that if you've got:


float3 a;
float2 b;

Then the address of b when you write the data from C++ starts at 0x00000010 instead of 0x0000000C, because there are 4 bytes of padding between a and b.

Please read the msdn article BrentMorris left you. It has plenty of examples on how the padding works.
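The offset in the float3/float2 example above can be reproduced with a small calculator. This is a simplified sketch that assumes only 32-bit scalars and vectors (1 to 4 components each); arrays, structs and matrices follow extra rules that are not modelled here, and PackedOffsets is a made-up helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Byte offsets for a run of 32-bit scalar/vector members, given the number
// of components of each member in declaration order.
std::vector<std::size_t> PackedOffsets(const std::vector<std::size_t>& componentCounts)
{
    std::vector<std::size_t> offsets;
    std::size_t cursor = 0;
    for (std::size_t count : componentCounts)
    {
        assert(count >= 1 && count <= 4);
        const std::size_t size = count * 4;
        const std::size_t usedInBlock = cursor % 16;
        // A member may not straddle a 16-byte boundary: if it does not fit
        // in the rest of the current block, it starts at the next block.
        if (usedInBlock + size > 16)
            cursor += 16 - usedInBlock;
        offsets.push_back(cursor);
        cursor += size;
    }
    return offsets;
}
```

Calling PackedOffsets({3, 2}) puts the float2 at byte 16 (0x10), while two plain floats, PackedOffsets({1, 1}), sit at bytes 0 and 4 with no padding at all.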

Went back and re-read the packing article. Totally missed that part about how things are auto placed into 4 slot vectors and bumped to the next one if it does not fit entirely.

So it does make sense that my example was fine, since it was only 2 floats. It also explains why the example @Matias Goldberg gave needs two of these 4-component vectors: the float3 fits into the first vector, but the next variable, a float2, cannot be completely contained in what is left of the first vector, so it gets placed into the next one.

One thing is still shady for me and that is the 16 byte boundary rule. I really don't understand what the article means by it. Are we just placing our variables in 16-byte blocks?

In the article they have


//2 x 16byte elements
cbuffer IE
{
  float1 val1;
  float1 val2;
  float1 val3;
  float2 val4;
}

//3 float1 x 4 bytes = 12
//1 float2 x 4 bytes =  8
//___________________  20 bytes total
//First 16 bytes placed into a container. Next 4 bytes bumped into the next 16 byte container?

Is this idea right?

In DXBC assembly, constant buffers are made up of "elements" that are 16 bytes wide. So a constant buffer will always be made up of N elements, where the total size is then 16 * N bytes. This is why you have to create your constant buffers with a size rounded up to the next multiple of 16 bytes when you call CreateBuffer(). This is also the reason for trying to pack vector types so that they don't cross 16-byte element boundaries. DXBC is basically a virtual ISA that works in terms of 4-component vectors, which means that registers and instructions can typically work with 4 values at a time. This applies to constant buffers as well, where each element is a 16-byte value that can be treated as a 4-component vector, and can be used in instructions as if it were a register. As an example, let's look at a simple shader and its resulting DXBC output from the compiler:


cbuffer MyConstants
{
    float4 MyValue;
};

float4 PSMain() : SV_Target0
{
    return MyValue * 8.0f;
}

// ps_5_0
// dcl_globalFlags refactoringAllowed
// dcl_constantbuffer CB0[1], immediateIndexed
// dcl_output o0.xyzw
// mul o0.xyzw, cb0[0].xyzw, l(8.000000, 8.000000, 8.000000, 8.000000)
// ret

You'll see that the whole program is really just a single instruction, which basically says "multiply the first float4 element from the constant buffer by 8.0". Since "MyValue" is a float4 and lines up exactly with a constant buffer "element", the DXBC assembly can reference all of that data and multiply it with a single instruction. Now let's try another example where we split up "MyValue" so that it straddles a 16-byte boundary, which causes it to be located in two different constant buffer elements:


cbuffer MyConstants
{
    float3 SomeOtherValue;
    float MyValue_X;
    float3 MyValue_XYZ;
};

float4 PSMain() : SV_Target0
{
    return float4(MyValue_X, MyValue_XYZ) * 8.0f;
}

// ps_5_0
// dcl_globalFlags refactoringAllowed
// dcl_constantbuffer CB0[2], immediateIndexed
// dcl_output o0.xyzw
// mul o0.x, cb0[0].w, l(8.000000)
// mul o0.yzw, cb0[1].xxyz, l(0.000000, 8.000000, 8.000000, 8.000000)
// ret

In this case the compiler has to emit two separate instructions to perform the multiply, since the instruction can only use a single constant buffer element as an operand. 
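The cb0[n].xyzw operands in both listings follow directly from byte offsets. As a sketch (ElementRef, LocateInConstantBuffer and ElementCount are names invented here, and offsets are assumed to be multiples of 4), the mapping can be written as:

```cpp
#include <cassert>
#include <cstddef>

// Which 16-byte element, and which component of it, a 4-byte value lands in.
struct ElementRef
{
    std::size_t element;  // n in cb0[n]
    char component;       // 'x', 'y', 'z' or 'w'
};

// Maps a byte offset (assumed to be a multiple of 4) to cb0[element].component.
constexpr ElementRef LocateInConstantBuffer(std::size_t byteOffset)
{
    return ElementRef{ byteOffset / 16, "xyzw"[(byteOffset % 16) / 4] };
}

// Number of 16-byte elements needed to hold usedBytes of data, i.e. the N in
// dcl_constantbuffer CB0[N]; the buffer size is then N * 16 bytes.
constexpr std::size_t ElementCount(std::size_t usedBytes)
{
    return (usedBytes + 15) / 16;
}
```

In the second shader, MyValue_X sits at byte 12, which is cb0[0].w, and MyValue_XYZ starts at byte 16, which is cb0[1].x. The 28 bytes of data need 2 elements, matching the CB0[2] declaration, so the buffer would be created with 32 bytes.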

Do keep in mind that this is all rather specific to the particulars of DXBC's virtual ISA, which can be (and very often is) very different from the actual native instructions executed by the GPU. For example, Nvidia and AMD dropped the notion of vector instructions within a single execution thread long ago, and instead only work with scalar operations. So in that case a float4 multiply will always expand out to 4 individual instructions, and so it doesn't necessarily gain them anything to have the source data aligned to a 16-byte boundary in the constant buffer. The new open-source DirectX shader compiler (dxc) has a completely different (scalar) output format, and so they might even change the packing rules for that compiler in the future.

15 hours ago, MJP said:

Do keep in mind that this is all rather specific to the particulars of DXBC's virtual ISA, which can be (and very often is) very different from the actual native instructions executed by the GPU. For example, Nvidia and AMD dropped the notion of vector instructions within a single execution thread long ago, and instead only work with scalar operations. So in that case a float4 multiply will always expand out to 4 individual instructions, and so it doesn't necessarily gain them anything to have the source data aligned to a 16-byte boundary in the constant buffer. The new open-source DirectX shader compiler (dxc) has a completely different (scalar) output format, and so they might even change the packing rules for that compiler in the future.

Since I think that sounds confusing to a beginner, I'll translate it to plain English: modern GPUs no longer work like that (they don't need such crazy alignments... for the most part; there are a few exceptions not worth mentioning right now), but we're stuck with these overly conservative alignments.

I see, I more or less understand what's going on now. Thanks for the info guys!
