Hi guys,
I'm writing a simple compute shader in DirectX 11 (Shader Model 5) that stores a float4 color into groupshared memory per thread and then reads it back.
From my understanding of MSDN, the store_structured instruction (and likewise ld_structured) can write up to four 32-bit components at once.
Quote: "This instruction performs 1-4 component *32bit components written from src0 to dst0 at the address in dstAddress and dstByteOffset."
Therefore I would expect one float4 write to translate into a single store_structured instruction.
However, in my simple shader it translates into four store_structured instructions!
Code:
groupshared float4 ColorQuad[4][4][64];
...
ColorQuad[x][y][shared_index] = float4(0.1f, 0.2f, 0.3f, 0.4f);
...
float4 color = ColorQuad[x][y][shared_index];
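For context, a minimal self-contained version of the shader might look like the sketch below. Only the ColorQuad declaration, the store, and the load come from my actual code; the entry point name CSMain, the Output buffer, the group size, and the way x, y and shared_index are derived are placeholders I made up for illustration. I have not checked that this exact sketch produces identical disassembly.

```hlsl
// Hypothetical minimal repro. Only the ColorQuad declaration, the store
// and the load are from the real shader; everything else is a placeholder.
groupshared float4 ColorQuad[4][4][64]; // 4 * 4 * 64 * 16 bytes = 16 KB

// Placeholder output so the loaded value is actually used.
RWStructuredBuffer<float4> Output : register(u0);

[numthreads(64, 1, 1)] // placeholder group size
void CSMain(uint groupIndex : SV_GroupIndex, uint3 groupId : SV_GroupID)
{
    // Placeholder index derivation; the real shader computes these differently.
    uint x = groupId.x & 3;         // 0..3
    uint y = groupId.y & 3;         // 0..3
    uint shared_index = groupIndex; // 0..63, one slot per thread

    // One float4 write per thread...
    ColorQuad[x][y][shared_index] = float4(0.1f, 0.2f, 0.3f, 0.4f);

    // ...and one float4 read back from the same slot.
    float4 color = ColorQuad[x][y][shared_index];

    Output[groupId.x * 64 + groupIndex] = color;
}
```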
This compiles to the following:
dcl_tgsm_structured g0, 4096, 4
...
mov r1.x, r0.w
imul null, r1.y, r0.x, l(16)
imad r1.y, r0.z, l(1024), r1.y
store_structured g0.x, r1.x, r1.y, l(0.100000) // ColorQuad<0>
iadd r1.z, r1.y, l(4)
store_structured g0.x, r1.x, r1.z, l(0.200000) // ColorQuad<0>
iadd r1.z, r1.y, l(8)
store_structured g0.x, r1.x, r1.z, l(0.300000) // ColorQuad<0>
iadd r1.y, r1.y, l(12)
store_structured g0.x, r1.x, r1.y, l(0.400000) // ColorQuad<0>
...
imad r1.y, r0.z, l(1024), r1.y
ld_structured r2.x, r1.x, r1.y, g0.xxxx // ColorQuad<0:Inf>
iadd r1.z, r1.y, l(4)
ld_structured r2.y, r1.x, r1.z, g0.xxxx // ColorQuad<1:Inf>
iadd r1.z, r1.y, l(8)
ld_structured r2.z, r1.x, r1.z, g0.xxxx // ColorQuad<2:Inf>
iadd r1.y, r1.y, l(12)
ld_structured r2.w, r1.x, r1.y, g0.xxxx // ColorQuad<3:Inf>
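As far as I can tell from the declaration and the address arithmetic, the compiler treats ColorQuad as 4 structures of 4096 bytes each (presumably one per x) and addresses every 32-bit component separately. Below is a rough sketch of the byte offset it seems to compute. ComponentByteOffset is a name I made up, and the mapping of registers to y and shared_index is my guess from the constants 1024 and 16.

```hlsl
// Hypothetical helper mirroring the address math in the disassembly.
// Assumed: structure index (r1.x) = x, stride = 4 * 64 * 16 = 4096 bytes.
uint ComponentByteOffset(uint y, uint shared_index, uint component)
{
    return y * 1024u           // 64 float4 per y, 16 bytes each
         + shared_index * 16u  // one float4 per slot
         + component * 4u;     // .x, .y, .z, .w at +0, +4, +8, +12
}
```

Each of the four store_structured / ld_structured instructions then corresponds to one value of component.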
Now I'm very confused about why this is happening.
Am I misreading MSDN, and it actually means the instruction writes 1 x 32 bits (that is, 4 x 8 bits) of data?
Or is this a compiler optimization to avoid bank conflicts?
Thanks for any explanation!