
float4 read/write as 4 instructions instead of 1?

Started by January 17, 2018 09:15 PM
1 comment, last by galop1n 7 years ago

Hi guys,

I'm writing a simple compute shader in DirectX 11 (Shader Model 5) that stores a float4 color into groupshared memory per thread and then reads it back.
From my understanding of MSDN, the store_structured instruction can write up to 4 x 32-bit components at once (and ld_structured can likewise read up to 4).

Quote

"This instruction performs 1-4 component *32bit components written from src0 to dst0 at the address in dstAddress and dstByteOffset."

Therefore I would expect one float4 write to translate into one store_structured instruction.
However, in my simple shader it translates into 4 store_structured instructions!


Code:


groupshared float4 ColorQuad[4][4][64];

...

ColorQuad[x][y][shared_index] = float4(0.1f, 0.2f, 0.3f, 0.4f);

...

float4 color = ColorQuad[x][y][shared_index];

This code is compiled into this:
 


dcl_tgsm_structured g0, 4096, 4

...

mov r1.x, r0.w
imul null, r1.y, r0.x, l(16)
imad r1.y, r0.z, l(1024), r1.y
store_structured g0.x, r1.x, r1.y, l(0.100000)  // ColorQuad<0>
iadd r1.z, r1.y, l(4)
store_structured g0.x, r1.x, r1.z, l(0.200000)  // ColorQuad<0>
iadd r1.z, r1.y, l(8)
store_structured g0.x, r1.x, r1.z, l(0.300000)  // ColorQuad<0>
iadd r1.y, r1.y, l(12)
store_structured g0.x, r1.x, r1.y, l(0.400000)  // ColorQuad<0>

...

imad r1.y, r0.z, l(1024), r1.y
ld_structured r2.x, r1.x, r1.y, g0.xxxx  // ColorQuad<0:Inf>
iadd r1.z, r1.y, l(4)
ld_structured r2.y, r1.x, r1.z, g0.xxxx  // ColorQuad<1:Inf>
iadd r1.z, r1.y, l(8)
ld_structured r2.z, r1.x, r1.z, g0.xxxx  // ColorQuad<2:Inf>
iadd r1.y, r1.y, l(12)
ld_structured r2.w, r1.x, r1.y, g0.xxxx  // ColorQuad<3:Inf>
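
For reference, the addressing in the listing above can be modelled on the CPU. This is only a sketch: it assumes r0.w holds x, r0.z holds y and r0.x holds shared_index (the part of the shader that sets those registers is elided), and the names TgsmAddress, addressOf and flatByteAddress are made up for illustration. Under those assumptions, dcl_tgsm_structured g0, 4096, 4 declares 4 structures of 4096 bytes, one per value of the first array index, and each of the four stores targets one 4-byte component inside the same 16-byte float4 slot.

```cpp
#include <cstdint>

// Sizes implied by `groupshared float4 ColorQuad[4][4][64]` and
// `dcl_tgsm_structured g0, 4096, 4`.
constexpr uint32_t kFloat4Size   = 16;                 // 4 x 32-bit components
constexpr uint32_t kRowStride    = 64 * kFloat4Size;   // 1024, the l(1024) in imad
constexpr uint32_t kStructStride = 4 * kRowStride;     // 4096, the declared stride
constexpr uint32_t kStructCount  = 4;                  // the declared structure count

// Hypothetical helper type: the (structure index, byte offset) pair that
// store_structured / ld_structured take as operands.
struct TgsmAddress {
    uint32_t structIndex;
    uint32_t byteOffset;
};

// Assumed register mapping: x -> r0.w, y -> r0.z, sharedIndex -> r0.x.
// component is 0..3 and corresponds to the l(4), l(8), l(12) adds.
constexpr TgsmAddress addressOf(uint32_t x, uint32_t y,
                                uint32_t sharedIndex, uint32_t component) {
    return TgsmAddress{x, y * kRowStride + sharedIndex * kFloat4Size + component * 4u};
}

// Byte address in one flat view of the whole groupshared array.
constexpr uint32_t flatByteAddress(TgsmAddress a) {
    return a.structIndex * kStructStride + a.byteOffset;
}

// The whole array is 4 * 4 * 64 float4 values = 16384 bytes.
static_assert(kStructStride * kStructCount == 4u * 4u * 64u * kFloat4Size,
              "declared stride * count must cover the whole array");

// The structured address matches plain row-major indexing of ColorQuad[1][2][3].
static_assert(flatByteAddress(addressOf(1, 2, 3, 0)) ==
                  ((1u * 4u + 2u) * 64u + 3u) * kFloat4Size,
              "structured addressing should agree with row-major indexing");
```

For example, addressOf(1, 2, 3, 0) through addressOf(1, 2, 3, 3) give byte offsets 2096, 2100, 2104 and 2108 within structure 1, which is the same contiguous 16-byte range a single 4-component store would cover.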

Now I'm very confused about why this is happening.
Am I misunderstanding MSDN, and it actually means the instruction can only write 1 x 32 bits (or 4 x 8 bits) of data?
Or is this a compiler optimization to avoid bank conflicts?

Thanks for any explanation!

The DXBC bytecode does not matter much compared to the final uCode (the GPU-specific machine code your driver generates from it). Besides, GPUs are scalar these days, so SIMD instructions in the bytecode are counterproductive for the driver anyway.

Could it have done a write4? Maybe! But does that matter without seeing your GPU's uCode? No.
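
To illustrate the point about the driver: a backend compiler is free to recombine adjacent scalar stores when it lowers DXBC to hardware code. The sketch below is a toy peephole pass, not the behavior of any real driver; ScalarStore, WideStore and coalesce are hypothetical names, and it ignores details such as alignment requirements or intervening instructions.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One scalar store as in the DXBC listing: structure index, byte offset, value.
struct ScalarStore {
    uint32_t structIndex;
    uint32_t byteOffset;
    float value;
};

// A store of 1 to 4 consecutive 32-bit components starting at byteOffset.
struct WideStore {
    uint32_t structIndex;
    uint32_t byteOffset;
    std::array<float, 4> values;
    uint32_t count;
};

// Merge each run of stores that hit consecutive dwords of the same structure
// into a single store of up to 4 components.
std::vector<WideStore> coalesce(const std::vector<ScalarStore>& stores) {
    std::vector<WideStore> out;
    for (const ScalarStore& s : stores) {
        if (!out.empty()) {
            WideStore& w = out.back();
            const bool adjacent = w.structIndex == s.structIndex &&
                                  s.byteOffset == w.byteOffset + 4u * w.count;
            if (adjacent && w.count < 4) {
                w.values[w.count] = s.value;  // extend the current wide store
                ++w.count;
                continue;
            }
        }
        // Start a new wide store holding just this component.
        out.push_back(WideStore{s.structIndex, s.byteOffset,
                                {s.value, 0.0f, 0.0f, 0.0f}, 1});
    }
    return out;
}
```

Fed the four stores from the listing (offsets base, base + 4, base + 8 and base + 12 in the same structure), this pass produces one 4-component store, which is why the instruction count in the DXBC says little about what the hardware finally executes.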

 

 
