float4 read/write as 4 instructions instead of 1?

Started by gamer9xxx January 17, 2018 09:15 PM

1 comment, last by galop1n 7 years ago

gamer9xxx

Author

3

January 17, 2018 09:15 PM

Hi guys,

I'm writing a simple compute shader in DirectX 11 (Shader Model 5). Each thread stores a float4 color into groupshared memory and then reads it back.
From my understanding of MSDN, the store_structured instruction (and likewise ld_structured) can write four 32-bit components at once.

Quote
"This instruction performs 1-4 component *32bit components written from src0 to dst0 at the address in dstAddress and dstByteOffset."

Therefore I would expect one float4 write to translate into one store_structured instruction.
However, in my simple shader it translates into 4 store_structured instructions!

Code:


groupshared float4 ColorQuad[4][4][64];

...

ColorQuad[x][y][shared_index] = float4(0.1f, 0.2f, 0.3f, 0.4f);

...

float4 color = ColorQuad[x][y][shared_index];

This code is compiled into this:


dcl_tgsm_structured g0, 4096, 4

...

mov r1.x, r0.w
imul null, r1.y, r0.x, l(16)
imad r1.y, r0.z, l(1024), r1.y
store_structured g0.x, r1.x, r1.y, l(0.100000)  // ColorQuad<0>
iadd r1.z, r1.y, l(4)
store_structured g0.x, r1.x, r1.z, l(0.200000)  // ColorQuad<0>
iadd r1.z, r1.y, l(8)
store_structured g0.x, r1.x, r1.z, l(0.300000)  // ColorQuad<0>
iadd r1.y, r1.y, l(12)
store_structured g0.x, r1.x, r1.y, l(0.400000)  // ColorQuad<0>

...

imad r1.y, r0.z, l(1024), r1.y
ld_structured r2.x, r1.x, r1.y, g0.xxxx  // ColorQuad<0:Inf>
iadd r1.z, r1.y, l(4)
ld_structured r2.y, r1.x, r1.z, g0.xxxx  // ColorQuad<1:Inf>
iadd r1.z, r1.y, l(8)
ld_structured r2.z, r1.x, r1.z, g0.xxxx  // ColorQuad<2:Inf>
iadd r1.y, r1.y, l(12)
ld_structured r2.w, r1.x, r1.y, g0.xxxx  // ColorQuad<3:Inf>

Now I'm very confused about why this is happening.
Am I misunderstanding MSDN, and it actually means the instruction can write only 1 x 32 bits (4 x 8 bits) of data?
Or is this a compiler optimization to avoid bank conflicts?

Thanks for any explanation!

galop1n

1,046

January 18, 2018 06:59 PM

The DXBC bytecode does not matter much compared to the final microcode (uCode) that the driver generates. Besides, GPUs are scalar these days, so SIMD instructions in the bytecode are counterproductive for the driver anyway.

Could it have done a write4? Maybe! But does it matter without seeing your GPU's uCode? No.
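
To illustrate why the split is not a semantic difference, here is a small CPU-side sketch showing that four 32-bit stores at offsets +0, +4, +8 and +12 leave the same bytes in memory as one 16-byte store. It only models memory contents; it is not a claim about how any particular driver lowers the DXBC. The function names `storeVector` and `storeScalars` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");

using Bytes16 = std::array<unsigned char, 16>;

// Models a single four-component (16-byte) store.
inline Bytes16 storeVector(const float (&color)[4]) {
    Bytes16 mem{};
    std::memcpy(mem.data(), color, sizeof(color));
    return mem;
}

// Models four one-component stores at byte offsets 0, 4, 8 and 12,
// the pattern seen in the disassembly.
inline Bytes16 storeScalars(const float (&color)[4]) {
    Bytes16 mem{};
    for (std::size_t c = 0; c < 4; ++c) {
        std::memcpy(mem.data() + 4 * c, &color[c], sizeof(float));
    }
    return mem;
}
```

Calling both functions with `{0.1f, 0.2f, 0.3f, 0.4f}` yields identical byte arrays, so whether the hardware ends up issuing one wide write or four narrow ones is something only the final uCode can show.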

This topic is closed to new replies.
