I have been trying to solve this for a whole month now.
I am recomputing the same equation several times. The equation takes a few float operations and even a sqrt.
I tried various ways to compute the equation only once, store the result in LDS, and then read it from there instead of recomputing it.
Surprisingly for me, I could not find a way to beat recomputing by using LDS. Recomputing the same equation many times is the fastest option. I tried at least four different approaches: at least four main shaders and a lot of variations of each of them.
On the Internet I read that LDS is super fast. But at least for me, reading from LDS is slower than a few float operations and a square root. A lot slower.
One of the shaders has a group size of 64 on AMD, in order to leave multi-CU effects out of the picture (or at least reduce them). A group size of 64 with LDS is slower than a group size of 64 recomputing the same equation.
Is it normal for LDS to be so much slower than the ALU? I would not be surprised if LDS were slower than an add, but slower even than a square root?!
I can not show code. Just chatting here.
Using LDS slower than recomputing?!
I'm not an expert on GPU computing, but if it's at least somewhat similar to how CPUs work, then it doesn't surprise me that directly computing something (even with a sqrt) is faster than accessing memory.
Even a memory access from cache is usually a few times slower than just computing something in registers. So for a few years the trend on CPUs has been moving from precomputing to just calculating on the fly. Maybe the same holds true for GPUs as well.
NikiTo said:
On the Internet I read that LDS is super fast. But at least for me, reading from LDS is slower than a few float operations and a square root. A lot slower.
It is fast if you compare it against accessing memory, which is what it is usually compared against.
https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/
The article above puts some numbers on this with the Fury X: the peak memory bandwidth is 512 GB/s, the peak LDS bandwidth is 8.6 TB/s and the peak register bandwidth is 51.6 TB/s. So yeah it can be much faster than accessing memory, but if your computation is not complicated enough, recalculating it every time can be faster.
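For example, here is a toy contrast of the two options in HLSL. The equation and all the names are made up; it is only meant to show the shape of the trade-off:

groupshared float gCache[64];                     // option 2 parks one value per lane here

float Equation(float x)
{
    return sqrt(x * x + 2.0f * x + 1.0f);         // a few float ops plus a sqrt
}

RWStructuredBuffer<float> gOutput;

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    float x = (float)dtid.x;

    // Option 1: just recompute the value wherever it is needed - pure ALU work.
    float recomputed = Equation(x);

    // Option 2: compute once, store to LDS, sync, read back - LDS traffic plus a barrier.
    gCache[gtid.x] = Equation(x);
    GroupMemoryBarrierWithGroupSync();
    float cached = gCache[(gtid.x + 1) & 63];     // read a neighbour's value from LDS

    gOutput[dtid.x] = recomputed + cached;
}

With an equation this cheap, option 1 costs a handful of ALU cycles per use, while option 2 pays for the LDS write, the barrier and the LDS read every time.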
Thanks to both of you!
Can a bad LDS reading pattern be the reason?
A single lane generates more than one precomputed value. For example, in the case of a group size of 64, each lane generates 64 precomputed values and the whole shader needs 4096 DWORDs of groupshared data. That is just an example; it could be 64x32 for one set of 32 values, then another 64x32 for the next 32 values. Just some example numbers. But yes, every lane generates a lot of precomputed values spread over the LDS.
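To illustrate the shape of that layout (not my real code, the sizes and names are placeholders, and Precompute() stands in for the real equation):

groupshared float gPrecomputed[64 * 64];          // 4096 DWORDs for a group size of 64

float Precompute(uint lane, uint j)
{
    return sqrt((float)(lane * 64 + j));          // stands in for the real equation
}

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    // every lane writes 64 precomputed values, spread over the LDS
    for (uint j = 0; j < 64; j++)
    {
        gPrecomputed[gtid.x * 64 + j] = Precompute(gtid.x, j);
    }
    GroupMemoryBarrierWithGroupSync();
    // later the shader reads these values back instead of recomputing them
}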
NikiTo said:
Can a bad LDS reading pattern be the reason?
Read up on LDS ‘bank conflicts’ for your GPU.
However, you mention a sqrt plus some other math ops. IIRC, GCN can do one sqrt on the SFU while the SIMDs process 4 non-transcendental math ops. NV is probably similar. So pipelining may hide most of the sqrt cost.
It sounds like you have tried a lot of options already, so probably there is just no win to be had from caching to LDS here.
But you mention using a lot of LDS, so if you could reduce that by recalculating things more often, you could increase occupancy and eventually get a real speedup.
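A minimal sketch of the difference, assuming the GCN layout of 32 banks that are one DWORD wide (so a float array index maps to bank index % 32); laneIndex and rowOffset are just placeholders:

groupshared float lds[4096];

// conflict-heavy: with a stride of 32 floats, every lane of the wave
// lands in the same bank and the accesses get serialized
float a = lds[laneIndex * 32];

// conflict-free: consecutive lanes read consecutive DWORDs,
// so each lane hits a different bank
float b = lds[rowOffset + laneIndex];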
I did it. I fixed the LDS access pattern. At any moment, all the lanes access only the same bank of 256 DWORDs.
This way, precomputing and recomputing became the same speed.
But only for a workgroup of 64.
LDS became my bottleneck in another way.
Because every lane generates more than one value to store in LDS, a single wave could need an LDS array of size 4096. Plus, I need more LDS for other stuff too.
So I can not have regular big workgroups. Let me explain:
Shader A - recomputing shader with a group size of 1024.
Shader B - precomputing shader with a group size of 1024.
Shader C - precomputing shader with a group size of 64.
All three shaders parse the same amount of data.
Shaders B and C are more or less the same, in the sense that they both need to loop.
In order to parse the same amount of data with shaders B and C, I need a loop. Shader A naturally spans 16 waves, but shaders B and C need to loop 16 times in order to parse the same data.
The input and output of the three shaders are the same. They do the same thing.
It is just that inside shaders B and C I have for (uint i = 0; i < 16; i++) and the focus of the computation changes with each iteration. In shader B I need a group sync inside the loop to prevent other CUs from trying to use the same LDS all at the same time.
Case 1 - inside shader A I put “if (threadFlatID < 64)” to parse only one chunk of 64, and inside shaders B and C I put “for(…i<1…)” to parse only one chunk of 64 too. In this case the execution time is the same for all three shaders.
Case 2 - inside shader A I put “if (threadFlatID < 128)” to parse two chunks of 64, and inside shaders B and C I put “for(…i<2…)” to parse two chunks of 64 too. The execution time of shader A rises just slightly, while the execution time of shaders B and C DOUBLES!!
For 3 chunks of 64, shader A is again just slightly slower, but shaders B and C are now THREE times slower.
I can parse 16 chunks of data with shader A in the same time shaders B and C need for 2 chunks. Or even faster.
I guess the recomputing shader with its group size of 1024 allows other CUs to help, while shaders B and C explicitly force the single wave's pipeline to proceed linearly.
So I am keeping the 1024 group size shader that recomputes the values.
But now I will revise the pattern of all my LDS accesses in every older shader. I can easily notice a speedup when I respect the LDS banks.
(My AMD GPU has 6 CUs.)
(It is the LDS requirement that bottlenecks me. If I loop inside shader B, other CUs can grab work, but then I would need to remove the group sync barrier. It can not work that way, because only one chunk of 64 can use the LDS at a time. Without the sync barrier it is fast, but it prints garbage as a result.)
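For contrast with the loop below, shader A is roughly shaped like this (placeholder names like chunkCount, not my actual code):

[numthreads(1024, 1, 1)]
void ShaderA(uint threadFlatID : SV_GroupIndex)
{
    // case 1: chunkCount = 1, case 2: chunkCount = 2, ... up to 16
    if (threadFlatID < 64 * chunkCount)
    {
        // recompute the equation and process one element; each chunk of 64
        // is just another set of lanes, so the chunks do not wait on each other
    }
}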
The loop inside shader B is something like this:
[numthreads(64, 16, 1)]
…..
for (uint i = 0; i < 16; i++) {
    if (threadID.y == i) {
        …..  // only the 64 lanes of the active row touch the shared LDS this iteration
    }
    // every thread of the group has to reach the sync, so it sits outside the if
    GroupMemoryBarrierWithGroupSync();
}