12 minutes ago, NikiTo said:
I will consider the time lost for reading data. I always try to Load() some data as many instructions as possible before it is used, to give the silicon something useful to do in the meantime, but sadly it rarely makes sense for my code. Most often I need data from device memory in the very next instruction.
That's common, but note that you have little influence on the compiler's decisions. The compiler already tries to do the same, so it will rearrange your instructions to pre-load anyway. You could use branches to get more control, similar to my prefix sum example where a useless branch improved performance, but doing so is more likely to make things worse.
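To illustrate what the compiler is already trying to do, here is a minimal HLSL sketch of manual latency hiding: issue the fetch early, then do independent ALU work before the first use. The buffer names and the filler math are made up for illustration; whether hand-scheduling like this helps depends on what the compiler would have done anyway.

```hlsl
Buffer<float4>   srcData;   // hypothetical input buffer
RWBuffer<float4> dstData;   // hypothetical output buffer

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    float4 fetched = srcData.Load(id.x);   // load issued here...

    // ...independent work that does not depend on 'fetched' can overlap
    // with the memory latency (the compiler may reorder this anyway):
    float a = sin((float)id.x) * 0.5 + 0.5;

    dstData[id.x] = fetched * a;           // first real use of the load
}
```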
Same for registers. Just because you use 20 variables does not mean the compiler will use 20 registers. Again, the code will be rearranged and optimized. There is a trade-off between pre-loading data early (which increases register usage) and keeping register pressure low, but the compiler decides on its own.
But messing around here can often make sense as well. Sometimes I put a register value into LDS before a register-heavy code section and get it back afterwards; that can help (see the sketch below). Rearranging code myself by trial and error until performance improves can also help. Forcing or preventing loop unrolling is always worth a try. Use bool instead of bit packing, because IIRC a bool already costs each thread only one bit of a scalar register.
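Here is a minimal sketch of that LDS trick in HLSL, combined with a [loop] attribute to suppress unrolling. The names and the stand-in "heavy" loop are hypothetical; since each thread only touches its own groupshared slot, no barrier is needed.

```hlsl
RWBuffer<float>  outData;     // hypothetical output
groupshared float spill[64];  // one LDS slot per thread in the group

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    float precious = (float)dtid.x * 0.001;  // value to keep alive

    spill[gtid.x] = precious;   // park it in LDS, hoping to free a register

    // stand-in for a register-hungry section:
    float acc = 0.0;
    [loop]                      // prevent unrolling here, as an experiment
    for (uint i = 0; i < 32; ++i)
        acc += sin(acc + (float)i);

    precious = spill[gtid.x];   // fetch it back afterwards
    outData[dtid.x] = precious + acc;
}
```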
None of those things works wonders on its own, but they add up.
One other tip: temporarily comment out the code that writes your results, and/or use procedural input, while making sure the compiler does not optimize your calculations away (a sketch of how to keep the math alive follows below). This way you see how much time the memory operations take. If it's a lot, that hints at bad access patterns. Changing a list size from 256 to 257 doubled the performance of one of my shaders, for instance.
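A minimal sketch of the "don't let the compiler delete your math" part: replace the real result write with one guarded by a flag the compiler cannot prove false. The constant buffer, the flag name, and the stand-in computation are all hypothetical; the app would always set the flag to 0 at runtime.

```hlsl
cbuffer Params { uint neverTrue; };  // hypothetical; app always sets this to 0
RWBuffer<float> results;

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // stand-in for the expensive work we want to time:
    float r = 0.0;
    for (uint i = 0; i < 64; ++i)
        r += sqrt(r + (float)id.x + (float)i);

    // results[id.x] = r;            // real write, commented out for profiling

    if (neverTrue != 0)              // compiler can't prove this is never taken,
        results[0] = r;              // so the computation above survives
}
```

Comparing the timing of this version against the one with the real write restored shows roughly how much the output memory traffic costs.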
27 minutes ago, NikiTo said:
My algorithm takes hours in C++ to compute a single pixel.
Now that sounds worrying. My algorithm is something like 50 times faster on a FuryX than on an i7 CPU with multithreading and SIMD, although it does not saturate the GPU yet (I still need to run something else async). But you can expect factors of 30-100, not much more.
However, in the beginning I needed minutes for a single frame, and ten years later it's milliseconds; that's because of improved algorithms, though, not hardware wonders.