
How is Shared Memory preserved between waves?

Started by May 30, 2018 11:25 PM
45 comments, last by JoeJ 6 years, 8 months ago
12 minutes ago, NikiTo said:

I will consider the time lost for reading data. I always try to Load() data as many instructions as possible before it is used, to give the silicon something useful to do meanwhile, but sadly it rarely makes sense for my code. Most often I need data from device memory in the very next instruction.

That's common, but note that you have little influence on the compiler's decisions. The compiler already tries to do the same, so it will rearrange your instructions to pre-load anyway. You could use a branch to get more control, similar to my prefix sum example where the seemingly useless branch improved performance, but doing so is more likely to make things worse.

Same for registers. Just because you use 20 variables does not mean the compiler will use 20 registers. Again, code will be rearranged and optimized. It's a trade-off between pre-loading data and increasing register usage, or the other way around. But the compiler decides on its own.

But messing around here can often make sense as well. Sometimes I move a register value to LDS before a register-heavy code section and fetch it back afterwards; that can help. Rearranging code myself by trial and error until performance improves can also help. Forcing or preventing loop unrolling is always worth a try. Use bool instead of bit packing, because with bool each thread already uses only one bit of a scalar register, IIRC.

None of these things works wonders, but they add up.

One other tip is to temporarily comment out the code that writes your results, and/or use procedural input, while ensuring the compiler does not optimize away your calculations. This way you see how much time the memory operations take. If it's a lot, this hints at bad access patterns. Changing a list size from 256 to 257 doubled the performance of one of my shaders, for instance.

27 minutes ago, NikiTo said:

My algorithm takes hours in C++ to compute a single pixel.

Now that sounds worrying. My algorithm is something like 50 times faster on a FuryX than on an i7 CPU with multithreading and SIMD, although it does not saturate the GPU yet - I need to do something else async... But you can expect factors of 30-100, not much more.

However, in the beginning I needed minutes for a single frame, and ten years later it's milliseconds - but that's because of improved algorithms, not hardware wonders.

 

2 hours ago, JoeJ said:

Being out of practice, I won't try to follow, but I can tell you that it should not be necessary to calculate occupancy by hand. It's impossible anyway because you can't predict register usage.

That's understandable, but shared memory per thread group is well within your grasp, so I think you should at least take that into account. Also, trying to save registers isn't necessarily a bad thing and might pay off, but using CodeXL or whatever profiling tool you're using sounds like a good idea before you go crazy explicitly trying to save registers with no hope of reaching a target.

-potential energy is easily made kinetic-


 

48 minutes ago, JoeJ said:

But you can expect factors of 30-100, not much more.

I guess my algorithm improved through the process. Having to literally break my way of thinking to adapt it to the weird way pixel shaders work gave me some good solutions. I mean, through the process of dealing with the GPU, I parallelized the algorithm even more than planned. I found trees in the algorithm I was not able to see before. If I had stayed in the C++ world, I would never have improved the algorithm so much. It is the same algorithm, but the implementation improved a lot - like regular vs. bitsliced implementations of AES.

Now, this is the improvement I expect to get by switching from pixel to compute shaders. I cannot get below 32KiB of shared memory. But it pays off, because of all the data movement and memory usage I save. All the textures in the image below were using 1.5G+ of device memory; with compute I need only two textures. And that was already the most optimized pixel shader pipeline possible without the compute engine.

[image: the textures used by the pixel shader pipeline]

Makes sense. (The large thread groups post on the AMD site is also an example of accepting inefficiencies if the algorithm demands it.)

But you could eventually use a work group size of 512 'for free' in this case and concentrate on changes utilizing the additional threads.

I have data dependencies. The need for ping-ponging doubles the amount of shared memory I need. I wanted to use groups of 1024 initially, but I had to shrink them twice.

Is a matrix of one row expensive?
float1x4  vector;
I may need a vector with indexing capabilities for loops.

12 minutes ago, NikiTo said:

I have data dependencies. The need for ping-ponging doubles the amount of shared memory I need.

You could still do some of the ping-ponging with multiple dispatches, maybe. GPUs have high bandwidth, just high latency as well.

For instance, I access all my data at least twice, because all the logic does not fit into a single shader without making it too complex.

15 minutes ago, NikiTo said:

Is a matrix of one row expensive?
float1x4  vector;
I may need a vector with indexing capabilities for loops.

There is no hardware matrix.

float1x4 == float4 == float x,y,z,w; in hardware.

I guess you ask because you want to index like this:

float f = myfloat1x4[j];

Unfortunately this does not work like it does on the CPU. Under the hood it will generate code like this:

switch (j)
{
case 0: return myfloat1x4.x;
case 1: return myfloat1x4.y;
case 2: return myfloat1x4.z;
case 3: return myfloat1x4.w;
}

 

You mentioned earlier you want to avoid IFs - note that a conditional assignment is not a branch:

float f = j ? myfloat1x4.x : myfloat1x4.y;

This is just one instruction (I guess), like on the CPU.

So the above can be optimized, and the compiler will do so. A small array of size 4 is thus not that bad.

 

This topic is closed to new replies.
