
How is Shared Memory preserved between waves?

Started by May 30, 2018 11:25 PM
45 comments, last by JoeJ 6 years, 8 months ago
12 minutes ago, NikiTo said:

I will consider the time lost for reading data. I always try to Load() data as many instructions as possible before it is used, to give the silicon something useful to do meanwhile, but sadly it rarely makes sense for my code. Most often I need data from device memory in the very next instruction.

That's common, but note that you have little influence on the compiler's decisions. The compiler already tries to do the same, so it will rearrange your instructions to pre-load anyway. You could use a branch to get more control, similar to my prefix sum example where the seemingly useless branch improved performance, but doing so is more likely to make things worse.

Same for registers. Just because you use 20 variables does not mean the compiler will use 20 registers. Again, code will be rearranged and optimized. It's a trade-off between pre-loading data and increasing register usage, or the other way around. But the compiler decides on its own.

But messing around here can often make sense as well. Sometimes I move a register value to LDS before a register-heavy code section and fetch it back afterwards; that can help. Rearranging code myself by trial and error until performance improves can also help. Forcing or preventing loop unrolling is always worth a try. Use bool instead of bit packing, because with bool each thread already uses only one bit of a scalar register, IIRC.

None of these things works wonders, but they add up.

One other tip is to temporarily comment out the code that writes your results, and/or use procedural input, while ensuring the compiler does not optimize away your calculations. This way you see how much time the memory operations take. If it's a lot, this hints at bad access patterns. Changing a list size from 256 to 257 doubled the performance of one of my shaders, for instance.

27 minutes ago, NikiTo said:

My algorithm takes hours in C++ to compute a single pixel.

Now that sounds worrying. My algorithm is something like 50 times faster on a FuryX than on an i7 CPU with multithreading and SIMD, although it does not saturate the GPU yet - I need to do something else async... But you can expect factors of 30-100, not much more.

However, in the beginning I needed minutes for a single frame, and ten years later it's milliseconds - but that's because of improved algorithms, not hardware wonders.

 

2 hours ago, JoeJ said:

Being out of practice, I won't try to follow, but I can tell you that it should not be necessary to calculate occupancy by hand. It's impossible anyway because you can't predict register usage.

That's understandable, but shared memory per thread group is well within your grasp, so I think you should at least take that into account. Also, trying to save registers isn't necessarily a bad thing and might pay off, but using CodeXL or whatever profiling tool you're using sounds like a good idea before you go crazy explicitly trying to save registers with no hope of reaching a target.

-potential energy is easily made kinetic-


 

48 minutes ago, JoeJ said:

But you can expect factors of 30-100, not much more.

I guess my algorithm improved through the process. Having to literally break my way of thinking to adapt it to the weird way pixel shaders work gave me some good solutions. I mean, through the process of dealing with the GPU, I parallelized the algorithm even more than planned. I found trees in the algorithm I was not able to see before. If I had stayed in the C++ world, I would never have improved the algorithm so much. It is the same algorithm, but the implementation improved a lot - like regular vs. bitsliced implementations of AES.

Now, this is the improvement I expect to get by switching from pixel to compute shaders. I cannot get below 32KiB of shared memory. But it pays off, because of all the data movement and memory usage I save. All the textures in the image below were using 1.5G+ of device memory; with compute I need only two textures. And that was already the most optimized pixel shader pipeline possible without the compute engine.

[image: the textures used by the pixel shader pipeline]

Makes sense. (The large thread groups post on the AMD site is also an example of accepting inefficiencies if the algorithm demands it.)

But you could eventually use a work group size of 512 'for free' in this case and concentrate on changes utilizing the additional threads.

I have data dependencies. The need for ping-ponging doubles the amount of shared memory I need. I wanted to use groups of 1024 initially, but I had to shrink them twice.

Is a matrix of one row expensive?
float1x4  vector;
I may need a vector with indexing capabilities for loops.

12 minutes ago, NikiTo said:

I have data dependencies. The need for ping-ponging doubles the amount of shared memory I need.

You could still do some of the ping-ponging with multiple dispatches, maybe. GPUs have high bandwidth, just high latency as well.

For instance, I access all my data at least twice, because all the logic does not fit into a single shader without making it too complex.

15 minutes ago, NikiTo said:

Is a matrix of one row expensive?
float1x4  vector;
I may need a vector with indexing capabilities for loops.

There is no hardware matrix.

float1x4 == float4 == float x,y,z,w; in hardware.

I guess you ask because you want to index like this:

float f = myfloat1x4[j];

Unfortunately this does not work like it does on the CPU. Under the hood it will generate code like this:

switch (j)
{
case 0: return myfloat1x4.x;
case 1: return myfloat1x4.y;
case 2: return myfloat1x4.z;
case 3: return myfloat1x4.w;
}

 

You mentioned earlier you want to avoid IFs - note that a conditional assignment is not a branch:

float f = j ? myfloat1x4.x : myfloat1x4.y;

This is just one instruction (I guess), like on the CPU.

So the above can be optimized, and the compiler will do so. A small array of size 4 is thus not that bad.

 

This topic is closed to new replies.
