
How is Shared Memory preserved between waves?

Started by May 30, 2018 11:25 PM
45 comments, last by JoeJ 6 years, 8 months ago

My GPU reports 384 lines in total.
I want to make it compute groups of 1024 threads.
So, if I use part of the threads to load an array of 100 elements from a texture into shared memory, how does this data survive after the 384th line?
I picture it in my mind: the 384 lines load the 100-element array from device memory. Then they modify those elements. Execution ends, and another 384 lines of the 1024 lines in the group are physically reloaded (their registers). But the array in shared memory was physically changed and needs to be restored too.

I don't know if I'm explaining myself well. I picture the physical movement of data. The data was changed, and before the next 384 lines can start executing, they have to be given the same start conditions that all lines get. This means either reloading from the slow memory, or keeping the information in shared memory (which leads to an error), or I need to give the compiler a hint of some kind.

Is this what DeviceMemoryBarrier stands for?

What do you mean by "lines"? I assume you're talking about the total number of SIMD lanes, which are the execution units that process a single thread?

Most GPUs require that all threads in a thread group using shared memory "live" on the same GPU core (AMD calls them Compute Units, or CUs for short; Nvidia calls them Streaming Multiprocessors, or SMs for short). Most GPUs can over-commit threads to the functional units on those cores. For instance, an AMD CU has 4 SIMD units, and each of those SIMD units can have up to 10 wavefronts active at once. Those 10 wavefronts don't actually execute at the same time; instead the hardware will cycle through them (usually it will try to do so when a shader program encounters a long stall due to memory access). However, the max number of waves that it can keep in flight simultaneously (called the "occupancy") is limited by both the number of registers that the shader uses and the shared memory allocation. The hardware will try to fill up the cores with as many wavefronts as possible until it either runs out of registers or runs out of on-chip memory used as the backing store for shared memory, which is why you usually want to minimize those two things.
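To make the occupancy point concrete, here is a minimal sketch. The 64 KB-of-LDS-per-CU figure and the resulting "2 groups per CU" are typical for AMD GCN and are assumptions for illustration only; the resource names are made up:

// Assumption for the example: the CU has 64 KB of on-chip LDS backing groupshared memory.
// If every thread group of this shader asks for 32 KB of groupshared storage, at most
// 64 KB / 32 KB = 2 such groups can be resident on a CU at once, no matter how few
// registers the shader uses.
groupshared float bigScratch[8192];   // 8192 floats * 4 bytes = 32 KB of groupshared memory

RWStructuredBuffer<float> output;     // hypothetical output, just so the allocation is used

[numthreads(256, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint3 gtid : SV_GroupThreadID)
{
    bigScratch[gtid.x] = (float)dtid.x;
    GroupMemoryBarrierWithGroupSync();
    // Fewer resident groups means fewer wavefronts for the scheduler to switch between
    // while hiding memory latency, so keeping this allocation (and the register count)
    // small usually improves occupancy.
    output[dtid.x] = bigScratch[(gtid.x + 1) % 256];
}

With only a couple of groups resident, a long memory stall leaves the SIMDs with little other work to switch to, which is the performance cliff described above.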


When I use this:
https://msdn.microsoft.com/en-us/library/windows/desktop/mt709115(v=vs.85).aspx

I get:
WaveLaneCountMin: 64
WaveLaneCountMax: 64
TotalLaneCount:   384

I read on the internet that shared memory is shared across all the threads in a thread group. And it is not clear to me how the GPU reloads its max of 384 lines while logically keeping the meaning of the shared memory. I believe I can code super small shaders and fit them logically into 1024 threads, but then I started to imagine it physically, and my GPU has only 384.

Just for example, what if I use only the last 5 threads of a large group to load data from device memory into shared memory? The first threads will be executed without anybody having read the data from device memory yet. I mean, physically.

(If the thread number is (1023,0,0), read from the texture into an array in shared memory.) How is this handled? When threads 0-10 are executed, the data is not available yet. This makes me think that DeviceMemoryBarrier should make the GPU load the data into shared memory first, as thread 1023 would do, and only then schedule threads 0-10, 10-20, 20-30... for execution. It is all very messy. Executing thread 1023 first, then saving a snapshot of the shared memory for the further threads to use.

This here says "all threads in the group"
https://msdn.microsoft.com/en-us/library/windows/desktop/ff471367(v=vs.85).aspx

Which is very messy to imagine with a group of 1024 and fewer total lines on hardware.

First, take a look at this page: https://msdn.microsoft.com/en-us/library/windows/desktop/ff471442(v=vs.85).aspx

The picture might help a bit in visualizing things, and it also brings up the maxthreads limitation regarding the maximum number of threads in a thread group. Anyway, each thread group executes on a CU or SM (AMD or Nvidia), as MJP has mentioned, and each CU or SM has its own shared memory. In your case the wave lane count is 64, so you have an AMD card. Second, the total lane count is 384, so if you divide this by 64 you get 6: your GPU has 6 CUs, so six wavefronts' worth of lanes can be executing at any instant, and each thread group gets its own shared memory allocation on its CU. So let's say your thread groups have 1024 threads each; those 1024 threads will all execute on the same CU (in your case, since you're AMD) with the same shared memory.

In addition, in case you didn't already know, GPU threading isn't exactly the same as CPU threading. GPUs use something called SIMT (single instruction, multiple threads), which executes all threads in lock step. This means that instruction 1 of your shader is executed for all threads in a thread group before moving on to instruction 2 of the shader for that thread group. Then there is something called branch divergence and data divergence (and memory coalescing), which you should look up to get a better idea of how a modern GPU works. So your example is pretty much impossible: if instruction 2 loads data into shared memory, then instruction 2 is executed up to 1024 times in the above example, and if it is executed fewer than 1024 times (branch divergence, i.e. a conditional load) you aren't using the GPU efficiently. That is unavoidable sometimes, but when making things run fast on a GPU it's something that is taken into account when choosing/developing an algorithm.

edit - I should have also mentioned that instruction N of the 1024 threads is executed 64 at a time in your case until all 1024 lanes are done (see the sketch at the end of this post). This is implied, but since you're having trouble visualizing it, I guess I should be explicit.

edit2 - it seems I might be wrong at least on nvidia hardware.  Will explain in next post.

edit3 - I was wrong on all hardware.  And that makes sense too since it would be stupid to do something per vendor with regards to having to use a barrier.
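To pin down the numbers above (and consistent with the corrections in the edits), a small sketch; it is purely illustrative, and the debugOut buffer is a made-up name, not code from this thread:

// On a 64-wide GPU (wave64, i.e. your WaveLaneCountMin/Max of 64), a 1024-thread group
// is executed as 1024 / 64 = 16 wavefronts. All 16 are resident on the same CU and see
// the same groupshared allocation, but the hardware interleaves them; they do not all
// advance in a single lock step, which is why a barrier is needed between them.
RWStructuredBuffer<uint2> debugOut;   // hypothetical output buffer, just for the example

[numthreads(1024, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID)
{
    uint waveIndex = gtid.x / 64;   // which of the 16 wavefronts this thread belongs to
    uint laneIndex = gtid.x % 64;   // lane within that wavefront
    debugOut[gid.x * 1024 + gtid.x] = uint2(waveIndex, laneIndex);
}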

-potential energy is easily made kinetic-

I wanted to do something like this in pseudocode:

groupshared float gsdata[100];

Texture2D<float> dataTexture;   // the source texture (its declaration was omitted in the pseudocode)

[numthreads(1024,1,1)]
void CSMain( uint3 threadInd : SV_GroupThreadID )   // other parameters omitted
{
    // Only the last 100 threads (924..1023) load from the texture into shared memory.
    if (threadInd.x >= 1024 - 100) {    // "if (threadInd.x < 100) {"   in the original code
        gsdata[threadInd.x - (1024 - 100)] = dataTexture.Load(int3(threadInd.x - (1024 - 100), 0, 0));
    }

    DeviceMemoryBarrierWithGroupSync();

    // do things on arbitrary elements of the shared array (meaningless things, just for the example;
    // the indices are wrapped with % 100 to stay inside the array)
    float valueA = gsdata[(threadInd.x / 10 + threadInd.x * 3) % 100];
    float valueAplusB = valueA + gsdata[(threadInd.x / 10 + threadInd.x * 5) % 100];
    gsdata[(threadInd.x / 10) % 100] = valueAplusB;
    // ......
}


I wanted to use the first 100 threads to load the data, but I used the last 100 just to make it messier. If the GPU decides to execute threads 0 to 63 first, the array would be full of garbage. I suppose the compiler is intelligent enough to execute the 'if' of the last 100 threads first, and only after the DeviceMemoryBarrierWithGroupSync command does it start to execute everything the normal way.
In my real code, all 1024 threads read from that 100-element array, so I wanted to load it only once for all 1024 threads.

I am now searching for some more visual info about memory coalescing...

5 hours ago, NikiTo said:

I picture it in my mind: the 384 lines load the 100-element array from device memory. Then they modify those elements. Execution ends, and another 384 lines of the 1024 lines in the group are physically reloaded (their registers). But the array in shared memory was physically changed and needs to be restored too.

Shared memory does not need to be restored if the CU switches to processing another thread group. Instead, each thread group occupies its own range of shared memory until it has completed. After completion, the content of shared memory is undefined when a new thread group starts executing. This is why using too much shared memory (and/or too many registers, or too large thread groups like 1024) limits the GPU's capability to alternate execution of multiple thread groups, which can decrease performance dramatically.

Understanding those limits is probably the most important thing to know about GPUs. AMD has very good documentation about it, and NV is pretty similar (main difference: NV has no scalar registers but more VGPRs). Also important are memory access patterns, e.g. how large power-of-2 strides can cause huge slowdowns (see the sketch after the links below).

https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf

(OpenCL 1.x is the same as compute shaders so the doc is fine, but it uses different terminology sometimes.)
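To illustrate the stride point, a hedged sketch; the 32-banks-of-4-bytes LDS layout is typical for GCN, and the names and sizes are made up for the example:

#define ROWS   64
#define PITCH  33              // 32 would be the conflict-prone power-of-2 stride

groupshared float lds[ROWS * PITCH];

RWStructuredBuffer<float> result;   // hypothetical output, just for the example

[numthreads(64, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    uint row = gtid.x;
    // With a row stride of 32 floats, lds[row * 32] puts every lane of the wavefront
    // in the same bank and the accesses serialize. Padding the stride to 33 spreads
    // consecutive rows across different banks.
    lds[row * PITCH] = (float)row;
    GroupMemoryBarrierWithGroupSync();
    result[gtid.x] = lds[((row + 1) % ROWS) * PITCH];   // read a neighbour's row, conflict-free
}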

 

6 minutes ago, NikiTo said:

I wanted to use the first 100 threads to load the data, but I used the last 100 just to make it messier. If the GPU decides to execute threads 0 to 63 first, the array would be full of garbage. I suppose the compiler is intelligent enough to execute the 'if' of the last 100 threads first, and only after the DeviceMemoryBarrierWithGroupSync command does it start to execute everything the normal way.
In my real code, all 1024 threads read from that 100-element array, so I wanted to load it only once for all 1024 threads.

Oh, maybe I got you wrong initially.

Your example should work as intended. It is very common to use only a fraction of threads to load data into shared memory, use a shared memory barrier, and after that all threads have access.

But DeviceMemoryBarrierWithGroupSync(); seems wrong. It should be GroupMemoryBarrierWithGroupSync(), I think.

(assuming Group means LDS (shared) memory and Device means global memory - too bad each API has its own terminology...)
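For reference, a minimal sketch of that pattern with the group barrier; the resource names mirror the pseudocode above, and the output buffer is made up for the example:

groupshared float gsdata[100];

Texture2D<float>          dataTexture;    // the source texture from the pseudocode above
RWStructuredBuffer<float> outputBuffer;   // hypothetical output, just so the result goes somewhere

[numthreads(1024, 1, 1)]
void CSMain(uint3 threadInd : SV_GroupThreadID)
{
    // Only the first 100 threads fetch from device memory into shared memory.
    if (threadInd.x < 100)
    {
        gsdata[threadInd.x] = dataTexture.Load(int3(threadInd.x, 0, 0));
    }

    // Group (LDS) barrier: every thread in the group waits here until the
    // groupshared writes above are visible to the whole group.
    GroupMemoryBarrierWithGroupSync();

    // Now any of the 1024 threads may read any element of gsdata.
    outputBuffer[threadInd.x] = gsdata[threadInd.x % 100];
}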


Thank you for the links, @JoeJ !
I was not sure if DirectCompute/OpenCL were of any use to me. Nice to know that I can learn something from their documentation too.

The best learning resource for me was the compute chapter in the OpenGL SuperBible. It introduces parallel programming basics like prefix sum with short, practical and enlightening examples (see the sketch below).

The technical details / limitations then took some more time for me to get... ;)
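For anyone curious, a hedged HLSL sketch of the prefix-sum idea mentioned above (the SuperBible uses GLSL, and the buffer names here are made up). It is a Hillis-Steele inclusive scan over one 256-element thread group:

#define N 256

StructuredBuffer<float>   inputValues;    // hypothetical input
RWStructuredBuffer<float> prefixSums;     // hypothetical output

groupshared float temp[N];

[numthreads(N, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    temp[gtid.x] = inputValues[dtid.x];
    GroupMemoryBarrierWithGroupSync();

    // At each step, add the value 'offset' slots to the left, doubling the offset each time.
    for (uint offset = 1; offset < N; offset *= 2)
    {
        float addend = (gtid.x >= offset) ? temp[gtid.x - offset] : 0.0f;
        GroupMemoryBarrierWithGroupSync();   // all reads finish before anyone writes
        temp[gtid.x] += addend;
        GroupMemoryBarrierWithGroupSync();   // all writes finish before the next round of reads
    }

    prefixSums[dtid.x] = temp[gtid.x];
}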

OK, so in my last post I messed up big time. I forgot something I had read with regards to shared memory. Anyway, long story short, you do need to use a GroupMemoryBarrierWithGroupSync. Basically, a thread group size larger than the SIMT width gets broken down into multiple waves (AMD) / warps (Nvidia), but there is no guarantee of synchronization between them, so you have to use a sync point. Even if some hardware did sync up related warps/waves, without a guarantee that all hardware works this way, you need to standardize the use of the barrier with sync to keep the hardware inconsistencies transparent. Sorry for my terrible memory causing trouble.

I also watched a bit of this video, which confirmed that Nvidia hardware (at least of the Kepler generation) treats all warps independently (first 10 minutes).

http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4

-potential energy is easily made kinetic-

Interesting video!
I see they talk about CUDA cores and GPU cores. It was not clear to me whether, physically, a GPU has cores for the compute engine that cannot be used by the graphics engine, and vice versa.
Async compute and shader model 6 suggest that the compute engine and the graphics engine use the same physical silicon.
If there are physically separate chips for the compute and graphics engines, I should issue both graphics and compute workloads at the same time to take full advantage of the hardware. This is still not clear to me.

(I mean physically separated like this:
https://en.wikipedia.org/wiki/Graphics_processing_unit#/media/File:AMD_HD5470_GPU.JPG
It has physically separate hardware for media.)
If I cannot explain myself well, I will put it a silly way: looking at the GPU, one could see one chip for graphics and another chip on the side for compute.

Async compute uses idle compute units. It is not clear to me whether it can use CUs left unused by the graphics engine.
I could eliminate the graphics pipeline completely and replace it entirely with the compute engine.

Maybe the compute engine and the graphics engine are virtual concepts that map to the same physical ALUs/CUs. It seems more logical for a manufacturer to do it this way.

This is not clear to me yet.

Programming with only the DX12 API, it takes me a lot of effort to change anything, so I need to clear up as many doubts as possible before I even start coding. Right now I am drawing my pipeline in Photoshop. When I have it all clear, I will program it using the DX12 API.

Now I have almost the complete program running using only pixel shaders, and trillions of computations are instantly ready when I press Enter. I am glad; I expected it to take a few minutes. I had to slice my algorithm and reverse it in order to make it fit the wicked "read from many, write to one" logic of pixel shaders. But if I use shared memory I can save a ton, I mean billions, of device memory accesses. That's why I am going through the pain of rewriting everything I have programmed so far to use the compute engine. But if there are separate ALUs for compute and graphics, maybe I should rewrite only part of my old program and leave some of the workload for the pixel shaders too.

This topic is closed to new replies.
