5 hours ago, NikiTo said:
I picture in my mind, the 384 lines load from device memory the 100 elements array. Then they modify those elements. Execution ends, and another 384 lines of the 1024 lines in the groups are being physically reloaded (their registers). But the array in shared memory was physically changed and needs to be restored too.
Shared memory does not need to be restored if the CU switches to processing another thread group. Instead, each thread group occupies its own range of shared memory until it has completed. After completion, the content of shared memory is undefined when a new thread group starts executing. This is why using too much shared memory (and/or too many registers, or too-large thread groups like 1024) limits the GPU's ability to alternate execution of multiple thread groups, which can decrease performance dramatically.
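To make that concrete, here is a minimal HLSL sketch (the kernel and array names are made up for illustration). The comments walk through how the per-group LDS footprint bounds how many groups a GCN CU can keep resident at once:

```hlsl
// Each thread group that runs on a CU gets its own private slice of LDS.
// On GCN a CU has 64 KB of LDS, so the per-group allocation directly
// limits how many groups can be resident for latency hiding:
//   32 KB per group -> at most 2 groups in flight per CU
//   16 KB per group -> at most 4 groups in flight per CU
groupshared float cache[100]; // 400 bytes: harmless for occupancy

[numthreads(1024, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    // With 1024 threads per group, thread count and register usage are
    // the more likely occupancy limiters here; a much larger groupshared
    // array would add LDS as a third limiter.
    cache[gtid.x % 100] = 0.0f; // placeholder work
}
```

The practical takeaway: occupancy is capped by whichever resource (LDS, VGPRs, thread count) runs out first, so shrinking only one of them may not help.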
Understanding those limits is probably the most important thing to know about GPUs. AMD has very good documentation about them, and NV is pretty similar (main difference: NV has no scalar registers but more VGPRs). Memory access patterns are also important, e.g. how large power-of-2 strides can cause huge slowdowns.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf
(OpenCL 1.x is the same as compute shaders so the doc is fine, but it uses different terminology sometimes.)
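On the stride point, here is a hedged HLSL sketch (buffer names and the stride value are invented): with a large power-of-2 stride, consecutive threads can all land on the same memory channel, serializing what should be parallel loads. The AMD optimization guide above covers this under channel conflicts.

```hlsl
StructuredBuffer<float>   Input;
RWStructuredBuffer<float> Output;

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Bad: a stride of 4096 elements means successive threads hit
    // addresses that map to the same channel, causing serialization.
    float slow = Input[id.x * 4096];

    // Workarounds: pad the stride by one element (4097), or reorganize
    // the data layout so consecutive threads read consecutive addresses.
    float fast = Input[id.x * 4097];

    Output[id.x] = slow + fast;
}
```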
6 minutes ago, NikiTo said:
I wanted to use the first 100 threads to load the data, but I used the last 100 just to make it more messy. If the GPU decides to execute the threads from 0 to 63 first, the array would be full of garbage. I suppose the compiler is intelligent enough to first execute the if of the last 100 threads, then after the DeviceMemoryBarrierWithGroupSync command, it starts to execute everything the normal way.
In my real code, all the 1024 threads read from that 100 elements array, so I wanted to load it only once for all the 1024 threads.
Oh, maybe I got you wrong initially.
Your example should work as intended. It is very common to use only a fraction of the threads to load data into shared memory, issue a shared memory barrier, and after that all threads have access.
But DeviceMemoryBarrierWithGroupSync(); seems wrong. It should be GroupMemoryBarrierWithGroupSync(), I think.
(assuming Group means LDS (shared) memory and Device means global memory; too bad each API has its own terminology...)
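The pattern you describe could be sketched like this in HLSL (names and the source buffer are assumed, not from your code). Note that the barrier synchronizes all threads in the group, so it does not matter which wavefronts execute the if first:

```hlsl
groupshared float sharedData[100];
StructuredBuffer<float> Source;

[numthreads(1024, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    // Only the last 100 threads of the group fill the array.
    if (gtid.x >= 1024 - 100)
        sharedData[gtid.x - (1024 - 100)] = Source[gtid.x - (1024 - 100)];

    // Waits until all groupshared writes are visible AND all 1024
    // threads have reached this point, regardless of wavefront order.
    GroupMemoryBarrierWithGroupSync();

    // Now every thread in the group can safely read any element.
    float value = sharedData[gtid.x % 100];
}
```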