Here is an example:
My thread group has 512 threads. I want to load 128 bytes of data from device memory into the LDS.
Something like this:
if (groupInd < 128) {
    // load the corresponding data element
}
Doing this, I expect exactly two wavefronts to do the job (128 threads at 64 threads per wavefront). It would screw up my mood if I coded it this way and the GPU spread the task over more than two wavefronts, creating divergence.
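To make the expectation concrete, here is a small host-side C++ sketch of the arithmetic. It rests on two assumptions that the post does not confirm: a wavefront is 64 threads wide, and threads are packed into wavefronts in order of their flattened group index (thread i goes to wavefront i / 64). The names kWaveSize, waveOf, wavesWithActiveThreads and hasDivergentWave are made up for illustration. If the hardware really packs threads this way, the guard touches two wavefronts and neither of them diverges; whether it does is exactly what is being asked.

```cpp
// Assumption: 64 threads per wavefront. Other hardware may use 32 or another width.
constexpr unsigned kWaveSize = 64;

// Assumption: threads are packed into wavefronts by flattened group index.
constexpr unsigned waveOf(unsigned groupInd) {
    return groupInd / kWaveSize;
}

// Number of threads that pass `groupInd < activeCount`, capped at the group size.
constexpr unsigned clampActive(unsigned groupSize, unsigned activeCount) {
    return activeCount < groupSize ? activeCount : groupSize;
}

// Number of wavefronts that contain at least one thread passing the guard.
constexpr unsigned wavesWithActiveThreads(unsigned groupSize, unsigned activeCount) {
    return (clampActive(groupSize, activeCount) + kWaveSize - 1) / kWaveSize;
}

// True if some wavefront contains both passing and failing threads,
// i.e. the guard boundary falls in the middle of a wavefront.
constexpr bool hasDivergentWave(unsigned groupSize, unsigned activeCount) {
    return clampActive(groupSize, activeCount) < groupSize &&
           clampActive(groupSize, activeCount) % kWaveSize != 0;
}

// The case from the question: 512 threads per group, guard `groupInd < 128`.
static_assert(wavesWithActiveThreads(512, 128) == 2, "two wavefronts do the load");
static_assert(!hasDivergentWave(512, 128), "boundary sits on a wavefront edge");
```

Under the same assumptions, a guard such as groupInd < 100 would still touch two wavefronts, but the second one would be divergent because the boundary falls inside it.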