11 hours ago, taby said:
OK, I changed the compute shader to use a local size of 32, but the performance sucks compared to a local size of 1.
I guess you are doing something wrong; this should never happen.
I assume you now accidentally do the same work 32 times, causing 32x the bandwidth for nothing, or something similar.
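One way this can happen (a guess, since the dispatch code is not shown) is raising the local size in the shader while leaving the number of dispatched workgroups unchanged. In OpenGL the local size comes from layout(local_size_x = ...) in the shader and the group count is what you pass to glDispatchCompute, so the two have to be changed together. A minimal sketch of the arithmetic in Python, where total_items and the helper names are hypothetical:

```python
def group_count(total_items: int, local_size: int) -> int:
    """Number of workgroups needed so every item is covered once."""
    # Round up so a partial last group still gets dispatched.
    return (total_items + local_size - 1) // local_size


def total_invocations(groups: int, local_size: int) -> int:
    """How many shader invocations a dispatch actually launches."""
    return groups * local_size


total_items = 1024  # hypothetical problem size

# Local size 1: one group per item, 1024 invocations.
print(total_invocations(group_count(total_items, 1), 1))

# Local size 32, but the group count was left at 1024: 32x the invocations.
print(total_invocations(total_items, 32))

# Local size 32 with the group count reduced to match: 1024 invocations again.
print(total_invocations(group_count(total_items, 32), 32))
```

If the item count is not a multiple of the local size, the rounded-up group count launches a few extra invocations, so the shader also needs a bounds check on its global index.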
11 hours ago, taby said:
Does a local size of 1 signify that the compiler should automatically set the local size?
No, the compiler can't do this.
I always recommend the chapter from OpenGL Super Bible to learn about CS. It's really great and explains all the important things quickly:
Indexing - Work groups, global vs. local thread / work indices (This can be confusing! Pretty sure it's your problem actually.)
Parallel algorithms (scan operations like prefix sums, summed area tables).
LDS usage (which turns CS into something completely different from 'dumb' pixel shaders)
Quick example on indexing: imagine we have 1024 work items.
Using a workgroup size of 1 as you do, it works like this:
CU0: work index 0 (only one thread out of 64 does any work)
CU1: work index 1 (only one thread out of 64 does any work)
CU2: work index 2 (only one thread out of 64 does any work)
...
But what you want is a workgroup size of at least 64:
CU0: work index 0-63 (all threads busy)
CU1: work index 64-127 (all threads busy)
CU2: work index 128-191 (all threads busy)
...
So you get what I mean, and it is simple. But the APIs and the option to have 2D/3D indexing as well make it confusing to get right at first.
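The 1D mapping above can be sketched in a few lines of Python. This only simulates the index arithmetic; it is not shader code. It assumes hardware that runs 64 threads in lockstep per CU, as in the example, and the function names are made up for illustration. The formula mirrors how GLSL defines gl_GlobalInvocationID from gl_WorkGroupID, gl_WorkGroupSize and gl_LocalInvocationID.

```python
WAVE_WIDTH = 64  # threads run together per CU (assumed, as in the example above)


def global_index(group_id: int, local_id: int, local_size: int) -> int:
    # Same arithmetic as gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x
    return group_id * local_size + local_id


def work_range(group_id: int, local_size: int) -> tuple:
    """First and last global work index handled by one workgroup."""
    first = global_index(group_id, 0, local_size)
    last = global_index(group_id, local_size - 1, local_size)
    return first, last


def busy_fraction(local_size: int) -> float:
    """Share of the occupied hardware threads that do useful work."""
    waves = (local_size + WAVE_WIDTH - 1) // WAVE_WIDTH
    return local_size / (waves * WAVE_WIDTH)


for local_size in (1, 64):
    print("local size", local_size, "busy fraction", busy_fraction(local_size))
    for group_id in range(3):
        print("  group", group_id, "work index range", work_range(group_id, local_size))
```

With a local size of 1 each group covers a single index and only 1 of 64 threads is busy; with 64, group 2 covers indices 128 to 191 and every thread has work.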