Alright, so I've set up timer queries wherever needed and profiled the hell out of several variations of the code. First, some notes about the setup:
The application generates a dynamic voxel representation of Crytek's Sponza at 512x512x512 resolution, which is then mip-mapped. To verify that my mip maps are correct, I calculate ambient occlusion using voxel cone tracing. Timestamps are obtained before and after each Dispatch during mipmap generation, and also right before and after the whole mipmap generation loop (so I can calculate the overhead).
And now my test scenarios:
1. Naive generation
Using a 2x2x2 workgroup, I generate each lower mip level from the level above it. The kernel stores 2x2x2 voxels in groupshared memory and computes the lower mip level from them using only the first thread in the workgroup. The results (I've picked 3 samples of profiling output):
Dispatch[0] (256 256 256): 31.613440ms
Dispatch[1] (128 128 128): 4.366080ms
Dispatch[2] (64 64 64): 0.542720ms
Dispatch[3] (32 32 32): 0.068000ms
Dispatch[4] (16 16 16): 0.008640ms
Dispatch[5] (8 8 8): 0.002080ms
Dispatch[6] (4 4 4): 0.000480ms
Dispatch[7] (2 2 2): 0.000320ms
Total Time: 36.616640ms
Call overhead: 0.014880ms
Dispatch[0] (256 256 256): 29.836800ms
Dispatch[1] (128 128 128): 3.298880ms
Dispatch[2] (64 64 64): 0.412480ms
Dispatch[3] (32 32 32): 0.052000ms
Dispatch[4] (16 16 16): 0.007040ms
Dispatch[5] (8 8 8): 0.001280ms
Dispatch[6] (4 4 4): 0.000800ms
Dispatch[7] (2 2 2): 0.003520ms
Total Time: 33.616160ms
Call overhead: 0.003360ms
Dispatch[0] (256 256 256): 31.044640ms
Dispatch[1] (128 128 128): 3.807680ms
Dispatch[2] (64 64 64): 0.485600ms
Dispatch[3] (32 32 32): 0.066720ms
Dispatch[4] (16 16 16): 0.006560ms
Dispatch[5] (8 8 8): 0.001280ms
Dispatch[6] (4 4 4): 0.000800ms
Dispatch[7] (2 2 2): 0.004000ms
Total Time: 35.420960ms
Call overhead: 0.003680ms
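For context, a minimal sketch of such a naive kernel (HLSL; resource names and the float4 payload are my assumptions, and a plain average stands in for the actual filter):

```hlsl
Texture3D<float4>   SrcMip : register(t0); // higher-resolution mip level
RWTexture3D<float4> DstMip : register(u0); // mip level being generated

groupshared float4 gsVoxels[8]; // 2x2x2 voxels staged by the workgroup

[numthreads(2, 2, 2)]
void GenerateMipNaive(uint3 dtid : SV_DispatchThreadID,
                      uint3 gtid : SV_GroupThreadID,
                      uint3 gid  : SV_GroupID)
{
    // Each thread loads one source voxel into groupshared memory.
    uint flat = gtid.x + gtid.y * 2 + gtid.z * 4;
    gsVoxels[flat] = SrcMip[dtid];
    GroupMemoryBarrierWithGroupSync();

    // Only the first thread reduces the 2x2x2 block and writes one voxel;
    // the other 7 threads in the workgroup idle from here on.
    if (flat == 0)
    {
        float4 sum = 0.0;
        for (uint i = 0; i < 8; ++i)
            sum += gsVoxels[i];
        DstMip[gid] = sum / 8.0;
    }
}
```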
2. Generate 2 levels at once
Using a 4x4x4 workgroup, two mip levels are always generated at once. If one level remains at the end (odd number of mip levels), the two smallest are re-generated. Groupshared memory is used again: for the first level, the first thread of each 2x2x2 sub-group does the actual mipmapping; for the second, the first thread of the whole workgroup does.
Dispatch[0] (128 128 128): 8.049120ms
Dispatch[1] (32 32 32): 0.125120ms
Dispatch[2] (8 8 8): 0.004320ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 8.180960ms
Call overhead: 0.002080ms
Dispatch[0] (128 128 128): 8.042560ms
Dispatch[1] (32 32 32): 0.125600ms
Dispatch[2] (8 8 8): 0.004160ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 8.175040ms
Call overhead: 0.002400ms
Dispatch[0] (128 128 128): 7.860160ms
Dispatch[1] (32 32 32): 0.123840ms
Dispatch[2] (8 8 8): 0.004000ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 7.990560ms
Call overhead: 0.002240ms
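A sketch of the two-levels-at-once variant (again HLSL with assumed names, and a plain average in place of the real filter, which divides by the occupied-voxel count):

```hlsl
Texture3D<float4>   SrcMip  : register(t0); // level N (source)
RWTexture3D<float4> DstMip0 : register(u0); // level N+1
RWTexture3D<float4> DstMip1 : register(u1); // level N+2

groupshared float4 gsVoxels[64]; // 4x4x4 block staged by the workgroup

[numthreads(4, 4, 4)]
void GenerateTwoMips(uint3 dtid : SV_DispatchThreadID,
                     uint3 gtid : SV_GroupThreadID,
                     uint3 gid  : SV_GroupID)
{
    uint flat = gtid.x + gtid.y * 4 + gtid.z * 16;
    gsVoxels[flat] = SrcMip[dtid];
    GroupMemoryBarrierWithGroupSync();

    // First level: the first thread of each 2x2x2 sub-group (all-even
    // coordinates) reduces its block and writes one voxel of level N+1.
    if (((gtid.x | gtid.y | gtid.z) & 1) == 0)
    {
        float4 sum = 0.0;
        for (uint z = 0; z < 2; ++z)
        for (uint y = 0; y < 2; ++y)
        for (uint x = 0; x < 2; ++x)
            sum += gsVoxels[(gtid.x + x) + (gtid.y + y) * 4 + (gtid.z + z) * 16];
        DstMip0[dtid / 2] = sum / 8.0;
        gsVoxels[flat] = sum / 8.0; // stage the result for the second pass
    }
    GroupMemoryBarrierWithGroupSync();

    // Second level: the first thread of the whole workgroup reduces the
    // 8 staged results (at even coordinates) into one voxel of level N+2.
    if (flat == 0)
    {
        float4 sum = 0.0;
        for (uint z = 0; z < 4; z += 2)
        for (uint y = 0; y < 4; y += 2)
        for (uint x = 0; x < 4; x += 2)
            sum += gsVoxels[x + y * 4 + z * 16];
        DstMip1[gid] = sum / 8.0;
    }
}
```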
3. Generate 3 levels at once
Since that was an improvement, the logical next step is to go further and generate 3 levels at once, right? So I dispatch an 8x8x8 workgroup and generate 3 mip levels per pass (the masking constants are starting to become a bit of black magic here). The total number of Dispatch calls is down to 3:
Dispatch[0] (64 64 64): 18.375040ms
Dispatch[1] (8 8 8): 0.035040ms
Dispatch[2] (2 2 2): 0.000320ms
Total Time: 18.414240ms
Call overhead: 0.003840ms
Dispatch[0] (64 64 64): 16.648960ms
Dispatch[1] (8 8 8): 0.035360ms
Dispatch[2] (2 2 2): 0.000320ms
Total Time: 16.686240ms
Call overhead: 0.001600ms
Dispatch[0] (64 64 64): 20.219360ms
Dispatch[1] (8 8 8): 0.048960ms
Dispatch[2] (2 2 2): 0.000320ms
Total Time: 20.270560ms
Call overhead: 0.001920ms
Summary
As you can see, the timing got worse. I think I have an explanation for that (though I'm not absolutely sure it's the correct one). In the case where I generate 2 mip levels, I use a 4x4x4 workgroup: each thread first stores its data in groupshared memory, then the data is read back from the appropriate locations and stored again (at the thread_id location). As a cache line for shared memory is 64 bytes, this is efficient.
For the 8x8x8 workgroup generating 3 mip levels at once, the groupshared array has to be 512 bytes in size, so reading from and writing to it at some locations causes a slowdown.
The current code generates the mip chain for a 512^3 volume in about 8 ms on an AMD RX 480 GPU. I still think that's way too much, but maybe it's fast enough!
What next?
Thanks for the advice (any further hints are of course welcome!). I should also note that my mip generation function is not just a sum: it counts the voxels that contain something and uses that count as the divisor. Still, it's only a few operations. I might try doing the sum/counting as a parallel reduction, but that would introduce even more memory barriers, which I suspect would hurt performance (worth trying, though).
EDIT: Side note: parallel reduction won't help; adding more memory barriers further decreases performance. The kernel is clearly memory bound (the 7 additions and 1 division have literally no impact).
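For clarity, the filter described above could look something like this (a sketch of my assumed form, using alpha > 0 as the "contains something" test):

```hlsl
// Average only over occupied voxels; an empty block stays empty.
float4 FilterOccupied(float4 v[8])
{
    float4 sum   = 0.0;
    float  count = 0.0;
    for (uint i = 0; i < 8; ++i)
    {
        sum   += v[i];
        count += (v[i].a > 0.0) ? 1.0 : 0.0;
    }
    // A handful of additions and one division -- negligible ALU work
    // next to the memory traffic.
    return (count > 0.0) ? sum / count : 0.0;
}
```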
Lower resolutions?
Just out of curiosity, I took the variant that generates 2 levels at once and ran it on a 256^3 volume. The results:
Dispatch[0] (64 64 64): 0.988000ms
Dispatch[1] (16 16 16): 0.016480ms
Dispatch[2] (4 4 4): 0.000480ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 1.007360ms
Call overhead: 0.002080ms
Dispatch[0] (64 64 64): 0.974880ms
Dispatch[1] (16 16 16): 0.016480ms
Dispatch[2] (4 4 4): 0.000480ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 0.994240ms
Call overhead: 0.002080ms
Dispatch[0] (64 64 64): 0.985120ms
Dispatch[1] (16 16 16): 0.016480ms
Dispatch[2] (4 4 4): 0.000480ms
Dispatch[3] (2 2 2): 0.000320ms
Total Time: 1.004640ms
Call overhead: 0.002240ms
That's about 1 ms for the whole mip chain, roughly 8 times faster than the 512^3 volume.