Alright, I did those 2 things on the 3-level version, and the timings dropped to:
Dispatch[0] (64 64 64): 7.839040ms
Dispatch[1] (8 8 8): 0.017760ms
Dispatch[2] (2 2 2): 0.000320ms
Total Time: 7.859200ms
Call overhead: 0.002080ms

Dispatch[0] (64 64 64): 7.937120ms
Dispatch[1] (8 8 8): 0.018400ms
Dispatch[2] (2 2 2): 0.000320ms
Total Time: 7.957600ms
Call overhead: 0.001760ms

Dispatch[0] (64 64 64): 8.029280ms
Dispatch[1] (8 8 8): 0.018560ms
Dispatch[2] (2 2 2): 0.000320ms
Total Time: 8.050240ms
Call overhead: 0.002080ms
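
For anyone who wants to sanity-check these numbers, here is a small sketch that re-adds them. It only uses the values from the three runs above; the `runs` layout and the `summarize` helper are made up for illustration. In each run the reported total matches the three dispatches plus the call overhead, and Dispatch[0] accounts for a bit over 99.7% of it.

```python
# Timings copied from the three runs above, in milliseconds.
runs = [
    {"dispatches": [7.839040, 0.017760, 0.000320], "overhead": 0.002080, "total": 7.859200},
    {"dispatches": [7.937120, 0.018400, 0.000320], "overhead": 0.001760, "total": 7.957600},
    {"dispatches": [8.029280, 0.018560, 0.000320], "overhead": 0.002080, "total": 8.050240},
]


def summarize(run):
    """Return (dispatches + overhead, share of the reported total spent in Dispatch[0])."""
    rebuilt = sum(run["dispatches"]) + run["overhead"]
    share = run["dispatches"][0] / run["total"]
    return rebuilt, share


for i, run in enumerate(runs):
    rebuilt, share = summarize(run)
    print(f"run {i}: dispatches + overhead = {rebuilt:.6f} ms, "
          f"reported total = {run['total']:.6f} ms, "
          f"Dispatch[0] share = {share:.2%}")
```

So any further gains would have to come from the first (64 64 64) dispatch; the other two are noise by comparison.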
Let me try the same on the 2-level version.
EDIT: To answer you - they all work when loading from the source, but right after that only one of each 2x2x2 subgroup builds the 1st mip level of the source - that's why the masking is in that place.
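
To make the masking idea concrete, here is a rough CPU-side sketch in Python, not the actual shader. It assumes that "they" are the threads of one group, that the group is 4x4x4 (`GROUP` is a hypothetical size), that the loaded values sit in some shared storage, and that the mip is a plain 2x2x2 average. None of those details are confirmed above; only the "everyone loads, one per 2x2x2 reduces" pattern is.

```python
from itertools import product

GROUP = 4  # hypothetical thread-group edge length (4x4x4 threads), an assumption


def build_mip1(source):
    """Simulate one thread group: every thread loads, one thread per 2x2x2 block reduces."""
    # Phase 1: all threads are active and each loads its own voxel
    # into a dict standing in for group-shared memory.
    shared = {}
    for x, y, z in product(range(GROUP), repeat=3):
        shared[(x, y, z)] = source[x][y][z]

    # A real compute shader would need a group barrier at this point.

    # Phase 2: the mask keeps only the thread at the even corner of each 2x2x2 block.
    mip1 = {}
    active = 0
    for x, y, z in product(range(GROUP), repeat=3):
        if (x | y | z) & 1:
            continue  # masked-out thread: it loaded a voxel but does not reduce
        active += 1
        block = [shared[(x + dx, y + dy, z + dz)]
                 for dx, dy, dz in product(range(2), repeat=3)]
        # Box filter (plain average) is an assumption about the reduction.
        mip1[(x // 2, y // 2, z // 2)] = sum(block) / 8.0
    return mip1, active


# Example input: a 4x4x4 volume where each voxel holds x + y + z.
source = [[[float(x + y + z) for z in range(GROUP)]
           for y in range(GROUP)]
          for x in range(GROUP)]
mip1, active = build_mip1(source)
print(f"{active} of {GROUP ** 3} threads do the reduction, {len(mip1)} mip texels written")
```

The point of putting the mask after the load is that the load phase still uses every thread, and only the reduction runs on 1/8 of them.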