Is [numthreads(1, GROUP_SIZE, GROUP_SIZE)] as efficient as [numthreads(GROUP_SIZE, GROUP_SIZE, 1)]?
CUDA confused me by limiting its z dimension.
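For reference, here is a minimal HLSL sketch of the two layouts being compared. GROUP_SIZE, the entry point names, and the output texture are placeholder assumptions; only the thread-ID mapping differs between the two variants.

#define GROUP_SIZE 8

RWTexture2D<float4> Output : register(u0);

// Variant A: the usual 2D layout, x/y used, z fixed to 1.
[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void CSMainXY(uint3 dtid : SV_DispatchThreadID)
{
    Output[dtid.xy] = float4(1, 0, 0, 1);
}

// Variant B: same thread count per group, but spread over y/z instead of x/y.
[numthreads(1, GROUP_SIZE, GROUP_SIZE)]
void CSMainYZ(uint3 dtid : SV_DispatchThreadID)
{
    Output[dtid.yz] = float4(0, 1, 0, 1);
}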
Personally, I assume there is no dedicated hardware for that kind of dimensional thread partitioning at all, and that it's just a convenience for us.
I haven't looked at any ISA output to prove it, but I do know that caching the thread ID in a register is faster than repeatedly reading it from the built-in API variable (Vulkan and AMD), so I doubt there are three hardware registers permanently holding three indices for nothing.
Does anyone know?
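To illustrate that point, here is a hedged sketch of doing the 2D partitioning by hand instead: a flat 1D group whose 2D coordinate is rebuilt with a modulo/divide. GROUP_SIZE, the entry point name, and the output texture are again just assumptions; if the multi-dimensional layout really is only a software convenience, the compiler presumably generates something close to this for you anyway.

#define GROUP_SIZE 8

RWTexture2D<float4> Output : register(u0);

// One flat x dimension; the 2D coordinate inside the group is rebuilt by hand.
// Dispatched exactly like the 2D version: Dispatch(W / GROUP_SIZE, H / GROUP_SIZE, 1).
[numthreads(GROUP_SIZE * GROUP_SIZE, 1, 1)]
void CSMainFlat(uint3 groupId : SV_GroupID, uint flatIndex : SV_GroupIndex)
{
    uint2 local = uint2(flatIndex % GROUP_SIZE, flatIndex / GROUP_SIZE);
    uint2 pixel = groupId.xy * GROUP_SIZE + local;
    Output[pixel] = float4(0, 0, 1, 1);
}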
Some GCN hardware has a halved wave spawn rate if you use the Z dimension; I'm not sure whether that is still true. On GCN again, there is an input VGPR per dimension and no combined one, at least on PS4 (taken from a compute ISA dump: s14 = s_tgid_x, s15 = s_tgid_y, v0 = v_thread_id_x, v1 = v_thread_id_y).
You could look at the ISA in PIX for AMD to confirm all that on PC.
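For context on how those per-dimension inputs map back to HLSL system values: the D3D compute model defines SV_DispatchThreadID as SV_GroupID * numthreads + SV_GroupThreadID. The sketch below just restates that relationship in shader code (GROUP_SIZE, the entry point name, and the texture are assumptions; this is not ISA output).

#define GROUP_SIZE 8

RWTexture2D<float4> Output : register(u0);

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void CSMainDerived(uint3 groupId : SV_GroupID, uint3 groupThreadId : SV_GroupThreadID)
{
    // The spec-defined relationship; the compiler builds this from the
    // per-dimension group/thread inputs whenever SV_DispatchThreadID is requested.
    uint3 dispatchThreadId = groupId * uint3(GROUP_SIZE, GROUP_SIZE, 1) + groupThreadId;
    Output[dispatchThreadId.xy] = float4(1, 1, 0, 1);
}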