Hello,
With DirectX 12 Shader Model 6.0, Microsoft introduced new shader wave intrinsics,
e.g. WaveActiveBitOr:
Returns the bitwise OR of all the values of the expression across all active lanes in the current wave, and replicates the result to all lanes in the wave.
How many lanes can “all lanes” be at most?
Suppose I have a compute shader whose thread group size matches the typical wave size of NVIDIA or AMD hardware:
- ThreadGroup (4,8,1): 4 x 8 = 32 threads are started per group (maybe typical for NVIDIA hardware)
- ThreadGroup (8,8,1): 8 x 8 = 64 threads are started per group (maybe typical for AMD hardware)
Do these functions only perform well if I choose one of the typical values above that the GPU hardware supports natively, or can I also exchange data across threads in a thread group with a higher thread count,
e.g. ThreadGroup (16,16,1): 16 x 16 = 256 threads are started per group (larger tiles)?
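To make the question concrete, here is a minimal sketch of the situation I mean. The buffer name g_output, the entry point name and the one-dimensional dispatch are my own assumptions for illustration. My understanding (not verified) is that the OR only covers the lanes of one wave, so a 256-thread group would be split into several waves, each with its own result:

```hlsl
// Sketch only, compile as cs_6_0. g_output is a hypothetical debug buffer.
RWStructuredBuffer<uint2> g_output : register(u0);

[numthreads(16, 16, 1)]
void main(uint group_index : SV_GroupIndex, uint3 group_id : SV_GroupID)
{
    // Each thread contributes a single bit derived from its index in the group.
    const uint my_bit = 1u << (group_index % 32);

    // Bitwise OR across the active lanes of the current wave,
    // replicated to all lanes of that wave.
    const uint wave_or = WaveActiveBitOr(my_bit);

    // Store the result together with the wave size reported by the hardware,
    // so both can be inspected per thread (assumes groups are dispatched along x only).
    g_output[group_id.x * 256 + group_index] = uint2(wave_or, WaveGetLaneCount());
}
```

If every thread in the group ends up with the same wave_or value, the intrinsic would span the whole group; if the values differ between blocks of threads, it only spans one wave.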
I ask because I am working in depth with the AMD GPUOpen code for denoising hybrid ray-traced shadows.
There, the intrinsic is called directly only for the typical AMD wave size of 64.
In the alternative path they synchronize manually through a groupshared memory variable.
I wonder why they don't check WaveGetLaneCount() < lane_count_in_thread_group:
bool FFX_DNSR_Shadows_ThreadGroupAllTrue(bool val)
{
    const uint lane_count_in_thread_group = 64;
    if (WaveGetLaneCount() == lane_count_in_thread_group)
    {
        return WaveActiveAllTrue(val);
    }
    else
    {
        GroupMemoryBarrierWithGroupSync();
        g_FFX_DNSR_Shadows_false_count = 0;
        GroupMemoryBarrierWithGroupSync();
        if (!val) g_FFX_DNSR_Shadows_false_count = 1;
        GroupMemoryBarrierWithGroupSync();
        return g_FFX_DNSR_Shadows_false_count == 0;
    }
}
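
For comparison, this is roughly what I would have expected for thread groups larger than one wave: reduce inside each wave with the intrinsic first, then combine the per-wave results through groupshared memory. This is only a sketch with hypothetical names (g_any_wave_false, ThreadGroupAllTrue_Sketch), not code from the AMD source, and it assumes the function is called from control flow that is uniform across the whole thread group:

```hlsl
// Hypothetical variable, not part of the AMD code.
groupshared uint g_any_wave_false;

bool ThreadGroupAllTrue_Sketch(bool val)
{
    // Step 1: reduce within the current wave using the intrinsic.
    const bool wave_all_true = WaveActiveAllTrue(val);

    // Step 2: combine the per-wave results via groupshared memory.
    GroupMemoryBarrierWithGroupSync();
    g_any_wave_false = 0;
    GroupMemoryBarrierWithGroupSync();

    // Only one lane per wave needs to report a failure.
    if (!wave_all_true && WaveIsFirstLane())
    {
        g_any_wave_false = 1;
    }
    GroupMemoryBarrierWithGroupSync();

    return g_any_wave_false == 0;
}
```

This would still pay for the barriers when a single wave already covers the whole group, which may be the reason for the early-out on the wave size in the original code.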