Just now, Infinisearch said:
Might that behavior change in the future? Or is that entirely in AMD's hands? Is NVIDIA any different?
I remember reading somewhere that AMD seems to like a thread group size of at least 256; am I misremembering? Doesn't this have something to do with hiding memory latency? I'm not that experienced with compute shaders yet, and my memory of what I did learn isn't that great.
It seems reasonable to imagine it /could/ change in the future (or may even have already changed?), but that's up to the hardware; it's not a D3D/HLSL thing. No idea what the behaviour is on other IHVs.
I don't think the size of a thread group has much bearing on being able to hide memory latency per se. Obviously you want to make sure the hardware has enough waves to be able to switch between them; 4 per SIMD (16 per CU) is a reasonable target to aim for. But whether that's 16 waves from 16 different thread groups or 16 waves from a single thread group doesn't matter too much, so long as enough of them are making forward progress and executing instructions while others wait on memory.
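To put rough numbers on that, here is a quick Python sketch of the arithmetic. It assumes 64-lane waves (which is what makes a 1024-thread group 16 waves on AMD) and uses the 16-waves-per-CU target from above. The function names are made up for illustration, and it ignores register and LDS limits, which can cap the real wave count well below this.

```python
import math

WAVE_SIZE = 64             # assumption: 64 threads per wave
TARGET_WAVES_PER_CU = 16   # target from the post: 4 waves per SIMD, 16 per CU

def waves_per_group(threads_per_group, wave_size=WAVE_SIZE):
    """Number of waves one thread group occupies (partial waves round up)."""
    return math.ceil(threads_per_group / wave_size)

def groups_to_reach_target(threads_per_group, target=TARGET_WAVES_PER_CU):
    """Thread groups that must be resident on a CU to reach the target wave count."""
    return math.ceil(target / waves_per_group(threads_per_group))

for size in (64, 256, 1024):
    print(size, waves_per_group(size), groups_to_reach_target(size))
```

Under those assumptions, 16 groups of 64 threads, 4 groups of 256, or 1 group of 1024 all land on the same 16 waves, which is the point: the wave count matters, not how it is split into groups.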
You don't want to be writing a 1024-thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers; that's not going to help you hide latency at all. There's nothing inherently wrong with larger thread groups. You just need to be aware of how the waves get scheduled and ensure that you don't have too many waves sitting around doing nothing.
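As a toy illustration of the lopsided case (not a model of any real scheduler), suppose every wave in a group has to wait at a barrier until the slowest wave arrives. The work figures below are invented units, and `busy_fraction` is a hypothetical helper, not anything from D3D or a vendor tool.

```python
def busy_fraction(work_per_wave):
    """Fraction of total wave-time spent working when every wave
    must wait at a barrier for the slowest wave in the group."""
    slowest = max(work_per_wave)
    return sum(work_per_wave) / (slowest * len(work_per_wave))

balanced = [4] * 16          # 16 waves sharing the work evenly
lopsided = [16] + [1] * 15   # one wave does most of the work, 15 mostly wait

print(busy_fraction(balanced))
print(busy_fraction(lopsided))
```

In this simplified picture the balanced group keeps every wave busy, while the lopsided one spends most of its wave-time stalled, so those 15 waves occupy slots without giving the hardware anything useful to switch to.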