6 hours ago, taby said: What are static command buffers?
What's the point of using glDispatchComputeIndirect?
Both are about avoiding expensive communication with the CPU.
Let's say you have a task that determines the necessary work, e.g. frustum culling which writes a list of surviving triangles, or texture space shading which generates a list of blocks requiring an update. Both tasks generate a list whose size varies each frame.
With indirect dispatch, the list size can be used to set the necessary number of workgroups directly on the GPU. Without it, you need to download the list size to the CPU to generate the dispatch command, and then the command needs to be uploaded to the GPU again. That causes huge latencies and idle time spent waiting.
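To make the indirect part concrete, here is a minimal CPU-side sketch of the arguments an indirect dispatch reads. The three-uint layout is what glDispatchComputeIndirect expects in the buffer bound to GL_DISPATCH_INDIRECT_BUFFER (Vulkan's VkDispatchIndirectCommand has the same shape). The helper name make_dispatch_args and the group size of 64 are my own assumptions for illustration. In a real renderer this calculation would run in a tiny compute shader that writes into the indirect buffer, so the count never leaves the GPU.

```cpp
#include <cassert>
#include <cstdint>

// Same layout as the three unsigned ints an indirect compute dispatch
// reads from the indirect buffer.
struct DispatchIndirectCommand {
    std::uint32_t num_groups_x;
    std::uint32_t num_groups_y;
    std::uint32_t num_groups_z;
};

// Hypothetical helper (not part of any API): turns the list size produced by
// a culling pass into workgroup counts. On the GPU, a small "prepare" shader
// would do the same math and store the result in the indirect buffer.
DispatchIndirectCommand make_dispatch_args(std::uint32_t item_count,
                                           std::uint32_t group_size) {
    assert(group_size > 0);
    // Round up so a partially filled last workgroup is still launched.
    const std::uint32_t groups = (item_count + group_size - 1) / group_size;
    return DispatchIndirectCommand{groups, 1u, 1u};
}
```

For example, make_dispatch_args(1000, 64) yields 16 groups in x. An empty list yields 0 groups, which is a legal dispatch that simply does no work.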
Command buffers allow you to store lots of commands (e.g. cull, draw, deferred shading) in a single buffer and upload it just once. So if you do the same work every frame (like in games), you could implement a complete renderer by just uploading object matrices and executing the command buffer. The whole frame then takes just one single "draw call" (one command buffer submission), avoiding all the costly but useless CPU<->GPU communication we know from the past.
As I said, this makes my complex compute shader project (realtime GI) almost twice as fast: the more dispatches, the bigger the win.
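The record-once, replay-every-frame idea can be sketched with a toy model. This is not a real graphics API: CommandKind, CommandBuffer, ToyQueue and run_frames are all hypothetical names I made up, and the "GPU" is just a pair of counters. It only shows that the CPU cost per frame shrinks to one submission, no matter how many commands the buffer holds.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for real API objects; everything here is hypothetical.
enum class CommandKind { Cull, DrawIndirect, DeferredShading };

struct CommandBuffer {
    std::vector<CommandKind> commands;  // recorded once, never rebuilt
};

struct ToyQueue {
    int submissions = 0;        // CPU -> GPU round trips
    int executed_commands = 0;  // work the "GPU" actually performed

    void submit(const CommandBuffer& cb) {
        ++submissions;
        executed_commands += static_cast<int>(cb.commands.size());
    }
};

// Record the whole frame up front.
CommandBuffer record_frame_commands() {
    CommandBuffer cb;
    cb.commands = {CommandKind::Cull, CommandKind::DrawIndirect,
                   CommandKind::DeferredShading};
    return cb;
}

// Replay the same static buffer every frame.
ToyQueue run_frames(int frame_count) {
    assert(frame_count >= 0);
    const CommandBuffer frame = record_frame_commands();  // done only once
    ToyQueue queue;
    for (int i = 0; i < frame_count; ++i) {
        // Per-frame CPU work would be limited to uploading object matrices
        // (omitted here), followed by a single submission.
        queue.submit(frame);
    }
    return queue;
}
```

Running three frames gives three submissions but nine executed commands. The ratio gets better the more passes you pack into the buffer.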
But unfortunately there is still something missing from both DX12 and VK:
It may happen that indirect dispatches end up doing zero work, but the expensive barriers are still executed for nothing. Neither DX12, nor VK Conditional Draw, nor the NV Device Generated Command Buffers extension includes support for barriers. Mantle has it in the form of conditional command buffer execution, and OpenCL 2 has it in the form of device-side enqueue. So I'm begging for that at every opportunity.
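A toy model of the barrier problem, again with hypothetical names (PassStats, run_passes) and no real API behind it. Each entry in group_counts stands for the workgroup count some GPU pass wrote for one indirect dispatch. With conditional_execution set to false, it mimics the situation described above: the barrier after each dispatch always runs. Set to true, it mimics the kind of conditional execution I'm asking for, where an empty dispatch takes its barrier with it.

```cpp
#include <cstdint>
#include <vector>

struct PassStats {
    int dispatches_with_work = 0;
    int barriers_executed = 0;
};

// Hypothetical simulation: one indirect dispatch plus one barrier per entry.
PassStats run_passes(const std::vector<std::uint32_t>& group_counts,
                     bool conditional_execution) {
    PassStats stats;
    for (std::uint32_t groups : group_counts) {
        if (conditional_execution && groups == 0) {
            continue;  // skip both the empty dispatch and its barrier
        }
        if (groups > 0) {
            ++stats.dispatches_with_work;
        }
        ++stats.barriers_executed;  // otherwise paid even for zero work
    }
    return stats;
}
```

With group counts {4, 0, 0, 2}, the unconditional path executes four barriers for two useful dispatches, while the conditional path executes only two.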
And the second issue: if you want to use async compute (which also allows running multiple compute tasks at the same time, not just rendering plus a single compute task), you have to break your nice static command buffers into multiple smaller ones to feed multiple queues. This, plus the sync between queues, adds costly overhead, which can cancel out the potential win for small workloads, where you need it the most.
So what we want is for the GPU to be able to generate its own work completely independently of the CPU. It's not clear whether (or which) hardware can do this already, but looking at other APIs, it's certain we are missing at least some things.