
Discarding threads in compute engine

Started by NikiTo May 26, 2018 04:37 PM
4 comments, last by Hodgman 6 years, 8 months ago

I was working with pixel shaders until now, but a moment has come where I may have to switch to compute shaders (using the results of the pixel shaders in compute shaders), and I am struggling with the design of the implementation.
Until now I was using geometry and stencil/depth tests to discard texels, but I don't know how to discard threads in order to save work in compute shaders. At least I can't see an elegant/natural way to do it.

I could try to use tiles to send only the required work to the compute engine, but many of those tiles contain just a few threads in total. I think this could be bad for performance, as I would not be keeping all the processors in the GPU busy all the time.

Please excuse me, I am touching compute shaders for the first time, and most probably I am missing something.

For example, if a game designer decides not to simulate hair that is not visible (or to simulate it with less precision), how should they design the solution?
If the player is looking at their feet, how do I simulate only the few visible waves, not the "whole" ocean? It seems I can send only boxes of workload to the compute engine.

(At the cost of a few more reads/writes, I can solve my problem by continuing to use pixel shaders, so I am hesitant.)

I am not sure what you are asking. You can skip threads in a compute shader by either:

  • Not calling Dispatch, or calling it with 0 for any dimension of the thread groups
  • Using a return; statement in the shader (see the sketch after this list)
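
For illustration, here is a minimal HLSL sketch of the early-return approach; the mask texture and resource names are hypothetical:

    // Hypothetical sketch: skip threads that fall outside a region of interest.
    Texture2D<float>    gMask   : register(t0); // 1 = needs work, 0 = skip
    RWTexture2D<float4> gOutput : register(u0);

    [numthreads(8, 8, 1)]
    void CSMain(uint3 id : SV_DispatchThreadID)
    {
        // Early out: the thread is still scheduled, but does almost no work.
        if (gMask[id.xy] == 0.0f)
            return;

        gOutput[id.xy] = float4(1, 0, 0, 1); // the expensive work goes here
    }

Keep in mind that a returning thread still occupies its slot in the wave; the real savings come when an entire wave takes the early out together.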

Computing only visible surfaces is not a general problem that can be solved trivially; it is your own logic that determines what gets computed. Also, indirect dispatch can be utilized when you want to remain entirely on the GPU and the CPU doesn't know how many thread groups need to be dispatched. It works by having a previous shader write a dispatch argument buffer containing the thread group counts (three uints) and providing that buffer to the DispatchIndirect API call.
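
As a sketch of that pattern (buffer layouts and names are made up for illustration), an earlier shader might fill the argument buffer like this:

    // Hypothetical sketch: one thread writes the three thread-group counts
    // that a later DispatchIndirect call will consume.
    RWByteAddressBuffer      gDispatchArgs : register(u0); // 12 bytes: X, Y, Z
    RWStructuredBuffer<uint> gWorkCounter  : register(u1); // filled by a previous pass

    [numthreads(1, 1, 1)]
    void WriteArgsCS()
    {
        uint itemCount = gWorkCounter[0];
        uint groups    = (itemCount + 63) / 64; // assuming 64 threads per group
        gDispatchArgs.Store3(0, uint3(groups, 1, 1));
    }

On the host side, D3D11 exposes this as DispatchIndirect(argsBuffer, offset); in D3D12 the equivalent is ExecuteIndirect with a dispatch command signature.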

1 hour ago, NikiTo said:

(At the cost of a few more reads/writes, I can solve my problem by continuing to use pixel shaders, so I am hesitant.)

Get rid of your doubts and try it :)

You can do very fine-grained work distribution with compute shaders, e.g. one shader writing to multiple work lists and setting their sizes as indirect dispatch counts for later shaders. Trees and other kinds of acceleration structures can be built and processed entirely on the GPU; with DX12/VK this can even be done by executing a single command buffer which contains all dispatches and the necessary barriers. (A sketch of the work-list idea follows.)
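
A minimal sketch of the work-list idea, assuming a classification pass that decides per pixel/tile whether later work is needed (all names hypothetical):

    // Hypothetical sketch: append work items to a list; the counter later
    // becomes the basis for an indirect dispatch count (see previous post).
    Texture2D<float>          gNeedsWork : register(t0);
    RWStructuredBuffer<uint2> gWorkList  : register(u0); // e.g. tile coordinates
    RWStructuredBuffer<uint>  gWorkCount : register(u1); // cleared to 0 beforehand

    [numthreads(8, 8, 1)]
    void ClassifyCS(uint3 id : SV_DispatchThreadID)
    {
        if (gNeedsWork[id.xy] > 0.0f)
        {
            uint slot;
            InterlockedAdd(gWorkCount[0], 1, slot); // reserve a unique slot
            gWorkList[slot] = id.xy;
        }
    }

A tiny one-thread pass (like the one shown earlier) then converts gWorkCount into thread-group counts for the next shader.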

2 hours ago, NikiTo said:

I could try to use tiles to send only the required work to the compute engine, but many of those tiles contain just a few threads in total. I think this could be bad for performance, as I would not be keeping all the processors in the GPU busy all the time.

It's often a problem that the GPU cannot be saturated if we do only the necessary work, but that's no excuse to prefer brute force. Instead you should aim for async compute. GCN can process multiple compute dispatches in parallel using multiple queues, and even in a single queue if no barriers get in the way, all while the graphics pipeline is working as well. This helps to minimize the small-workload problem. (I don't know about Nvidia.) However, it's difficult to tune and requires a lot of experimentation and customization for individual hardware.

2 hours ago, NikiTo said:

For example, if a game designer decides not to simulate hair that is not visible (or to simulate it with less precision), how should they design the solution?
If the player is looking at their feet, how do I simulate only the few visible waves, not the "whole" ocean? It seems I can send only boxes of workload to the compute engine.

Using a shader which does frustum culling and directs work to the following shaders using indirect dispatches, as said (a sketch follows below). This can often produce zero-work dispatches, which still have some cost, mainly because barriers will be executed even when there is no need (currently we cannot skip over barriers on the GPU; that's something still missing from the low-level APIs). Again, async compute can help to fill those bubbles if we use multiple queues.
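
A sketch of such a culling shader, assuming each simulation patch (hair clump, ocean tile, ...) has a bounding sphere; the frustum-plane convention and all names are assumptions:

    // Hypothetical sketch: test each patch against the six frustum planes
    // and append only the visible ones for a later indirect dispatch.
    struct Patch { float3 center; float radius; };

    StructuredBuffer<Patch>  gPatches      : register(t0);
    RWStructuredBuffer<uint> gVisibleList  : register(u0);
    RWStructuredBuffer<uint> gVisibleCount : register(u1); // cleared to 0

    cbuffer Camera : register(b0)
    {
        float4 gFrustumPlanes[6]; // xyz = plane normal, w = distance
        uint   gNumPatches;
    };

    [numthreads(64, 1, 1)]
    void CullCS(uint3 id : SV_DispatchThreadID)
    {
        if (id.x >= gNumPatches)
            return;

        Patch p = gPatches[id.x];
        bool visible = true;
        [unroll]
        for (int i = 0; i < 6; ++i)
        {
            // Entirely behind one plane means entirely outside the frustum.
            if (dot(gFrustumPlanes[i].xyz, p.center) + gFrustumPlanes[i].w < -p.radius)
                visible = false;
        }
        if (visible)
        {
            uint slot;
            InterlockedAdd(gVisibleCount[0], 1, slot);
            gVisibleList[slot] = id.x;
        }
    }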

... but all this is meant for very aggressive optimization if necessary.

The real advantage of compute shaders is that you can do pretty much anything as long as parallelization makes sense. Pixel shaders, in contrast, run in isolation from other threads; they are totally dumb and just brute force.
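
To make that contrast concrete: threads within a compute thread group can cooperate through groupshared memory and barriers, which pixel shader invocations cannot do. A minimal sketch (a parallel sum per 64-thread group; all names are illustrative):

    // Hypothetical sketch: a per-group parallel reduction using groupshared
    // memory, which cannot be expressed in a pixel shader.
    StructuredBuffer<float>   gInput     : register(t0);
    RWStructuredBuffer<float> gGroupSums : register(u0);

    groupshared float sData[64];

    [numthreads(64, 1, 1)]
    void ReduceCS(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID,
                  uint3 dtid : SV_DispatchThreadID)
    {
        sData[gtid.x] = gInput[dtid.x];
        GroupMemoryBarrierWithGroupSync();

        // Each step halves the number of active threads.
        for (uint stride = 32; stride > 0; stride >>= 1)
        {
            if (gtid.x < stride)
                sData[gtid.x] += sData[gtid.x + stride];
            GroupMemoryBarrierWithGroupSync();
        }

        if (gtid.x == 0)
            gGroupSums[gid.x] = sData[0]; // one partial sum per group
    }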

 

Thank you, all!
It's starting to get so complicated that it becomes discouraging.

If you run a pixel shader on 3 pixels, it's wasteful too, as the GPU's hardware will always work on 64/32/8 pixels at a time (AMD/NV/Intel).

If you're talking about using the GPU to determine what work needs to be done - e.g. the vertex shader places some triangles on-screen and some off-screen, and no pixel shaders are executed for the off-screen ones -

then in that case, look into DrawIndirect / DispatchIndirect / ExecuteIndirect. Instead of the CPU specifying how many thread groups / triangles to draw/dispatch, the CPU specifies a buffer where that information will later be available. You can use an earlier dispatch to fill in this buffer with the actual arguments. This lets you implement things like the VS->PS pipeline mentioned above: one compute shader that outputs the size of the workload for another compute shader.
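
A sketch of that producer/consumer pairing, with hypothetical names; pass A sizes the workload, and pass B is launched via DispatchIndirect and skips its padding threads:

    // Hypothetical sketch: shader A writes the dispatch arguments for shader B.
    RWByteAddressBuffer      gArgs      : register(u0); // X, Y, Z group counts
    RWStructuredBuffer<uint> gItems     : register(u1);
    RWStructuredBuffer<uint> gItemCount : register(u2); // filled by earlier work

    [numthreads(1, 1, 1)]
    void SizeWorkloadCS()
    {
        uint count = gItemCount[0];
        gArgs.Store3(0, uint3((count + 63) / 64, 1, 1)); // 64 threads per group
    }

    [numthreads(64, 1, 1)]
    void ProcessCS(uint3 id : SV_DispatchThreadID)
    {
        // The last group is usually only partially full: skip padding threads.
        if (id.x >= gItemCount[0])
            return;
        gItems[id.x] *= 2; // stand-in for the real work
    }

    // Host side: Dispatch SizeWorkloadCS, issue a barrier on gArgs, then
    // DispatchIndirect(gArgs) / ExecuteIndirect for ProcessCS.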

This topic is closed to new replies.
