
D3D11: using DirectCompute in multiple threads

Started by February 27, 2025 07:49 PM
7 comments, last by JoeJ 35 minutes ago

I haven't been able to find a good answer regarding using DirectCompute on multiple threads in a program. I'm trying to add DirectCompute functionality to my program (a scientific app, not a game) to help with some of my algorithms. However, some of these algorithms can run on both the UI thread, which uses D3D11 for rendering, and on a background thread. The Dispatch() call for a compute shader has to be made on the ID3D11DeviceContext, which should only be used by one thread at a time, so I won't be able to use a single device + context across multiple threads. Some questions:

  1. I will need to create a device+context for each thread that's going to use DirectCompute, correct?
  2. An ID3D11Device is not tied to the thread that created it, correct?
  3. In my UI thread, which has its own device+context for rendering, I can use a second device+context for DirectCompute, correct?

My thinking is to have a small cache of ID3D11Device + ID3D11DeviceContext pairs that can be grabbed when needed for DirectCompute and released back to the cache when done. Does that make sense, and will it work?
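The cache you describe could look something like this. This is a minimal sketch, not D3D11 code: `ComputeContext`, `ContextPool`, and their members are hypothetical names, and the struct is a stand-in for the real `ComPtr<ID3D11Device>` / `ComPtr<ID3D11DeviceContext>` pair.

```cpp
#include <mutex>
#include <optional>
#include <vector>

// Placeholder for an ID3D11Device + ID3D11DeviceContext pair; in the real
// program this would hold the actual COM pointers.
struct ComputeContext {
    int id;  // stand-in for the D3D11 objects
};

// Minimal thread-safe pool: a thread grabs a context, uses it exclusively
// for its compute work, and releases it back when done.
class ContextPool {
public:
    explicit ContextPool(int count) {
        for (int i = 0; i < count; ++i)
            free_.push_back(ComputeContext{i});
    }

    // Returns a context if one is available, std::nullopt otherwise.
    std::optional<ComputeContext> Acquire() {
        std::lock_guard<std::mutex> lock(mtx_);
        if (free_.empty()) return std::nullopt;
        ComputeContext c = free_.back();
        free_.pop_back();
        return c;
    }

    void Release(ComputeContext c) {
        std::lock_guard<std::mutex> lock(mtx_);
        free_.push_back(c);
    }

private:
    std::mutex mtx_;
    std::vector<ComputeContext> free_;
};
```

A caller would `Acquire()` before dispatching compute work and `Release()` afterwards; whether the D3D11 objects themselves can be used this way is exactly the question the replies below address.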

Thanks

You can try using deferred contexts or guarding your device context with a mutex.
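The mutex option is the simpler of the two to sketch. Below is an illustrative model only: `FakeContext` stands in for `ID3D11DeviceContext`, and `GuardedContext` is a hypothetical wrapper name; the point is just that every call into the shared immediate context goes through one lock.

```cpp
#include <mutex>

// Stand-in for ID3D11DeviceContext; real code would hold a
// ComPtr<ID3D11DeviceContext> and forward to its Dispatch().
struct FakeContext {
    int dispatch_count = 0;
    void Dispatch(int, int, int) { ++dispatch_count; }
};

// Serializes all access to the single shared context, so any thread may
// call Dispatch() but only one does so at a time.
class GuardedContext {
public:
    void Dispatch(int x, int y, int z) {
        std::lock_guard<std::mutex> lock(mtx_);
        ctx_.Dispatch(x, y, z);
    }
    int DispatchCount() {
        std::lock_guard<std::mutex> lock(mtx_);
        return ctx_.dispatch_count;
    }

private:
    std::mutex mtx_;
    FakeContext ctx_;
};
```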


+1 for what @aerodactyl55 said, deferred context is the way. However, it seems like you are trying to solve issues that you have not yet encountered, meaning you are trying to optimize prematurely. Is it possible a single thread would work, and reduce complexity? Also, if there is only a single GPU in the system, all you are really doing is threading the command-list submission; there is only one GPU doing all the work.

Thanks for the replies.

I really don't want to try to share a single D3D11 device between threads. I'd prefer for Windows to manage sharing the hardware between threads (saving/restoring hardware states, etc. as needed). That way I can weave in the DirectCompute work freely (meaning, without concern for what's happening in other threads) in my algorithms. An analogy: I can use the FPU freely in any thread without having to explicitly “share” it. That's what I want to do with DirectCompute.

Note: it's been two decades since I wrote OS graphics driver code, so I don't know what goes on under the OS covers with GPUs. I can't seem to find that information anywhere. If a GPU can't be shared freely then I need to move on to a different solution (probably ignore DirectCompute completely).

Thanks again


Using multiple devices is going to create additional complexity, as the resources owned by each device (shaders, textures, etc.) are not shared by default, which means additional work will be required.

Using a per-thread deferred context with a single device is probably going to be the path of least resistance, imo.

As for the FPU analogy: the physical FPU may be shared, but the compiler still has to generate code to preserve FPU state for the current stack frame, so the "sharing" appears seamless. The paradigm for the GPU is different, since the FPU is just a single functional unit of the CPU. The GPU analogy would be the streaming processors, which the GPU handles more or less like the FPU (scheduling work, etc.).

shader25 said:
An analogy: I can use the FPU freely in any thread without having to explicitly “share” it. That's what I want to do with DirectCompute.

If there is no compute interaction with graphics, maybe you could achieve this by using dedicated compute APIs besides DX11. Then you would leave it to the OS and driver to schedule the workloads. Besides CUDA or OpenCL, DX12 or Vulkan can also be used as compute-only APIs, afaict.

However, idk which of those APIs support independent contexts for multiple threads.
In games, all we want is to generate command lists from multiple threads, but then enqueue them all from the render thread once per frame, so we have control over parallel execution on the GPU using the given synchronization options.

Personally I would not rule out this option so quickly. Your threads could enqueue a compute task, then either go idle and be notified whenever the work is done, or keep running and poll for results being ready. Not much work to implement, and the behavior should be the same as if the OS did the same thing under the hood.
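The enqueue-and-wait idea above can be sketched with a single submission thread and futures. This is a hedged model, not D3D11 code: `ComputeQueue` and `Submit` are hypothetical names, and the "GPU work" is a plain function where the real app would do the Dispatch and readback on the one thread that owns the context.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// One worker thread owns the (imagined) device context; other threads
// submit tasks and either block on the future or poll it.
class ComputeQueue {
public:
    ComputeQueue() : worker_([this] { Run(); }) {}

    ~ComputeQueue() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Enqueue a task; the returned future becomes ready when the
    // submission thread has executed it.
    std::future<int> Submit(std::function<int()> task) {
        auto p = std::make_shared<std::promise<int>>();
        std::future<int> fut = p->get_future();
        {
            std::lock_guard<std::mutex> lock(mtx_);
            tasks_.push([p, task] { p->set_value(task()); });
        }
        cv_.notify_one();
        return fut;
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mtx_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // done_ set and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }

    std::mutex mtx_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_;  // declared last so it starts after the members above
};
```

Calling `future::get()` gives the "go idle until notified" behavior; `future::wait_for` with a zero timeout gives the polling variant.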
