Multithreaded Rendering

Started by January 04, 2019 05:28 PM
10 comments, last by ChuckNovice 6 years, 1 month ago

Hello everyone.

I am currently looking at remaking our rendering back-end from the ground up. The goal is to multi-thread rendering and to move to 2nd generation APIs. So I have been looking at those APIs (mostly Vulkan and, to a lesser extent, DX12; the two are quite similar) and I think I have a decent understanding of how they work.

Just to give the big picture, the front-end of the rendering system is responsible for implementing the rendering logic of different parts of the scene, such as terrain (LoD, view-frustum culling), and produces rendering commands for the back-end in the form of state objects and resources to bind. These objects are created using builder objects and can be built on multiple threads since they don't contain any GPU objects. When they are submitted to the back-end, the rendering thread just sets the states and performs the draw calls.

This design looks very friendly to 2nd generation APIs at first glance. I took a look at the DOOM3-BFG Vulkan renderer: it uses an array of "frame" objects (one for each image of the swapchain), each with its own command buffer. When drawing commands are submitted, it waits on the fence of the current frame (which most of the time will already be signaled), records the command buffer on the presentation thread, and simply ping-pongs between the two command buffers from frame to frame. It's easy, but it doesn't leverage the API's capabilities.

My idea is to use a similar frame mechanism for the back-end while building command buffers on other threads. The builder objects could still be used by the front-end (taking care to make render passes first-class citizens in the builder API). The builders could create Vulkan objects directly and produce opaque objects that the rendering thread would just have to submit.

At this point in my thinking, the main issue I'm facing is managing the life cycle of the command buffers, since they are allocated from command pools. The rendering thread could reset the command buffers (by resetting the pool) and then hand them back to a queue where the front-end could pick them up, but that would require the pool to be submitted along with the command buffers, and it would also require more synchronisation.
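The hand-back idea above could be sketched roughly as follows. This is only a stdlib model, not Vulkan code: `RecordingSet` (a pool bundled with what was recorded from it) and `RecycleQueue` are hypothetical names, and `reset()` merely stands in for resetting the real pool.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical bundle: a command pool together with what was recorded from it.
// Keeping them together is what lets the rendering thread reset the pool.
struct RecordingSet {
    int pool_id = 0;
    std::vector<int> commands;            // stand-in for recorded commands
    void reset() { commands.clear(); }    // stand-in for resetting the pool
};

// Mutex-protected queue of sets that are free to be recorded into again.
class RecycleQueue {
public:
    // Rendering thread: call once the GPU has finished executing the set.
    void give_back(RecordingSet set) {
        set.reset();
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(std::move(set));
    }

    // Front-end thread: returns false when nothing is free yet.
    bool try_take(RecordingSet& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return false;
        out = std::move(free_.front());
        free_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<RecordingSet> free_;
};
```

The lock is the extra synchronisation mentioned above; it is only taken once per hand-off, not once per command.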

We think in generalities, but we live in details.
- Alfred North Whitehead

I'm by no means a Vulkan expert, but I'm in a similar position to you in my current attempts to use newer-gen graphics APIs effectively. From my understanding, the best multithreaded back-end design might be something like the following. (Anyone with more experience, please correct me if I'm totally off here.)

Command pools shouldn't be re-used while existing command buffers are in flight, so starting with an approach like DOOM3-BFG is best (a command pool per swapchain image, entirely reset in one go using vkResetCommandPool instead of resetting command buffers individually). This is the bare minimum, and if you're only recording to command buffers from a single thread this is all you'll need.
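As a rough illustration of the per-image pool scheme, here is a stdlib-only model. `CommandPool`, `Fence`, `FrameContext` and `FrameRing` are hypothetical stand-ins; in real code `reset()` would be vkResetCommandPool and the fence check would be a blocking vkWaitForFences.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a VkCommandPool and the command buffers allocated from it.
struct CommandPool {
    std::vector<int> recorded;            // pretend commands
    void reset() { recorded.clear(); }    // plays the role of vkResetCommandPool
};

// Stand-in for a VkFence; starts signalled so the first use does not stall.
struct Fence {
    bool signalled = true;
};

struct FrameContext {
    CommandPool pool;
    Fence fence;
};

template <std::size_t ImageCount>
class FrameRing {
public:
    // Returns the context to record into, or nullptr while the GPU still owns it
    // (a real renderer would block on the fence here instead of returning).
    FrameContext* try_begin_frame() {
        FrameContext& frame = frames_[index_];
        if (!frame.fence.signalled) return nullptr;
        frame.pool.reset();               // whole pool reset in one go
        frame.fence.signalled = false;    // fence is reset before the next submit
        return &frame;
    }

    // Advance to the next swapchain image's context.
    void end_frame() { index_ = (index_ + 1) % ImageCount; }

    // Models the GPU signalling the fence of one image's frame.
    void gpu_finished(std::size_t image) {
        assert(image < ImageCount);
        frames_[image].fence.signalled = true;
    }

private:
    std::array<FrameContext, ImageCount> frames_{};
    std::size_t index_ = 0;
};
```

With `FrameRing<2>`, a third `try_begin_frame()` returns nullptr until `gpu_finished(0)` has been called, which is the "pool must not be re-used while in flight" rule in miniature.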

If you want to record from multiple threads, things get a bit more complicated. Command pools (and command buffers allocated from them) are externally synchronized, so you'll want one per recording thread; this means you now have a grid of N×M command pools (N recording threads by M swapchain images). Once they're all recorded, you can then submit them all to the graphics queue using vkQueueSubmit; they'll be processed in the order they appear in the array used as input. (I'm somewhat unclear about whether this is technically correct, but read here in the Vulkan spec and maybe more light bulbs will go off in your head than did in mine: https://vulkan.lunarg.com/doc/view/1.1.92.1/mac/vkspec.html#synchronization-submission-order).
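The N×M layout can be kept in one flat array. The sketch below is an assumption about how one might index it; `PoolGrid` and its `CommandPool` stand-in (which just counts resets) are hypothetical and do not wrap real Vulkan handles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a VkCommandPool; only counts how often it was reset.
struct CommandPool {
    int resets = 0;
    void reset() { ++resets; }
};

// One pool per (recording thread, swapchain image) pair.
class PoolGrid {
public:
    PoolGrid(std::size_t threads, std::size_t images)
        : threads_(threads), images_(images), pools_(threads * images) {}

    // The only pool that worker `thread` may touch while recording for `image`.
    CommandPool& pool(std::size_t thread, std::size_t image) {
        assert(thread < threads_ && image < images_);
        return pools_[image * threads_ + thread];
    }

    // Reset every worker's pool for one image, once that image's fence has signalled.
    void reset_image(std::size_t image) {
        for (std::size_t t = 0; t < threads_; ++t) pool(t, image).reset();
    }

    std::size_t size() const { return pools_.size(); }

private:
    std::size_t threads_;
    std::size_t images_;
    std::vector<CommandPool> pools_;
};
```

Because each worker only ever touches its own column, no locking is needed around the pools themselves.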

This is where I'm currently stuck in planning: ensuring the correct order of things when recorded from multiple threads. The best solution I can come up with is to use a single 'main' recording thread for low-volume things (frame setup, post-processing), and only fork out to multiple threads for high-volume things (main rendering passes with lots and lots of commands). To ensure the correct ordering, I'm planning on having each thread record in linear chunks (worker 1 gets draws 0-511, worker 2 gets 512-1023, etc.) and then submitting each thread-owned command buffer in the correct array-order when passed to vkQueueSubmit.

EDIT: Something I just thought of which is maybe obvious to people more familiar with Vulkan is that my worker recording threads should be creating Secondary command buffers, and once they're done the main recording thread can then record these secondary buffers into its main Primary command buffer, and then just submit that.
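A minimal sketch of the chunked plan combined with the secondary-buffer idea, modelled with the standard library only. `SecondaryBuffer` is a hypothetical stand-in (a list of draw ids); in real Vulkan the final loop would correspond to the main thread calling vkCmdExecuteCommands on its primary command buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for a secondary command buffer: the draw ids recorded into it.
using SecondaryBuffer = std::vector<int>;

// Splits draws [0, draw_count) into contiguous chunks, records each chunk on its
// own thread, then gathers the chunks into the "primary" in chunk order.
std::vector<int> record_frame(std::size_t draw_count, std::size_t chunk_size) {
    assert(chunk_size > 0);
    const std::size_t chunk_count = (draw_count + chunk_size - 1) / chunk_size;

    std::vector<SecondaryBuffer> secondaries(chunk_count);
    std::vector<std::thread> workers;
    workers.reserve(chunk_count);

    for (std::size_t c = 0; c < chunk_count; ++c) {
        workers.emplace_back([&secondaries, c, chunk_size, draw_count] {
            const std::size_t first = c * chunk_size;
            const std::size_t last = std::min(first + chunk_size, draw_count);
            // Each worker writes only to its own buffer, so no lock is needed.
            for (std::size_t d = first; d < last; ++d)
                secondaries[c].push_back(static_cast<int>(d));
        });
    }
    for (std::thread& w : workers) w.join();

    // Main thread: append the secondaries in chunk order, not completion order.
    std::vector<int> primary;
    for (const SecondaryBuffer& s : secondaries)
        primary.insert(primary.end(), s.begin(), s.end());
    return primary;
}
```

Calling `record_frame(1024, 512)` models the two-worker split described above; the draw order in the result does not depend on which worker finishes first.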

10 hours ago, Valakor said:

EDIT: Something I just thought of which is maybe obvious to people more familiar with Vulkan is that my worker recording threads should be creating Secondary command buffers, and once they're done the main recording thread can then record these secondary buffers into its main Primary command buffer, and then just submit that.

This could be a good idea indeed, but how would you manage the life cycle of the secondary command buffers? I was looking at Sascha Willems' multithreading sample, which uses secondary command buffers, and I think there are a few things I didn't understand well about it.

Old pipeline: threads for culling and for generating the render command buffers, and 1 thread for rendering.

Vulkan can do this directly in the graphics API:

https://github.com/SaschaWillems/Vulkan/blob/master/examples/multithreading/multithreading.cpp

https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/munich/mschott_vulkan_multi_threading.pdf

DirectX 12 here, but the same should apply. I am not an expert either and am still discovering things about that API, but what I ended up doing is as already discussed. One instance of a cache per swapchain buffer with a command pool in each cache instance. The way your swapchain rotates guarantees that the current cache you're playing with is done executing on the GPU. Command buffers are pushed / popped in a stack and I assign them the frame they were used in when popping them from the stack. The stack is therefore sorted from the most recently used to the least recently used command buffer and I know for which frame they were last used, which makes it easy to clean up things efficiently. One instance of that stack exists for every classification of command buffer. In my case I allow specifying a light, medium or large command buffer when getting one from the pool so I can approximately reuse the same size. However, what helped me the most regarding multithreading is using a more modular approach, as described in this presentation:

https://www.gdcvault.com/play/1024612/FrameGraph-Extensible-Rendering-Architecture-in
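Going back to the command buffer cache described before the link, one possible reading of it looks like the sketch below. This is a guess at the structure, not the poster's actual code: `CommandBufferCache`, `SizeClass` and the `trim` policy are assumptions, and `trim` scans the whole stack instead of relying on its ordering.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class SizeClass : std::size_t { Light = 0, Medium = 1, Large = 2 };

// Stand-in for a command buffer / command list handle.
struct CommandBuffer {
    int id = 0;
    std::uint64_t last_used_frame = 0;
};

class CommandBufferCache {
public:
    // Pop a free buffer of this size class (or create one) and stamp it with the frame.
    CommandBuffer acquire(SizeClass size, std::uint64_t frame) {
        std::vector<CommandBuffer>& stack = stacks_[static_cast<std::size_t>(size)];
        CommandBuffer cb;
        if (stack.empty()) {
            cb.id = next_id_++;
        } else {
            cb = stack.back();
            stack.pop_back();
        }
        cb.last_used_frame = frame;
        return cb;
    }

    // Push a buffer back on top of its stack once it is no longer needed.
    void release(SizeClass size, CommandBuffer cb) {
        stacks_[static_cast<std::size_t>(size)].push_back(cb);
    }

    // Drop free buffers last used before `oldest_frame_to_keep`; returns how many.
    std::size_t trim(SizeClass size, std::uint64_t oldest_frame_to_keep) {
        std::vector<CommandBuffer>& stack = stacks_[static_cast<std::size_t>(size)];
        const std::size_t before = stack.size();
        stack.erase(std::remove_if(stack.begin(), stack.end(),
                        [oldest_frame_to_keep](const CommandBuffer& cb) {
                            return cb.last_used_frame < oldest_frame_to_keep;
                        }),
                    stack.end());
        return before - stack.size();
    }

private:
    std::array<std::vector<CommandBuffer>, 3> stacks_;  // one stack per size class
    int next_id_ = 0;
};
```

Separate stacks per size class mean a buffer that once held a large pass is handed out again for large passes, which is the "approximately reuse the same size" point.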

A framegraph basically describes every operation that should be done to render a frame and is aware of every resource used to generate the frame, which helps with resource state transitions, split barriers, etc.

Every module within a framegraph is responsible for a precise task and takes specific inputs / outputs (depth pass, G-buffer pass, light pass, post-processing, downsampling module, other useful tasks...) to feed one or many command lists that it receives from the framegraph.

From the point of view of the framegraph, it becomes much easier to determine how many command lists to create and what can be tasked in parallel.
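To make the "what can be tasked in parallel" part concrete, here is a small sketch that derives dependency levels from declared inputs and outputs. It is not taken from the presentation: `Pass` and `dependency_levels` are hypothetical, and the sketch assumes passes are listed in a valid order and that each resource is written by exactly one pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A framegraph module, reduced to its name and the resources it reads and writes.
struct Pass {
    std::string name;
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

// Returns one level per pass. A pass sits one level above the deepest pass that
// produced any of its inputs; passes sharing a level do not depend on each other.
std::vector<std::size_t> dependency_levels(const std::vector<Pass>& passes) {
    std::map<std::string, std::size_t> producer_level;  // resource -> writer's level
    std::vector<std::size_t> levels;
    levels.reserve(passes.size());

    for (const Pass& pass : passes) {
        std::size_t level = 0;
        for (const std::string& resource : pass.reads) {
            const auto it = producer_level.find(resource);
            if (it != producer_level.end()) level = std::max(level, it->second + 1);
        }
        for (const std::string& resource : pass.writes) producer_level[resource] = level;
        levels.push_back(level);
    }
    return levels;
}
```

For a depth pass, a G-buffer pass reading depth, an independent shadow pass, and a light pass reading both, the depth and shadow passes share a level, so the graph can tell they are independent.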

51 minutes ago, ChuckNovice said:

One instance of a cache per swapchain buffer with a command pool in each cache instance. The way your swapchain rotates guarantees that the current cache you're playing with is done executing on the GPU

That's very nice! I'll take a look at that presentation. Thank you very much.

51 minutes ago, Laval B said:

That's very nice! I'll take a look at that presentation. Thank you very much.

Just to clear up some possible confusion. When I say that you are guaranteed that the frame is done processing, I am assuming a proper usage of the frame latency of the swapchain. A swapchain that is configured for triple buffering with a frame latency of 2 will allow you to have 2 pending Present() operations before conveniently blocking the waitable object that it gives you.

5 hours ago, ChuckNovice said:

Command buffers are pushed / popped in a stack and I assign them the frame they were used in when popping them from the stack. The stack is therefore sorted from the most recently used to the least recently used command buffer and I know for which frame they were last used, which makes it easy to clean up things efficiently.

Just to make sure I understand you correctly, the presentation doesn't occur until all command buffers have been constructed (in parallel)?

4 minutes ago, Laval B said:

Just to make sure I understand you correctly, the presentation doesn't occur until all command buffers have been constructed (in parallel)?

By presentation you mean the framegraph thing or calls to Present()?

I mean the call to Present(), the moment when you give your image back to the swapchain.

This topic is closed to new replies.
