Benefits of multithreaded renderer

ccherng · 2020-02-29T11:21:32

So I keep reading about multithreaded renderers, but as a beginner it's not clear to me what exactly the benefit is. Ultimately, don't the calls to the GPU have to be made in a specific linear order?

Started by ccherng February 25, 2020 02:15 AM

39 comments, last by NikiTo 4 years, 11 months ago

JoeJ

4,405

February 25, 2020 10:36 PM

NikiTo said:
If you want to use the copy engine at the same time as the compute engine, this is something you have to set up manually.

Yes. That's why i don't think it's the topic in question here.

NikiTo said:
To sum up, i think MT command generation makes sense for the very complex pipelines of full-size AAA games.

Either this or a hobby game with 8000 zombies ; )

NikiTo

245

February 25, 2020 10:57 PM

"By merging calculations of the rats' movement and reactions, the team was able to display more than 5,000 rats at once, and simulate an additional 5,000 behind the player. This was made possible by using multi-threading, with several processor cores working simultaneously."

https://blog.eu.playstation.com/2019/05/09/how-a-plague-tale-innocences-horrifying-rat-swarm-was-created/

ccherng

Author

165

February 26, 2020 06:36 PM

So I was doing some searching on this general topic of multithreaded renderers and stumbled upon this article, http://alinloghin.com/articles/command_buffer.html, which I don't understand.

Once the game logic and physics for one frame have been simulated, converting the new state into the appropriate GPU draw calls is the part that can be parallelized. For example, in the diagram I see g-buffer, shadow, deferred, and post-process command buffers, whatever those are. Is each of these buffers filled in by a single thread, or will multiple threads write to each buffer? If it's the latter, how is contention dealt with so that the overhead doesn't outweigh any parallelization benefit?

And more broadly, is the overhead of dealing with command buffers and then sorting them to create the final list of commands to send to the GPU outweighed by the gains from parallelization?

Secondly, as alluded to at the end of the article, Vulkan and D3D12 support multithreading. Exactly how does this give benefits? Can the command buffer mentioned above avoid having to be fully sorted, with some commands issued independently and in parallel to the GPU, and is that where the performance benefit comes from? If that is the case, then how does the GPU driver handle the synchronization of all these incoming GPU calls so that, even with the overhead of this contention, it still does appreciably better than serially making all the GPU calls on one thread?

JoeJ

4,405

February 26, 2020 07:14 PM

Quickly looking at the link i see it's about OpenGL. As this API has no native support for MT it may be about workarounds that are no longer necessary with DX12/VK.

IIRC, in VK you generate one (or more) command buffers per thread. There are also other resources, like descriptor pools, that are used per thread. (But i really forgot everything i once knew about VK already.)
So each thread has all it needs to prepare work for the GPU without any need for synchronization.

After all command buffers are done, the render thread can submit them to the GPU in the required order. Overall this will finish faster than doing all the CB generation work on one thread.

NikiTo

245

February 26, 2020 07:49 PM

Every command list runs some validation before it is finalized. If your commands are bad, it will fail to close.

It is easy for the API to make sure command lists are not mixed in a wrong way. For example, the list will tell the GPU to load the Root Signature and then tell it to dispatch the shader that uses that root signature. Think of it - It is easy to manage.

The API will mix commands at the level of RootSignature+PSO+dispatch. This union will not be broken.

Thread0: RootSignature+PSO+dispatch
Thread1: RootSignature+PSO+dispatch
App1: RootSignature+PSO+dispatch
App0: RootSignature+PSO+dispatch
OS: RootSignature+PSO+dispatch
App1: RootSignature+PSO+dispatch

Notice, the OS constantly introduces its own dispatches into the serial stream of dispatches. The GPU doesn't complain.

The memory resources are owned by the app, and no app can look into the VRAM of another app. When the app dies, its VRAM resources are released.

You can have two games running in two windows and both will be working at the same time. No problem.

MJP

20,297

February 27, 2020 07:33 AM

ccherng said:
Secondly, as alluded to at the end of the article, Vulkan and D3D12 support multithreading. Exactly how does this give benefits? Can the command buffer mentioned above avoid having to be fully sorted, with some commands issued independently and in parallel to the GPU, and is that where the performance benefit comes from? If that is the case, then how does the GPU driver handle the synchronization of all these incoming GPU calls so that, even with the overhead of this contention, it still does appreciably better than serially making all the GPU calls on one thread?

Let's say you need to issue 10,000 draw calls to render your frame, and each draw call takes roughly 1 microsecond. Doing all of those on a single thread means that it will take about 10 milliseconds to issue all of those draw calls, which would be more than half of a 16.6ms (60 Hz) frame. Without API support for multithreading (which is what you're stuck with in D3D11 and OpenGL) you can possibly do other work simultaneously on other cores while issuing all of those draw calls, but it's always going to take 10ms from when you start issuing draw calls until when you finish. This means it's impossible for you to run at full framerate on a 144 Hz monitor for instance, since you would need to get below 6.94ms for that to happen. Or alternatively if you increased to 20,000 draw calls you wouldn't hit 60Hz, since now you need 20ms to issue those draw calls.

With D3D12/Vulkan you can actually spread the work of issuing those draw calls over multiple threads. This means that if you have 4 cores completely available to you when it comes time to issue draw calls, you could crunch through those draws in only 2.5ms (at least in an idealized world with perfect parallelism and no issues from cache/memory contention or downclocking). Therefore you actually have a chance of hitting that 144 Hz target, assuming you can do the rest of the frame's work in about 4.4ms. This also lets you potentially achieve lower latency than other techniques that can be used for achieving parallelism, in particular the "render thread" approach where the rendering thread issues draw calls a frame behind the gameplay code.
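The arithmetic in the two paragraphs above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 1 microsecond per draw is the post's assumed figure, the function names are made up for illustration, and the even split across threads is the idealized case with no contention.

```python
# Back-of-the-envelope timing for issuing draw calls.
# COST_PER_DRAW_US is the ~1 microsecond per draw assumed in the post.
COST_PER_DRAW_US = 1.0


def submit_time_ms(draw_calls: int, threads: int = 1) -> float:
    """Idealized CPU time to issue all draws, split evenly across threads."""
    return draw_calls * COST_PER_DRAW_US / threads / 1000.0


def frame_budget_ms(refresh_hz: float) -> float:
    """Time available per frame at a given refresh rate."""
    return 1000.0 / refresh_hz


single = submit_time_ms(10_000)            # 10.0 ms on one thread
quad = submit_time_ms(10_000, threads=4)   # 2.5 ms on four threads
budget_144 = frame_budget_ms(144)          # roughly 6.94 ms
remaining = budget_144 - quad              # roughly 4.44 ms for everything else
```

On one thread the 10ms of draw submission alone already exceeds the 144 Hz budget, while the four-way split leaves a little under 4.5ms for the rest of the frame.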

GPUs consume draws and other commands in large batches of commands encoded in a chunk of memory called a command buffer. So the CPU doesn't really feed the GPU 1 draw at a time, instead the CPU batches up hundreds or thousands of commands into a buffer that's then submitted to the GPU at a later time. In D3D11 and OpenGL this is all hidden from you, but in D3D12 and Vulkan it's something you explicitly handle yourself. The way multithreading typically works with those APIs is that each command buffer can only be written by 1 thread at once. So you might break up your frame into say a dozen or so command buffers, and for each one kick off a task on a multithreaded job scheduler that fills in that command buffer with draws and other commands. When all of those tasks complete you can then submit those command buffers to the GPU in a single function call. So there's not really much contention to worry about in this case: the GPU ultimately just executes a serial list of commands. Where things get more complicated is when you submit command buffers to multiple hardware queues, but that's a much more advanced topic.
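The structure described above (one command buffer per task, no sharing, one ordered submit at the end) can be sketched in Python. This is not D3D12 or Vulkan code: `CommandBuffer`, `record`, `build_frame` and `submit` are hypothetical stand-ins, and because of Python's global interpreter lock the sketch shows only the shape of the approach, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


class CommandBuffer:
    """Hypothetical stand-in for an API command buffer: a plain list of commands."""

    def __init__(self, name):
        self.name = name
        self.commands = []

    def draw(self, object_id):
        self.commands.append(("draw", object_id))


def record(name, object_ids):
    # Each task creates and owns its buffer exclusively, so recording needs no locks.
    cb = CommandBuffer(name)
    for object_id in object_ids:
        cb.draw(object_id)
    return cb


def build_frame(passes, workers=4):
    # passes: list of (pass_name, object_ids) in the order the GPU must run them.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(record, name, ids) for name, ids in passes]
        # Collect results in submission order, not completion order.
        return [f.result() for f in futures]


def submit(command_buffers):
    # Stand-in for the single submit call: the GPU sees one serial stream.
    return [cmd for cb in command_buffers for cmd in cb.commands]


passes = [("shadow", range(0, 3)), ("gbuffer", range(3, 6)), ("post", range(6, 7))]
stream = submit(build_frame(passes))
```

The recording tasks may finish in any order, but the final stream always follows the order of `passes`, which is why the GPU side needs no extra synchronization in this simple single-queue case.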

ccherng

Author

165

February 27, 2020 09:38 AM

@MJP But I don't understand why one would be making 10,000 draw calls. In OpenGL, aren't you supposed to batch your vertices together and draw them with a single glDrawArrays call?

MJP

20,297

February 27, 2020 10:00 PM

You would do it if you have 10,000 things to draw separately.

There are many tradeoffs involved in deciding how to batch/split up draw calls. Batching reduces your draw call count, but will limit your ability to switch shaders or make other state changes. So you may end up trading off CPU and GPU performance when deciding how much to batch things. Batching can also have wide-reaching effects on how you author and process your content, and also how your engine handles that content at runtime.
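A toy illustration of why state changes limit batching, in Python. The assumption here (mine, not the post's) is that consecutive draws can only be merged when they share the same shader and texture; the tuple layout, the state names and `count_batches` are all hypothetical.

```python
from itertools import groupby


def count_batches(draws):
    """Count draw calls needed if consecutive draws with identical state are merged.

    draws: list of (shader, texture, mesh) tuples.
    """
    # Group neighbors by their (shader, texture) state; each group is one batch.
    return sum(1 for _ in groupby(draws, key=lambda d: d[:2]))


draws = [
    ("lit", "brick", "wall_a"),
    ("glow", "neon", "sign"),
    ("lit", "brick", "wall_b"),
    ("lit", "wood", "crate"),
]
unsorted_batches = count_batches(draws)        # 4: every neighbor differs in state
sorted_batches = count_batches(sorted(draws))  # 3: the two brick walls merge
```

Sorting by state helps only as far as objects actually share state: with four distinct materials there would still be four draws, however they are ordered.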

A lot of the older advice for batching up draws came about because draw calls were expensive from a CPU point of view, and couldn't be multithreaded. DX12 and Vulkan can help with both of those problems, which gives you more options in choosing how to do things. You also have to keep in mind that in the earlier days of GPUs it could also be quite expensive from a GPU point of view to have state changes (like changing textures) since that would cause GPU sync points which reduce utilization. These days GPUs are much better at handling that, and often don't suffer a penalty.

Either way it's up to you to reason what your game(s) are going to need from an engine, and use that information to plan out your tech. If you don't think you'll need lots of draw calls then you probably don't need to spend all of the effort required to multithread them!

ccherng

Author

165

February 28, 2020 02:33 AM

@MJP So is it the case that the older advice to batch up draws came about because the overhead of invoking functions added up to something significant? You implied that if you wanted to draw all 10,000 objects with slightly different shader settings, that would be impossible with a single batched call. Does this mean there is no way to make a single call that says: here are the vertex settings for my 10,000 objects, here are the shader settings, and here is everything else, all supplied in one big array of information? Instead you have to make multiple calls and pay the overhead of all those function calls? Is that something inherent in how things have to be engineered? Why can't it be designed to minimize overhead, analogous to how it's well known that you want to avoid unnecessary system calls into the OS kernel because the overhead of a context switch into the kernel is enormous?

NikiTo

245

February 28, 2020 03:06 AM

@ccherng The DX12/Vulkan APIs are very well engineered.

And you are right to want to pile all the similar stuff in thematic arrays and call a single draw per array.

But imagine you are rendering a 3D VR chat room. Last time i watched chatrooms on YT, every single user had different shaders on their character. Not only different parameters for the standard materials, but different shaders for their materials. A lot a lot a lot of options. All kinds of materials and even physics/animation effects that vary from one character to another. A lot of non-standard shaders. Many of them on many characters. Lots of diversity.

Maybe you will not need 10000 draw calls, but a 3D VR chat room is the best example i can come up with so far of a case where you cannot avoid having a ton of draws.

This topic is closed to new replies.
