2 hours ago, Hodgman said:
On 10.7.2017 at 5:41 PM, JoeJ said:
Personally i think the concept of queues is much too high level and totally sucks. It would be great if we could manage unique CUs much more low level. The hardware can do it but we have no access - VK/DX12 is just a start...
Are you sure about that? AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.
I would have nothing against the queue concept, if only it worked.
You can look at my test project that I submitted to AMD: https://github.com/JoeJGit/OpenCL_Fiji_Bug_Report/blob/master/async_test_project.rar
...if you are bored, but here is what i found:
You can run three small tasks without synchronization perfectly in parallel, yeah - awesome.
As soon as you add sync, which is only possible by using semaphores, the advantage gets lost due to bubbles. (Maybe semaphores sync with the CPU as well? If so, we have a terrible situation here - we need GPU-only sync between queues.)
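For what it's worth, VkSemaphore waits and signals are supposed to resolve entirely on the GPU - vkQueueSubmit returns immediately and the CPU is only involved if you use fences or vkQueueWaitIdle. A minimal sketch of a cross-queue dependency (the handles `device`, `queueA`, `queueB`, `cmdA`, `cmdB` are assumed to already exist; this is how I understand the API is meant to be used, not a claim about what the driver actually does with it):

```c
/* Sketch: GPU-side dependency between two queues via a binary VkSemaphore.
 * device, queueA, queueB, cmdA, cmdB are assumed to be created elsewhere. */
VkSemaphore sem;
VkSemaphoreCreateInfo semInfo = { .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO };
vkCreateSemaphore(device, &semInfo, NULL, &sem);

/* First submission signals the semaphore when its commands finish. */
VkSubmitInfo submitA = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .commandBufferCount = 1,
    .pCommandBuffers = &cmdA,
    .signalSemaphoreCount = 1,
    .pSignalSemaphores = &sem,
};
vkQueueSubmit(queueA, 1, &submitA, VK_NULL_HANDLE);

/* Second submission waits on it at the compute stage. In theory the
 * wait is resolved by the GPU's scheduler, with no CPU round trip. */
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
VkSubmitInfo submitB = {
    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &sem,
    .pWaitDstStageMask = &waitStage,
    .commandBufferCount = 1,
    .pCommandBuffers = &cmdB,
};
vkQueueSubmit(queueB, 1, &submitB, VK_NULL_HANDLE);
```

If even this pattern produces bubbles, the stalls would be coming from the hardware scheduler or the driver, not from a CPU round trip in the API.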
And here comes the best part: try larger workloads, e.g. three tasks with runtimes of 0.2 ms, 1 ms, and 1 ms without async. Going async, the first and second task run in parallel as expected, although 1 ms becomes 2 ms, so there is no win. But the third task rises to 2 ms as well, even though it runs alone with nothing else in flight - its runtime is doubled for nothing.
It seems there is no dynamic load balancing happening here - it looks like the GPU gets partitioned somehow and refuses to merge back when it could.
2 hours ago, Hodgman said:
AFAIK the queues are an abstraction of the GPU's command engine, which receives draws/dispatches and hands them over to an internal fixed function scheduler.
Guess not - the numbers don't match. A Fiji has 8 ACEs (if that's the correct name), but I only see 4 compute queues (1 gfx/CS + 3 CS). Nobody knows what happens under the hood, but it needs more work, at least on the driver side.
Access to unique CUs should not be necessary, you're right guys. But I would be willing to tackle it if it brought an improvement.
There are two situations where async compute makes sense:
1. Doing compute while doing ALU-light rendering work. (Not yet tried - all my hope goes into this, but not everyone has rendering work.)
2. Parallelizing and synchronizing small compute tasks - extremely important if we look towards more complex algorithms that reduce work instead of brute-forcing everything. And sadly this currently fails.