
Question concerning internal queue organisation

Started July 10, 2017 08:05 AM
24 comments, last by JoeJ 7 years, 6 months ago

The following quote is from the spec:

Quote

Command buffer boundaries, both between primary command buffers of the same or different batches or submissions as well as between primary and secondary command buffers, do not introduce any additional ordering constraints. In other words, submitting the set of command buffers (which can include executing secondary command buffers) between any semaphore or fence operations execute the recorded commands as if they had all been recorded into a single primary command buffer, except that the current state is reset on each boundary. Explicit ordering constraints can be expressed with explicit synchronization primitives.

I read this as meaning that for a single queue, command buffer boundaries don't really matter, at least where pipeline barriers are concerned.  A pipeline barrier in one command buffer will halt execution not only of commands within that command buffer, but also of any subsequent commands in subsequent command buffers (of course only for those stages the barrier applies to), even if those subsequent command buffers are independent and could be executed simultaneously.  So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.
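
To make the claim concrete, here is a minimal sketch (all handles are hypothetical, not from any post in this thread; pipeline/descriptor binding is omitted for brevity):

    // Hypothetical sketch: record a dispatch plus a compute-to-compute barrier
    // into cbA, an unrelated dispatch into cbB, then submit both to one queue.
    // Per the quoted spec text this behaves as if both were recorded into a
    // single primary command buffer, so the barrier in cbA also orders the
    // dispatch in cbB behind it.
    void BarrierCrossesBufferBoundary (VkCommandBuffer cbA, VkCommandBuffer cbB, VkQueue queue)
    {
        VkMemoryBarrier barrier = {};
        barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
        barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
        barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

        // cbA and cbB are assumed to already be in the recording state.
        vkCmdDispatch(cbA, 64, 1, 1);
        vkCmdPipelineBarrier(cbA,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // producer stage
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // consumer stage
            0, 1, &barrier, 0, NULL, 0, NULL);
        vkEndCommandBuffer(cbA);

        vkCmdDispatch(cbB, 64, 1, 1);               // independent work
        vkEndCommandBuffer(cbB);

        VkCommandBuffer cbs[2] = { cbA, cbB };
        VkSubmitInfo submit = {};
        submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submit.commandBufferCount = 2;
        submit.pCommandBuffers = cbs;
        vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
    }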

That said, I feel this was an error/oversight IMHO.  It seems clear (at least to me : ) that semaphores are the go-to primitive for synchronization between separate command buffers, and hence it would make sense to me that pipeline barriers operate only within a particular command buffer.  This way independent command buffers on the same queue could be fully independent, and there would be no need for other queues.  Alas, this is not the case.  Perhaps they had their reasons for not doing that; it's often hard to read between the lines and understand the 'why' from a specification.  I guess that's why queues feel like such a mess to me.  Vulkan has multiple ways of doing essentially the same thing; it feels like the spec is stepping on its own toes.  But perhaps it's just my OCD wanting a more orthonormal API.

I tried to verify with my test...

I have 3 shaders:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With a memory barrier after each dispatch I get 0.462 ms

Without: 0.012 ms (a 38.5x speedup)

To verify, I use 1 dispatch of 5500 wavefronts (the same total work): 0.013 ms


So yes: not only is the GPU capable of doing async compute perfectly with a single queue, we also see the API overhead of multiple dispatches is practically zero :)

Finally I understand why memory barriers appeared so expensive to me. Shame on me, and all disappointment gone :D

7 hours ago, Ryan_001 said:

So if the work is truly independent, I could see there being a small potential performance increase when using multiple queues.

If the work is truly independent, you won't have any barriers and could use one queue just fine.

1 hour ago, Hodgman said:

If the work is truly independent, you won't have any barriers and could use one queue just fine.

Yes, using one queue is faster even if there are no memory barriers / semaphores. Submitting a command buffer has a noticeable cost, so putting all work in one command buffer and one queue is the fastest way.
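
As a sketch of that point (hypothetical handles, not from my test code): if work really must stay in separate command buffers, batching them into a single vkQueueSubmit call at least amortizes the per-submit cost.

    // One vkQueueSubmit call carrying three command buffers instead of three calls.
    void SubmitBatched (VkQueue queue, VkCommandBuffer cb0, VkCommandBuffer cb1, VkCommandBuffer cb2)
    {
        VkCommandBuffer cbs[3] = { cb0, cb1, cb2 };
        VkSubmitInfo submit = {};
        submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submit.commandBufferCount = 3;
        submit.pCommandBuffers = cbs;
        vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE); // one submit, three buffers
    }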

I also tested using queues from different families vs. all from one family, which had no effect on performance.

All tests used compute shaders only.


Now I don't see a need to use multiple queues other than for uploading/downloading data. Maybe using 2 queues makes sense if we want to do compute and graphics, but I guess 1 queue is better here too.


Edit: Maybe using multiple queues results in dividing work strictly between CUs, while using one queue can distribute multiple dispatches on the same CU. If so, maybe we could avoid some cache thrashing by grouping work with similar memory access together. But I guess cases where this wins would be extremely rare.


3 hours ago, Hodgman said:

If the work is truly independent, you won't have any barriers and could use one queue just fine.

With all due respect, not necessarily.  Assuming I'm reading the spec correctly...

Imagine you had 3 units of work, each in their own command buffer, and each fully independent from the other 2.  Within each command buffer there is still a pipeline barrier, because while each command buffer is completely independent from the others, there are dependencies among the commands of each individual command buffer.  You could submit these 3 command buffers to 3 different queues and they could run (theoretically) asynchronously/in any order.

Now if pipeline barriers were restricted to a given command buffer, then submitting these 3 command buffers to a single queue would also yield asynchronous performance.  But as it stands, submitting these 3 command buffers to a single queue will cause stalls/bubbles, because pipeline barriers work across command buffer boundaries.  The pipeline barrier in command buffer 1 will cause not only commands in buffer 1 to wait, but also commands in buffers 2 and 3, even though those commands are independent and need not wait on the pipeline barrier.
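
In code, the scenario would look roughly like this (a hypothetical sketch; the queues would be obtained via vkGetDeviceQueue beforehand):

    // Three independent command buffers, each with internal barriers, spread
    // over three queues so a barrier inside one cannot stall the other two.
    void SubmitAcrossQueues (VkQueue queues[3], VkCommandBuffer cbs[3])
    {
        for (int i = 0; i < 3; i++)
        {
            VkSubmitInfo submit = {};
            submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
            submit.commandBufferCount = 1;
            submit.pCommandBuffers = &cbs[i];
            vkQueueSubmit(queues[i], 1, &submit, VK_NULL_HANDLE);
        }
    }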

This change would also give a bit of purpose to secondary command buffers, for which (at this time) I see little use.

Now I just need to convince the Vulkan committee of what a great idea retroactively changing the spec is, and that breaking everyone's code is no big deal.  /sarcasm

6 minutes ago, Ryan_001 said:

Imagine you had 3 units of work, each in their own command buffer, and each fully independent from the other 2.  Within each command buffer there is still a pipeline barrier, because while each command buffer is completely independent from the others, there are dependencies among the commands of each individual command buffer.

Ah ok, I didn't get the last assumption -- that the big-picture work of each buffer is independent, but each contains internal dependencies.

Yes, in theory multiple queues could be used to allow commands from another queue to be serviced while one queue is blocked doing some kind of barrier work. In practice on current hardware I don't know if this makes any difference though -- the "barrier work" will usually be made up of a command along the lines of "halt the front-end from processing any command from any queue until all write-through traffic has actually been flushed from the L2 cache to RAM"... In the future there may be a use for this though.

I don't know if the Vulkan spec allows for it, but another use of multiple queues is prioritization. If a background app is using the GPU at the same time as a game, it would be wise for the driver to set the game's queues as high priority and the background app's queues as low priority. Likewise, if your gameplay code itself uses GPU compute, you could issue its commands via a "highest/realtime" priority queue which is configured to interrupt any graphics work and do the compute work immediately -- which would allow you to perform GPGPU calculations without the typical one-frame delay. Again, I don't know if this is possible (yet) on PCs either.

12 minutes ago, Ryan_001 said:

This change would also give a bit of purpose to secondary command buffers, for which (at this time) I see little use.


AFAIK, they're similar to "bundles" in D3D12 or display lists in GL, which are meant for saving on the CPU cost of repeatedly re-recording draw commands for a particular model every frame, and instead re-using a micro command buffer over many frames.


Well, barriers in the spec are a little more fine-grained; you can pick the actual pipeline stages to halt on.  For example, if you wrote to a buffer from the fragment shader and then read it from the vertex shader, you would put a pipeline barrier which would halt all subsequent vertex shader work (and later stages) from executing prior to the fragment shader completing.  But I have the feeling you were talking about what the hardware actually does?  In which case you are probably right; I have no idea how fine-grained the hardware really is.
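
As a sketch, that fragment-write then vertex-read case would look something like this ('buf' is a hypothetical VkBuffer; the access masks are my guess at reasonable values):

    // Barrier for a buffer written by the fragment shader and read by the
    // vertex shader of subsequent draws.
    void FragmentToVertexBarrier (VkCommandBuffer cb, VkBuffer buf)
    {
        VkBufferMemoryBarrier b = {};
        b.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
        b.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;   // fragment shader wrote the buffer
        b.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;    // vertex shader will read it
        b.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        b.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        b.buffer = buf;
        b.offset = 0;
        b.size = VK_WHOLE_SIZE;

        vkCmdPipelineBarrier(cb,
            VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,  // wait for fragment shader writes
            VK_PIPELINE_STAGE_VERTEX_SHADER_BIT,    // before later vertex shader reads
            0, 0, NULL, 1, &b, 0, NULL);
    }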

The spec does support queue priority, sort of:

Quote

4.3.4. Queue Priority

Each queue is assigned a priority, as set in the VkDeviceQueueCreateInfo structures when creating the device. The priority of each queue is a normalized floating point value between 0.0 and 1.0, which is then translated to a discrete priority level by the implementation. Higher values indicate a higher priority, with 0.0 being the lowest priority and 1.0 being the highest.

Within the same device, queues with higher priority may be allotted more processing time than queues with lower priority. The implementation makes no guarantees with regards to ordering or scheduling among queues with the same priority, other than the constraints defined by any explicit synchronization primitives. The implementation makes no guarantees with regards to queues across different devices.

An implementation may allow a higher-priority queue to starve a lower-priority queue on the same VkDevice until the higher-priority queue has no further commands to execute. The relationship of queue priorities must not cause queues on one VkDevice to starve queues on another VkDevice.

No specific guarantees are made about higher priority queues receiving more processing time or better quality of service than lower priority queues.

As I read it, this doesn't allow one app to queue itself higher than another, and only affects queues created on a single VkDevice.  Now whether any hardware actually does this... you would know better than I, I imagine.
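
For reference, the mechanism the quote describes is just the pQueuePriorities array at device creation; a sketch ('familyIndex' is hypothetical, chosen earlier):

    // Request two queues from one family with different normalized priorities.
    float priorities[2] = { 1.0f, 0.5f };  // e.g. high for graphics, lower for background compute
    VkDeviceQueueCreateInfo queueInfo = {};
    queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfo.queueFamilyIndex = familyIndex;
    queueInfo.queueCount = 2;
    queueInfo.pQueuePriorities = priorities;
    // queueInfo then goes into VkDeviceCreateInfo::pQueueCreateInfos; the
    // implementation maps the floats to discrete priority levels.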

As far as secondary command buffers go, I've seen that suggested.  I don't disagree; it's just that I don't see that being faster than just recording a bunch of primary command buffers in most circumstances.  The only 2 situations I could come up with were:

1) The small command buffers are all within the same render pass, in which case you would need secondary command buffers (see the sketch after this list).

2) You have way too many (thousands? millions?) small primary command buffers, which might cause some performance issues on submit, so recording them as secondary and using another thread to bundle them into a single primary might make the submit faster.
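
For case 1, a rough sketch of the recording pattern, assuming hypothetical handles (renderPass, framebuffer, primaryCb, secondaryCb):

    // A secondary command buffer recorded for use inside a render pass,
    // then executed from the primary.
    void RecordAndExecuteSecondary (VkCommandBuffer primaryCb, VkCommandBuffer secondaryCb,
        VkRenderPass renderPass, VkFramebuffer framebuffer)
    {
        VkCommandBufferInheritanceInfo inherit = {};
        inherit.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
        inherit.renderPass = renderPass;
        inherit.subpass = 0;
        inherit.framebuffer = framebuffer;

        VkCommandBufferBeginInfo begin = {};
        begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        begin.flags = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;
        begin.pInheritanceInfo = &inherit;
        vkBeginCommandBuffer(secondaryCb, &begin);
        // ... record draws ...
        vkEndCommandBuffer(secondaryCb);

        // The primary's render pass must have been begun with
        // VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS for this to be valid:
        vkCmdExecuteCommands(primaryCb, 1, &secondaryCb);
    }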

Some interesting points. I made this test now:

1: 10 dispatches of 50 wavefronts

2: 50 dispatches of 50 wavefronts

3: 50 dispatches of 50 wavefronts

With a memory barrier after each dispatch and 1 queue: 0.46 ms

With a memory barrier after each dispatch and 3 queues, one per shader: 0.21 ms


So we can use multiple queues to keep working while another queue is stalled.

I'll modify my test to see if I could still use one queue for the same purpose, by setting the memory ranges within the same buffer per shader, or by using multiple buffers per shader...

EDIT1:

...but first I tried making the first shader do 5 times more work than shaders 2 & 3. Before, all shaders did the same calculations, so I couldn't be sure a barrier on queue 1 does not stall queue 0 as well, because the barriers happen at the same time. Now I see shader 1 still completes first and is slightly faster than the other two, so it is not affected by their barriers :)

Runtime with 3 queues: 0.18 ms; with 1 queue: 0.44 ms (not the first time I've seen that doing more work is faster on small loads).
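
For the planned ranged-barrier variant mentioned above, only the offset/size fields would change, something like this ('sharedBuffer' and the range are hypothetical; whether the driver actually narrows the stall to the range is implementation-defined):

    // Barrier restricted to one shader's region of a shared buffer.
    void RangedComputeBarrier (VkCommandBuffer cb, VkBuffer sharedBuffer)
    {
        VkBufferMemoryBarrier ranged = {};
        ranged.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
        ranged.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
        ranged.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
        ranged.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        ranged.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        ranged.buffer = sharedBuffer;
        ranged.offset = 0;          // start of this shader's region
        ranged.size = 64 * 1024;    // bytes touched by this shader only

        vkCmdPipelineBarrier(cb,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            0, 0, NULL, 1, &ranged, 0, NULL);
    }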


23 minutes ago, JoeJ said:

With a memory barrier after each dispatch and 1 queue: 0.46 ms

With a memory barrier after each dispatch and 3 queues, one per shader: 0.21 ms

So we can use multiple queues to keep working while another queue is stalled.

It's good to know that theory and practice align, at least for this : ) Nice work.  I'm curious, what sort of barrier parameters are you using?

12 minutes ago, Ryan_001 said:

It's good to know that theory and practice align, at least for this : ) Nice work.  I'm curious, what sort of barrier parameters are you using?

BufferMemoryBarriers; here's the code.

I leave the comments in to illustrate how much the spec leaves us to poor and uncertain trial and error - or would you have guessed you need to set VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT for an indirect compute dispatch? :)

(Of course I could remove this here, as I'm only writing some prefix sum results and no dispatch count, but offset and size become interesting now...)



    // Uses <vulkan/vulkan.h> and <cassert>; buffers[], profiler, and the
    // pipeline/descriptor arrays are members of the surrounding class.
    void MemoryBarriers (VkCommandBuffer commandBuffer, int *bufferList, const int numBarriers)
    {
        int const maxBarriers = 16;
        assert (numBarriers <= maxBarriers);

        VkBufferMemoryBarrier bufferMemoryBarriers[maxBarriers] = {};
        //VkMemoryBarrier memoryBarriers[maxBarriers] = {};

        for (int i=0; i<numBarriers; i++)
        {
            bufferMemoryBarriers[i].sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
            //bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT;
            bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT;
            bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            //bufferMemoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_WRITE_BIT | VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
            bufferMemoryBarriers[i].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
            bufferMemoryBarriers[i].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
            bufferMemoryBarriers[i].buffer = buffers[bufferList[i]].deviceBuffer;
            bufferMemoryBarriers[i].offset = 0;
            bufferMemoryBarriers[i].size = VK_WHOLE_SIZE;

            //memoryBarriers[i].sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
            //memoryBarriers[i].srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;// | VK_ACCESS_SHADER_WRITE_BIT;
            //memoryBarriers[i].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;// | VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
        }

        vkCmdPipelineBarrier(
            commandBuffer,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
            0,//VkDependencyFlags
            0, NULL,//numBarriers, memoryBarriers,//
            numBarriers, bufferMemoryBarriers,
            0, NULL);
    }

    void Record (VkCommandBuffer commandBuffer, const uint32_t taskFlags,
        int profilerStartID, int profilerStopID, bool profilePerTask = true, bool use_barriers = true)
    {
        VkCommandBufferBeginInfo commandBufferBeginInfo = {};
        commandBufferBeginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        commandBufferBeginInfo.flags = 0;//VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;

        vkBeginCommandBuffer(commandBuffer, &commandBufferBeginInfo);

#ifdef USE_GPU_PROFILER
        if (profilerStartID>=0) profiler.Start (profilerStartID, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
#endif

        if (taskFlags & (1<<tTEST0))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST0], 0, 1, &descriptorSets[tTEST0], 0, nullptr);
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST0]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST0, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST0};
            for (int i=0; i<TASK_COUNT_0; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (0 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST0, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

        if (taskFlags & (1<<tTEST1))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST1], 0, 1, &descriptorSets[tTEST1], 0, nullptr);
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST1]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST1, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST1};
            for (int i=0; i<TASK_COUNT_1; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (200 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST1, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

        if (taskFlags & (1<<tTEST2))
        {
            vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelineLayouts[tTEST2], 0, 1, &descriptorSets[tTEST2], 0, nullptr);
            vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipelines[taskToPipeline[tTEST2]]);
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Start (TS_TEST2, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
            int barrierBuffers[] = {bTEST2};
            for (int i=0; i<TASK_COUNT_2; i++)
            {
                vkCmdDispatchIndirect(commandBuffer, buffers[bDISPATCH].deviceBuffer, sizeof(VkDispatchIndirectCommand) * (400 + i) );
                if (use_barriers) MemoryBarriers (commandBuffer, barrierBuffers, 1);
            }
    #ifdef PROFILE_TASKS
            if (profilePerTask) profiler.Stop (TS_TEST2, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
    #endif
        }

#ifdef USE_GPU_PROFILER
        if (profilerStopID>=0) profiler.Stop (profilerStopID, commandBuffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
#endif

        vkEndCommandBuffer(commandBuffer);
    }

