
Question concerning internal queue organisation

Started by July 10, 2017 08:05 AM
24 comments, last by JoeJ 7 years, 7 months ago

OK, so finally, and as expected, it makes no difference between these options:

Use barriers for unique buffers per task.

Use barriers for non-overlapping memory regions per task, but the same buffer for all.

 

In both cases the driver could figure out that it can still run the work async on a single queue, but it does not - just like the spec says.

I hope I've set up everything correctly (I'm still unsure about the difference between VK_ACCESS_MEMORY_WRITE_BIT and VK_ACCESS_SHADER_WRITE_BIT, but it did not matter here).

So the conclusion is:

We have to use multiple queues to keep the GPU busy across pipeline barriers.

We should reduce sync between queues to a minimum.

 

A bit more challenging than initially thought, and I hope two saturating tasks in two queues don't slow each other down too much. If they do, we need more sync to prevent it, and it becomes a hardware-dependent balancing act. But I'm optimistic, and it all makes sense now.
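To make "reduce sync between queues to a minimum" concrete, here is a rough, untested sketch (queue/semaphore/command-buffer handles are placeholders I made up): each queue gets its own independent submission, and only the point where one task's results are actually consumed gets a semaphore.

```cpp
// Task A runs on queueA with no cross-queue dependency at all:
VkSubmitInfo submitA = {};
submitA.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitA.commandBufferCount   = 1;
submitA.pCommandBuffers      = &cmdA;
submitA.signalSemaphoreCount = 1;
submitA.pSignalSemaphores    = &semA;   // signals once A's results exist
vkQueueSubmit(queueA, 1, &submitA, VK_NULL_HANDLE);

// Task B runs on queueB; the wait on semA only blocks the stage that
// actually consumes A's output, so everything before it can overlap:
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
VkSubmitInfo submitB = {};
submitB.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitB.waitSemaphoreCount = 1;
submitB.pWaitSemaphores    = &semA;
submitB.pWaitDstStageMask  = &waitStage;
submitB.commandBufferCount = 1;
submitB.pCommandBuffers    = &cmdB;
vkQueueSubmit(queueB, 1, &submitB, VK_NULL_HANDLE);
```

The idea being that the semaphore replaces a heavy pipeline barrier: each queue can drain its own work freely, and only the single producer/consumer edge is synchronized.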

 

Interesting, I don't know if you need VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT there.

Quote
  • VK_ACCESS_MEMORY_READ_BIT specifies read access via non-specific entities. These entities include the Vulkan device and host, but may also include entities external to the Vulkan device or otherwise not part of the core Vulkan pipeline. When included in a destination access mask, makes all available writes visible to all future read accesses on entities known to the Vulkan device.

  • VK_ACCESS_MEMORY_WRITE_BIT specifies write access via non-specific entities. These entities include the Vulkan device and host, but may also include entities external to the Vulkan device or otherwise not part of the core Vulkan pipeline. When included in a source access mask, all writes that are performed by entities known to the Vulkan device are made available. When included in a destination access mask, makes all available writes visible to all future write accesses on entities known to the Vulkan device.

I read that as meaning the memory read/write bits cover things outside the normal Vulkan scope, like the presentation/windowing system.  The demos/examples I looked at also never included those bits.  I agree with you completely that the spec leaves a lot of things ambiguously defined.  What surprised me a bit was that image layout transitions are considered both a read and a write, so you have to include access/stage masks for the hidden read/write that occurs during transitions.

This thread has helped clarify a lot of these things.

I wrote my own pipeline barrier wrapper, which I found made a lot more sense (apart from not really understanding what VK_ACCESS_MEMORY_READ_BIT and VK_ACCESS_MEMORY_WRITE_BIT mean).  The whole thing isn't important but you might find the flag enumeration interesting.


enum class MemoryDependencyFlags : uint64_t {
	none                         = 0,

	indirect_read                = (1ull << 0),   // VK_ACCESS_INDIRECT_COMMAND_READ_BIT + VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
	index_read                   = (1ull << 1),   // VK_ACCESS_INDEX_READ_BIT + VK_PIPELINE_STAGE_VERTEX_INPUT_BIT
	attribute_vertex_read        = (1ull << 2),   // VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT + VK_PIPELINE_STAGE_VERTEX_INPUT_BIT

	uniform_vertex_read          = (1ull << 3),   // VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	uniform_tess_control_read    = (1ull << 4),   // VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	uniform_tess_eval_read       = (1ull << 5),   // VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	uniform_geometry_read        = (1ull << 6),   // VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	uniform_fragment_read        = (1ull << 7),   // VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	uniform_compute_read         = (1ull << 8),   // VK_ACCESS_UNIFORM_READ_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT

	shader_vertex_read           = (1ull << 9),   // VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	shader_vertex_write          = (1ull << 10),  // VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_VERTEX_SHADER_BIT
	shader_tess_control_read     = (1ull << 11),  // VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	shader_tess_control_write    = (1ull << 12),  // VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT
	shader_tess_eval_read        = (1ull << 13),  // VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	shader_tess_eval_write       = (1ull << 14),  // VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT
	shader_geometry_read         = (1ull << 15),  // VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	shader_geometry_write        = (1ull << 16),  // VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT
	shader_fragment_read         = (1ull << 17),  // VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	shader_fragment_write        = (1ull << 18),  // VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	shader_compute_read          = (1ull << 19),  // VK_ACCESS_SHADER_READ_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
	shader_compute_write         = (1ull << 20),  // VK_ACCESS_SHADER_WRITE_BIT + VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT

	attachment_fragment_read     = (1ull << 21),  // VK_ACCESS_INPUT_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
	attachment_color_read        = (1ull << 22),  // VK_ACCESS_COLOR_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
	attachment_color_write       = (1ull << 23),  // VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
	attachment_depth_read_early  = (1ull << 24),  // VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT
	attachment_depth_read_late   = (1ull << 25),  // VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT + VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT
	attachment_depth_write_early = (1ull << 26),  // VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT
	attachment_depth_write_late  = (1ull << 27),  // VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT + VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT

	transfer_read                = (1ull << 28),  // VK_ACCESS_TRANSFER_READ_BIT + VK_PIPELINE_STAGE_TRANSFER_BIT
	transfer_write               = (1ull << 29),  // VK_ACCESS_TRANSFER_WRITE_BIT + VK_PIPELINE_STAGE_TRANSFER_BIT

	host_read                    = (1ull << 30),  // VK_ACCESS_HOST_READ_BIT + VK_PIPELINE_STAGE_HOST_BIT
	host_write                   = (1ull << 31),  // VK_ACCESS_HOST_WRITE_BIT + VK_PIPELINE_STAGE_HOST_BIT

	memory_read                  = (1ull << 32),  // VK_ACCESS_MEMORY_READ_BIT
	memory_write                 = (1ull << 33),  // VK_ACCESS_MEMORY_WRITE_BIT
};

You get the idea - only certain combinations of stage + access are allowed by the spec, and enumerating them made it far clearer which to pick.  I can then convert these directly to the associated stage + access masks without any loss of expressiveness/performance (or at least there shouldn't be, if I understand things correctly).
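To illustrate how such a combined enum can collapse back into the masks vkCmdPipelineBarrier() wants, here is a minimal self-contained sketch covering just two of the entries (the numeric VK_* bit values are copied from the Vulkan 1.0 headers; the `expand` helper and its names are my own invention, not from the post above):

```cpp
#include <cstdint>
#include <cassert>

// Spec-defined bit values, copied from vulkan_core.h (Vulkan 1.0):
constexpr uint32_t ACCESS_INDIRECT_COMMAND_READ = 0x00000001;  // VK_ACCESS_INDIRECT_COMMAND_READ_BIT
constexpr uint32_t ACCESS_SHADER_WRITE          = 0x00000040;  // VK_ACCESS_SHADER_WRITE_BIT
constexpr uint32_t STAGE_DRAW_INDIRECT          = 0x00000002;  // VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
constexpr uint32_t STAGE_COMPUTE_SHADER         = 0x00000800;  // VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT

// Two sample entries of the combined enum from the post:
enum class MemoryDependencyFlags : uint64_t {
    indirect_read        = (1ull << 0),
    shader_compute_write = (1ull << 20),
};

struct StageAccess { uint32_t stageMask; uint32_t accessMask; };

// Expand a set of combined flags into the (stageMask, accessMask) pair
// a barrier expects; OR-ing several flags accumulates both masks.
inline StageAccess expand(uint64_t flags) {
    StageAccess out{0, 0};
    if (flags & static_cast<uint64_t>(MemoryDependencyFlags::indirect_read)) {
        out.stageMask  |= STAGE_DRAW_INDIRECT;
        out.accessMask |= ACCESS_INDIRECT_COMMAND_READ;
    }
    if (flags & static_cast<uint64_t>(MemoryDependencyFlags::shader_compute_write)) {
        out.stageMask  |= STAGE_COMPUTE_SHADER;
        out.accessMask |= ACCESS_SHADER_WRITE;
    }
    return out;
}
```

Since each enum value pins down exactly one valid stage + access pairing, the expansion is a branch (or table lookup) per bit, so there is indeed no expressiveness lost.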


Copy that - it's good to compare your own guessing against the guessing of others :D

(I spot you don't cover the case of a compute shader writing an indirect dispatch count.)

 

I wonder if Events could help here: http://vulkan-spec-chunked.ahcox.com/ch06s03.html

I have not used them yet. Could I do something like triggering a memory barrier, processing some other work, then waiting on the barrier with a high chance it has already completed?
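For what it's worth, that is exactly the split-barrier pattern events are meant for. A rough, untested sketch of the idea (cmd and event are placeholder handles; the compute-to-compute stages/accesses are just an example):

```cpp
// Producer work, then signal the event once its writes are done:
vkCmdDispatch(cmd, groupCountX, 1, 1);                    // producer
vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

// ... record unrelated work here that can overlap with the producer ...

// Only now wait; if enough work sat in between, the wait is likely free:
VkMemoryBarrier barrier = {};
barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
vkCmdWaitEvents(cmd, 1, &event,
                VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,     // srcStageMask
                VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,     // dstStageMask
                1, &barrier, 0, nullptr, 0, nullptr);

vkCmdDispatch(cmd, groupCountX, 1, 1);                    // consumer
```

Unlike vkCmdPipelineBarrier, which splits the whole queue at one point, the set/wait pair only orders the commands before the set against the commands after the wait, so everything in between can overlap.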

I really need a "Vulkan for Dummies" that tells me some use cases for such things...

 

5 minutes ago, JoeJ said:

(I spot you don't cover the case of a compute shader writing an indirect dispatch count.)

I'm not sure exactly what you mean.  The flags are pretty much taken verbatim from Table 4 of the spec: https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VkPipelineStageFlagBits (scroll down a screen or two).

I haven't played around with indirect stuff yet.  I'm assuming you write the commands to a buffer (either through memmap/staging buffer/copy, or through a compute shader or similar), then use that buffer as the source for the indirect command, correct?  If I was transferring from the host then I'd use host_write or transfer_write as my source flags (depending on whether or not I used a staging buffer), and indirect_read as my dest flags.  If I were computing the buffer on the fly, would you not use shader_compute_write as src and indirect_read as dest?
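For the compute-generated case, that shader_compute_write → indirect_read pairing would translate into something like the following (untested sketch; cmd and argsBuffer are placeholder handles):

```cpp
// Barrier between the compute shader that writes the indirect arguments
// and the draw/dispatch that consumes them:
VkBufferMemoryBarrier barrier = {};
barrier.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.srcAccessMask       = VK_ACCESS_SHADER_WRITE_BIT;          // compute wrote the args
barrier.dstAccessMask       = VK_ACCESS_INDIRECT_COMMAND_READ_BIT; // indirect fetch reads them
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.buffer              = argsBuffer;
barrier.offset              = 0;
barrier.size                = VK_WHOLE_SIZE;

vkCmdPipelineBarrier(cmd,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // src: the writing stage
                     VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,   // dst: where the args are consumed
                     0, 0, nullptr, 1, &barrier, 0, nullptr);
```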

24 minutes ago, JoeJ said:

I really need Vulkan for dummies that tells me some usecases of such things...

Isn't that an oxymoron :)

11 minutes ago, Ryan_001 said:

If I were computing the buffer on the fly would you not use shader_compute_write as src, and indirect_read as dest?

Oh sorry, yes - I confused it with the dstStageMask from vkCmdPipelineBarrier().

This topic is closed to new replies.
