
Descriptors and Resources in relation to Warps/Wavefronts

Started by June 14, 2019 05:58 PM
4 comments, last by Funkymunky 5 years, 7 months ago

Does anyone know how the GPU's scheduler uses the descriptors?  Does it attach them to the warp/wavefront directly, or does it look up the associated resource first?  In other words, when a warp/wavefront runs and accesses buffers/textures, is each thread bouncing through the indirection of the descriptor every time, or does it have direct access to the resource at that point?

I ask because it seems like lining up contiguous tables in the descriptor heap to reduce cache misses could be useful, but only if that indirection doesn't happen per-thread.  Otherwise I'd be more inclined to just scatter the tables throughout the heap in a more space-efficient manner.

That depends a little on which hardware you are on. In general, things like root constants are pre-loaded into SGPRs, so there is no indirection in the shader, but there is a limit to this, after which things get 'promoted' to other memory. Some hardware is more flexible about what it can access through raw pointers vs. descriptors (e.g. some unified-memory consoles can do a lot more through raw pointers).
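To make the root-constant case concrete, here's a minimal C++ sketch (using the d3dx12.h helpers; the parameter counts and register assignments are arbitrary choices of mine, not anything from this thread) of a root signature that mixes root constants, which can end up pre-loaded into SGPRs, with a descriptor table, which is just a pointer/offset the shader follows:

#include <d3d12.h>
#include "d3dx12.h" // helper structs; ships with the official D3D12 samples

void BuildRootSignatureDescSketch()
{
    CD3DX12_ROOT_PARAMETER1 params[2];

    // Root parameter 0: four 32-bit root constants at b0. Small enough
    // that a driver can plausibly preload them into scalar registers.
    params[0].InitAsConstants(/*num32BitValues*/ 4, /*shaderRegister*/ 0);

    // Root parameter 1: a table of 8 SRVs (t0..t7). The shader receives
    // only a base offset into the descriptor heap for this.
    CD3DX12_DESCRIPTOR_RANGE1 range;
    range.Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 8, /*baseShaderRegister*/ 0);
    params[1].InitAsDescriptorTable(1, &range);

    CD3DX12_VERSIONED_ROOT_SIGNATURE_DESC desc;
    desc.Init_1_1(2, params, 0, nullptr,
                  D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);

    // At record time the constants are written straight into the command list:
    // cmdList->SetGraphicsRoot32BitConstants(0, 4, myFourDwords, 0);
}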

Also, some older NVIDIA GPUs seem to want frequently accessed constant buffer data stored early in the buffer, because there is seemingly special hardware to prefetch it (I couldn't find the link offhand, but you can google it). On much newer hardware that stuff seems to matter less. (A good anecdote: on a project I did, using one large buffer descriptor was 2x slower on the GPU than using dedicated constant buffers on a Tegra X1, whereas on a GTX 1080 there was near-zero difference.) That said, I haven't seen similar benefits on AMD hardware, so YMMV.

Quote

I ask because it seems like lining up contiguous tables in the descriptor heap to reduce cache misses could be useful, but only if that indirection doesn't happen per-thread.

Just to be clear, a cache miss on the GPU isn't terrible by definition. As long as you can hide the latency (similar to a texture fetch) with another warp/wavefront, you won't notice it too much. In my experience it is generally better to keep VGPR pressure low. That's not to say there is no benefit in aligning the descriptors better, but VGPR usage translates well across a larger subset of hardware, whereas descriptor layout seems to be more finicky per hardware. Spending a lot of time on that is totally fine if you're optimizing for PS4 or Xbox One, but less so if you are targeting the PC/mobile market.

Quote

Does anyone know how the GPU's scheduler uses the descriptors?

I find that AMD is particularly open about this, but NVIDIA is fairly secretive. If anyone does find good links, please post them, as I would like to know more.


Ah, so that brings up a good point with the root constants.  Those are copied directly into the root arguments when you call SetGraphicsRoot32BitConstants.  So then does SetGraphicsRootDescriptorTable copy in the descriptors from the heap, or just tell the shader where to look in the heap for them?  If it copies them, I see no reason to be overly concerned with the locality of the descriptors.

On recent AMD HW (pre-Navi) the scheduler isn't really involved, except for setting up the "user" SGPRs, which are the first 16 SGPRs available to the shader. These will generally be used to pass in the data represented in your root signature: pointers (or offsets) for descriptor tables, root constants, root SRVs, etc. The shader core itself is then responsible for loading a particular descriptor into SGPRs (which in the simple case is done by loading data at a static offset from the start of a descriptor table) and then passing those SGPRs as a parameter to the texture loading instructions. In more complex cases (bindless), the descriptor offset/index will not be static, and will instead come from another buffer or be calculated from another value. Either way, the load of the descriptor from memory into SGPRs goes through the scalar K$ cache, so it may be to your benefit to group your descriptors together within your heap in order to reduce cache misses.
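As a concrete illustration of that grouping, here's a hedged C++ sketch (the function name and parameters are my own invention; in a real engine you would sub-allocate ranges from one big shader-visible heap rather than create a heap per material) that writes one material's SRVs into adjacent heap slots, so every descriptor the table covers sits on as few cache lines as possible:

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

ComPtr<ID3D12DescriptorHeap> PackMaterialDescriptors(
    ID3D12Device* device, ID3D12Resource* const* textures, UINT count)
{
    // A shader-visible CBV/SRV/UAV heap sized exactly for this group.
    D3D12_DESCRIPTOR_HEAP_DESC heapDesc = {};
    heapDesc.Type           = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
    heapDesc.NumDescriptors = count;
    heapDesc.Flags          = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;

    ComPtr<ID3D12DescriptorHeap> heap;
    device->CreateDescriptorHeap(&heapDesc, IID_PPV_ARGS(&heap));

    const UINT stride = device->GetDescriptorHandleIncrementSize(
        D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
    D3D12_CPU_DESCRIPTOR_HANDLE cpu = heap->GetCPUDescriptorHandleForHeapStart();

    // Back-to-back slots: a descriptor table pointed at slot 0 then covers
    // a contiguous run of descriptors in memory.
    for (UINT i = 0; i < count; ++i)
    {
        device->CreateShaderResourceView(textures[i], nullptr, cpu);
        cpu.ptr += stride;
    }
    return heap;
}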

Nvidia is much more opaque about all of this, but they support similar functionality to AMD, so I would guess it's not terribly different under the hood (at least for Kepler and later hardware that supports Tier 2/3 resource binding). They have also mentioned having a special cache for descriptors with an associated performance counter in Nsight, so you could check that out if you want to see whether you're getting a lot of descriptor cache misses.

15 minutes ago, Funkymunky said:

So then does SetGraphicsRootDescriptorTable copy in the descriptors from the heap, or just tell the shader where to look in the heap for them?

It totally depends on the hardware, but for recent desktop GPUs it's most likely going to be the latter (like I mentioned above, AMD GCN hardware definitely works this way, since descriptors are just loaded from memory). However, D3D12 (and Vulkan) are abstracted in such a way that the driver could certainly do something much more complicated under the hood. For instance, the hardware might have an older-style setup where texture descriptors live in actual physical hardware registers, so the descriptor table has to get copied into those registers before a draw can access them. Anything that reports itself as Tier 1 for resource binding (or doesn't support VK_EXT_descriptor_indexing on Vulkan) almost certainly has some weird stuff going on under the hood, hence weird limitations like this one (a sketch of what it means in practice follows the quote):

Quote

There is an additional restriction for Tier 1 hardware that applies to all heaps, and to Tier 2 hardware that applies to CBV and UAV heaps, that all descriptor heap entries covered by descriptor tables in the root signature must be populated with descriptors by the time the shader executes, even if the shader (perhaps due to branching) does not need the descriptor.
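For illustration, a minimal C++ sketch of what that restriction means in practice, assuming `device`, `cmdList`, a CPU handle `cpu` to the first unused table slot, and a GPU handle `tableStartGpu` to the table start all exist elsewhere (none of these names come from the thread): unused slots covered by the table are padded with null descriptors (a null resource plus an explicit view description), and binding the table afterwards just records a base address into the heap, with nothing copied at that point on hardware that reads descriptors from memory:

#include <d3d12.h>

void PadAndBindTable(ID3D12Device* device,
                     ID3D12GraphicsCommandList* cmdList,
                     D3D12_CPU_DESCRIPTOR_HANDLE cpu,
                     D3D12_GPU_DESCRIPTOR_HANDLE tableStartGpu,
                     UINT unusedCount, UINT rootParamIndex)
{
    // A null descriptor still needs a full view description so the hardware
    // knows what "nothing" looks like (reads return zero, writes are dropped).
    D3D12_SHADER_RESOURCE_VIEW_DESC nullSrv = {};
    nullSrv.Format                  = DXGI_FORMAT_R8G8B8A8_UNORM;
    nullSrv.ViewDimension           = D3D12_SRV_DIMENSION_TEXTURE2D;
    nullSrv.Shader4ComponentMapping = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;
    nullSrv.Texture2D.MipLevels     = 1;

    const UINT stride = device->GetDescriptorHandleIncrementSize(
        D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);

    // Fill every covered-but-unused slot so Tier 1/2 hardware never reads
    // an uninitialized descriptor, even down branches the shader skips.
    for (UINT i = 0; i < unusedCount; ++i)
    {
        device->CreateShaderResourceView(nullptr, &nullSrv, cpu);
        cpu.ptr += stride;
    }

    // Hands the shader a GPU address/offset into the heap; no descriptor
    // data is copied here on memory-based-descriptor hardware.
    cmdList->SetGraphicsRootDescriptorTable(rootParamIndex, tableStartGpu);
}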


Awesome, thanks to both of you for the thorough and illuminating replies.  So it sounds like the normal rule of thumb applies: don't go nuts over-engineering something to mitigate cache misses, but if you can group some commonly used items together contiguously, then yeah, go for it.

