
Write only buffers/states for Compute Shader

Started by July 30, 2018 12:40 PM
6 comments, last by pcmaster 6 years, 6 months ago

As far as I can see, the only buffer type available for writing is the UAV. I have read that it has some limits on the number of simultaneous writes.
For my compute pass, I only need to write to a buffer, and the writes don't overlap. Wouldn't it be more efficient if the API offered write-only buffers (convertible to other types through barriers)?
Although I read somewhere that a Render Target is internally a UAV too, so maybe a UAV is just as optimized for writes as a write-only buffer would be.

What can you tell me about this?

(I don't mean a buffer that stays write-only forever. I mean a buffer that can become write-only and, in that state, be optimized for non-overlapping writes only.)
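To make the access pattern concrete, here is a minimal CPU-side C++ sketch of what is meant by non-overlapping, write-only output. It is not shader code: the function name and the value written are hypothetical, and the loop only stands in for a compute dispatch where each thread owns exactly one slot.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU stand-in for a compute dispatch.
// The vector plays the role of the UAV buffer; each loop iteration
// plays the role of one thread.
std::vector<std::uint32_t> runWriteOnlyPass(std::size_t threadCount) {
    std::vector<std::uint32_t> out(threadCount, 0);
    for (std::size_t tid = 0; tid < threadCount; ++tid) {
        // The buffer is only written, never read, and tid is unique,
        // so no two "threads" ever touch the same element.
        out[tid] = static_cast<std::uint32_t>(tid * 2u);
    }
    return out;
}
```

Nothing here says anything about how the hardware treats such writes; it only pins down the pattern the question is about.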

2 hours ago, NikiTo said:

What can you tell me about this?

Not much, but I remember AMD considering exposing special options for OpenCL. The idea was to bypass the cache, probably to reduce cache thrashing or to get faster reads.

However, Vulkan has options to set read/write states individually: https://www.khronos.org/registry/vulkan/specs/1.1-extensions/man/html/VkAccessFlagBits.html

I wonder whether DX12 really has nothing like this?


Render Targets (and Depth-Stencil Targets) won't be UAVs internally, at least not on AMD GCN. Where did you read it?

AMD pixel shaders have the exp instruction which exports colour and those writes go through a different cache hierarchy, namely the colour-block cache and bypass the L1 and L2 caches completely. The dedicated colour-block hardware ("output-merger"-ish) serialises concurrent writes (there is NO guarantee that 2 polygons won't write to the same pixel, nor the order they'll do so), handles blending, etc.

UAV writes, on the other hand, use the image_store class of instructions, which just write to an address, usually via L2, bypassing L1.

For reference on what the HW is capable of, you can have a quick look at the AMD Southern Islands Instruction Set PDF.

Does anyone know how different this is on NVIDIA or Intel?

Why do you think that writing to a UAV will not be "optimised"?

35 minutes ago, JoeJ said:

However, Vulkan has options to set read/write states individually: https://www.khronos.org/registry/vulkan/specs/1.1-extensions/man/html/VkAccessFlagBits.html

Interesting
 

14 minutes ago, pcmaster said:

Why do you think that writing to a UAV will not be "optimised"?

13 minutes ago, pcmaster said:

Where did you read it?

I read that a UAV has no problem with many simultaneous reads, but has a limit on concurrent writes that is much lower than the limit for reads.
I can't recall the links where I read all this.

https://docs.microsoft.com/en-us/windows/desktop/api/d3d11/nf-d3d11-id3d11devicecontext-omsetrendertargetsandunorderedaccessviews

This link mentions the connection between the RTV and UAV.
 

7 minutes ago, NikiTo said:

Interesting

Notice that these are flags you set for barriers. (It's not that you declare memory to be write-only.)

Pretty sure DX12 has this too, but it's not exactly what you asked for, I guess.
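To illustrate the point about barrier flags, here is a minimal C++ sketch. It is not real Vulkan code: the enum, struct and function names are hypothetical stand-ins so it compiles without the Vulkan SDK, and only the two bit values are taken from VkAccessFlagBits (VK_ACCESS_SHADER_READ_BIT = 0x20, VK_ACCESS_SHADER_WRITE_BIT = 0x40). It shows that the "written only, then read only" information sits on the barrier between two passes, not on the buffer itself.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in constants. The values mirror VK_ACCESS_SHADER_READ_BIT and
// VK_ACCESS_SHADER_WRITE_BIT; the names are hypothetical.
enum AccessFlagBits : std::uint32_t {
    kAccessShaderRead  = 0x00000020,
    kAccessShaderWrite = 0x00000040,
};

// Hypothetical, heavily simplified analogue of the access-mask pair
// found in a Vulkan buffer memory barrier.
struct BufferBarrier {
    std::uint32_t srcAccessMask; // accesses issued before the barrier
    std::uint32_t dstAccessMask; // accesses issued after the barrier
};

// True if the work before the barrier wrote from shaders but did not read.
inline bool sourceIsWriteOnly(const BufferBarrier& b) {
    return (b.srcAccessMask & kAccessShaderWrite) != 0 &&
           (b.srcAccessMask & kAccessShaderRead) == 0;
}

// Barrier between a compute pass that only writes the buffer and a
// later pass that only reads it.
inline BufferBarrier writeOnlyToReadOnly() {
    BufferBarrier b{kAccessShaderWrite, kAccessShaderRead};
    assert(sourceIsWriteOnly(b));
    return b;
}
```

Whether a driver uses a write-only source mask for anything beyond synchronisation and cache flushes is not something this sketch can show.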

1 hour ago, JoeJ said:

but it's not exactly what you asked for, I guess.

I want it to be faster like this:

1 hour ago, pcmaster said:

bypass the L1 and L2 caches completely

Those flags could just control the access and nothing more. Who knows...

I could now cover the output with triangles and use a pixel shader, but I would lose the advantage of LDS.
Unless I try Shader Model 6, which for some reason offers compute shader instructions inside the pixel shader. I don't know whether LDS can be used from a pixel shader with SM6.

Of course, this would add work for the rasterizer, which on AMD is not dedicated...


On GCN, LDS and GDS are totally usable from pixel shaders. It just isn't exposed in DX11 or DX12 SM5 (for obvious reasons), so it isn't going to help you on PC :(

Also, from AMD GCN block diagrams - 1 rasteriser reads 1 triangle per cycle and outputs 16 pixels per cycle. After shading, each Render Back-End (there are usually 2) can do multiple blends, depth and stencil samples per cycle. Looks pretty dedicated to me :)

 

 
