On AMD hardware, it is a simultaneous dispatch of 64 threads, if I am not mistaken. The numbers also add up for that: 1280*720/64 = 14400.
On NVIDIA, these are called warps, and each one is a group of 32 threads.
1 hour ago, matt77hias said: What is one wave?
A wave is a group of threads that run in lockstep with one another when running shaders. NVIDIA calls it a "warp"; it's also called a "wavefront".
On NVIDIA hardware this has traditionally been 32 threads all running in lockstep, on AMD GCN hardware this number is 64.
To render a full-screen triangle at 1920x1080 you have to render 2,073,600 pixels. Once those get batched up into 'waves' (groups) of 64, you require 32,400 of them to render the entire screen. If there's any inefficiency due to using two triangles, it would increase the number of 2x2 quads that get launched, and consequently the number of waves would go up as well. Even at 720p the amount of wasted threads is only about 0.2%, and at 1080p it's even lower at 0.14%.
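For anyone who wants to check the arithmetic, here is a minimal sketch of the wave-count calculation. The function name is made up for illustration, and it assumes the ideal case where every wave is completely filled with useful pixels.

```python
WAVE_SIZE_GCN = 64  # threads per wave on AMD GCN, as described above


def ideal_wave_count(width, height, wave_size=WAVE_SIZE_GCN):
    """Minimum number of waves needed to shade every pixel exactly once.

    Hypothetical helper: it assumes perfect packing, i.e. no partially
    filled waves apart from a possible last one.
    """
    pixels = width * height
    # Round up, because a partially filled wave still costs a whole wave.
    return (pixels + wave_size - 1) // wave_size


print(ideal_wave_count(1280, 720))   # 14400
print(ideal_wave_count(1920, 1080))  # 32400
```

Passing `wave_size=32` gives the equivalent count for 32-wide warps.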
Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group
22 hours ago, turanszkij said: You are probably good with the input layout reflected from the shader, though I am not used to doing that. I have a few layouts which I create by hand and share across different shaders.
Apparently, no input layout needs to be created. You can just bind nullptr.
I use the same template class (but with different instantiations) for all shader types except for the vertex shader since I store the input layout (which feels a bit redundant) as well.
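A likely reason no input layout is needed: the vertex shader for a full-screen triangle can derive its positions purely from the vertex index (SV_VertexID in HLSL), so there is no vertex buffer to describe. Below is a rough sketch of that index math, written in Python just to show the numbers. The function name and the top-left UV convention are my assumptions, not something taken from anyone's engine here.

```python
def fullscreen_triangle_vertex(vertex_id):
    """Clip-space position and UV for vertex 0, 1 or 2 of an oversized triangle.

    Hypothetical helper mirroring the usual bit trick; assumes UV (0, 0)
    is the top-left corner and clip-space y points up.
    """
    u = float((vertex_id << 1) & 2)  # 0, 2, 0 for ids 0, 1, 2
    v = float(vertex_id & 2)         # 0, 0, 2 for ids 0, 1, 2
    x = u * 2.0 - 1.0                # map UV 0..2 to clip-space -1..3
    y = 1.0 - v * 2.0                # flip y so v grows downwards
    return (x, y), (u, v)


for vid in range(3):
    print(vid, fullscreen_triangle_vertex(vid))
```

The three vertices come out at (-1, 1), (3, 1) and (-1, -3), so the triangle's long edge passes exactly through the viewport corner (1, -1) and the whole [-1, 1] square is covered without a diagonal seam.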
🧙
Could you please respond to the following post by ajmiles:
14 hours ago, ajmiles said: That's not my experience of full-screen triangles on GCN.
Using probably the same tools you're talking about, I can see that a single full-screen triangle at 1080p on an Xbox One generates exactly 32,400 waves and the Radeon RGP Profiler measures exactly 14,400 waves for a 720p full-screen triangle on PC.
Two triangles produce 14,433 waves for 720p on PC and 32,446 waves for 1080p on Xbox.
YMMV on other consoles, but it's not an AMD/GCN thing, and it's definitely not true to say it won't work on AMD GPUs on Windows.
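To put those measured counts in perspective, here is a small sketch that turns them into a relative overhead. The wave counts are the ones quoted above; the helper name is just for illustration.

```python
def extra_wave_percent(measured_waves, ideal_waves):
    """Extra waves launched, as a percentage of the ideal count."""
    return 100.0 * (measured_waves - ideal_waves) / ideal_waves


# Two-triangle counts versus single-triangle counts, as quoted above.
overhead_720p = extra_wave_percent(14433, 14400)
overhead_1080p = extra_wave_percent(32446, 32400)

print(f"720p:  {overhead_720p:.2f}%")   # 0.23%
print(f"1080p: {overhead_1080p:.2f}%")  # 0.14%
```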
I'm very curious what the consensus on this technique is.
-potential energy is easily made kinetic-
I'm personally using the fullscreen triangle trick. I haven't profiled it against the fullscreen quad (two triangles with a diagonal seam).
While we're on the topic, NV has a vendor-specific extension that rasterizes every pixel within the 2D AABB of a primitive -- so if you drew a triangle that covered one diagonal half of the screen, its AABB would cover the whole screen, and if this extension is active at the time, your shader would run for every pixel.
Another option is to implement your full-screen passes via compute shaders.
. 22 Racing Series .
Hmm, I would have to double-check. If someone says it works, then I don't see the point in camping on my position. It is possible we set a scissor rectangle to the size of the viewport for convenience, or something like that, and it screws up the triangle.
I would not even dare fix it if that is the cause; the gains are unlikely to rise above the noise in the profiling, and most of our full-screen passes are compute. The latter is not always the best choice, to be fair, as it is a trade-off of L2 cache bandwidth versus the ROPs, manual sRGB conversions and a few other considerations, but full-screen passes are rarely alone and we can save a few flushes/invalidations by staying in compute all the way.
3 hours ago, Hodgman said: Another option is to implement your full-screen passes via compute shaders
What would be the benefits of the CS? Is it just skipping the pipeline until PS (VS and RS, specifically)?
4 hours ago, matt77hias said: What would be the benefits of the CS? Is it just skipping the pipeline until PS (VS and RS, specifically)?
The upside is that you skip the IA->VS->RS->PS->OM pipeline with all the state setup involved, and instead you have a simple CS pipeline. In the compute shader, it is you who assigns the workload, so there should be no confusion as to how the threads are dispatched. The CS pipeline can sometimes run in parallel with the graphics pipeline in newer APIs (I have no experience with that yet).
The downside is that if you are running CS on the graphics pipe, you have to switch execution from the rasterizing pipeline to compute, which involves flushing the graphics buffer, then executing your compute, waiting for it to finish, and resuming your graphics processing after that.
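On the point about assigning the workload yourself, here is a minimal sketch of the bookkeeping a full-screen compute pass typically needs: how many thread groups to dispatch, and which pixel each thread handles. The 8x8 group size (64 threads) and the function names are assumptions for illustration, not a requirement of any API.

```python
def dispatch_size(width, height, group_w=8, group_h=8):
    """Number of thread groups in x and y needed to cover the whole image."""
    # Round up so a partial group still covers the right and bottom edges.
    groups_x = (width + group_w - 1) // group_w
    groups_y = (height + group_h - 1) // group_h
    return groups_x, groups_y


def pixel_for_thread(group_id, thread_id, group_w=8, group_h=8):
    """Pixel coordinate handled by one thread (group-major layout)."""
    return (group_id[0] * group_w + thread_id[0],
            group_id[1] * group_h + thread_id[1])


gx, gy = dispatch_size(1920, 1080)
print(gx, gy, gx * gy)  # 240 135 32400
```

When the resolution is not a multiple of the group size, the extra threads along the edges have to be masked out with a bounds check in the shader.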
The other advantage is that the CS allows for implementation of different styles of algorithms that aren't possible in a PS.
The CS has access to (up to) 32KiB of local storage within a workgroup, which allows different threads/pixels to share data with each other.
e.g. for a 3x3 pixel blur filter in a PS, every pixel will fetch 9 texels from the source texture, blend them, and output a single result.
In a CS implementation, you might get 64 threads to read a 10x10 area of pixels into the local store (~1.5 texel reads per pixel instead of 9!), and then each thread can fetch the 9 texels that it requires from the local store (which is basically free compared to reading from memory), blend them, and output a single result each / an 8x8 area of results total.
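To make the read counts concrete, here is a CPU-side sketch in Python that simulates one 8x8 group doing the tiled blur. It is only a model of the idea: the "local store" is a plain list, the border handling (clamp to edge) is my assumption, and a real shader would need a barrier between the load phase and the blend phase.

```python
GROUP = 8         # 8x8 = 64 threads per group
TILE = GROUP + 2  # 10x10 texels: the group's pixels plus a 1-texel border


def clamp(v, lo, hi):
    return max(lo, min(v, hi))


def blur_gather(src, x, y):
    """PS-style reference: 9 reads from the source image for one pixel."""
    h, w = len(src), len(src[0])
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            total += src[clamp(y + dy, 0, h - 1)][clamp(x + dx, 0, w - 1)]
    return total / 9.0


def blur_group(src, gx, gy):
    """CS-style: blur one 8x8 group, returning (results, source_reads)."""
    h, w = len(src), len(src[0])
    x0, y0 = gx * GROUP - 1, gy * GROUP - 1  # top-left texel of the tile
    reads = 0
    tile = []
    # Phase 1: load the 10x10 tile from the source image exactly once.
    for ty in range(TILE):
        row = []
        for tx in range(TILE):
            row.append(src[clamp(y0 + ty, 0, h - 1)][clamp(x0 + tx, 0, w - 1)])
            reads += 1
        tile.append(row)
    # Phase 2: each "thread" blends its 3x3 neighbourhood from the tile only.
    out = []
    for ly in range(GROUP):
        out_row = []
        for lx in range(GROUP):
            total = 0
            for dy in range(3):
                for dx in range(3):
                    total += tile[ly + dy][lx + dx]
            out_row.append(total / 9.0)
        out.append(out_row)
    return out, reads


image = [[(x * 7 + y * 13) % 31 for x in range(16)] for y in range(16)]
results, reads = blur_group(image, 1, 1)
print(reads, reads / (GROUP * GROUP))  # 100 1.5625
```

The group's output matches the per-pixel gather for the same pixels, but it touches the source image 100 times instead of 64 * 9 = 576.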
Also, compute shaders don't hard-code the destination / result location. You can write code that works like a normal PS, where each thread outputs to a specific pixel on the screen... or you can write code where each thread writes to a random pixel, or multiple pixels, or writes to a pixel using a dynamic offset (scattered writes).
e.g. the normal way of writing a blur is to "gather" (above, each destination pixel reads from 9 source pixels), but a "scatter" version would have every source pixel write to 9 destination pixels!
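Here is a small CPU-side sketch of the two styles, to show that they produce the same image for a symmetric 3x3 kernel. It runs serially, so the scatter loop can just accumulate into the destination; on a GPU, many threads adding to the same destination pixel would need atomics or some other synchronisation. The function names are made up, out-of-range neighbours are simply skipped, and the sums are left undivided by 9 to keep the comparison in exact integers.

```python
OFFSETS = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def blur_gather_image(src):
    """Each destination pixel reads from up to 9 source pixels."""
    h, w = len(src), len(src[0])
    dst = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dx, dy in OFFSETS:
                sx, sy = x + dx, y + dy
                if 0 <= sx < w and 0 <= sy < h:
                    dst[y][x] += src[sy][sx]
    return dst


def blur_scatter_image(src):
    """Each source pixel writes to up to 9 destination pixels."""
    h, w = len(src), len(src[0])
    dst = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dx, dy in OFFSETS:
                tx, ty = x + dx, y + dy
                if 0 <= tx < w and 0 <= ty < h:
                    dst[ty][tx] += src[y][x]
    return dst


sample = [[(x * 5 + y * 3) % 11 for x in range(6)] for y in range(4)]
print(blur_gather_image(sample) == blur_scatter_image(sample))  # True
```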
When you first start using CS instead of PS, you get very small improvements (e.g. from skipping the IA->VS->RS costs)... but afterwards, many new doors open for completely different approaches to problems.