Drawing with a nullptr vertex and index buffer in D3D11

Matthias Moulin · 2017-09-06T11:29:05

Apparently, you can render a quad directly in D3D10 and later without using vertices and indices. The approach is:

"In order to render a full screen quad, you will need to set both index and vertex buffers to null. Set the topology to triangle strip and call Draw with four vertices starting from position zero."

device_context->IASetVertexBuffers(0, 1, nullptr, {???}, {0});
device_context->IASetIndexBuffer(nullptr, ???, 0);
device_context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLESTRIP);
device_context->Draw(4, 0);

What index format and stride size do you need to use? Is it good practice to use this approach for other "frequently" used meshes as well (such as a cube of lines for visualizing bounding boxes)?
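For readers following the thread: with no vertex buffer bound, the vertex shader has to produce positions itself, typically from the SV_VertexID system value. The following is only a CPU-side C++ sketch of one possible mapping for the four strip vertices. The names quadCorner and Float2 are hypothetical, chosen for illustration, and are not part of D3D11 or of the quoted source.

```cpp
#include <cassert>
#include <cstdint>

struct Float2 { float x, y; };

// Hypothetical helper: emulates what a vertex shader could compute from
// SV_VertexID when no vertex buffer is bound. IDs 0..3 form a triangle strip.
Float2 quadCorner(std::uint32_t vertexId) {
    assert(vertexId < 4);
    // Bit 0 selects left/right, bit 1 selects top/bottom.
    const float u = static_cast<float>(vertexId & 1u);
    const float v = static_cast<float>(vertexId >> 1);
    // Map [0,1] texture-style coordinates to clip space [-1,1], with y pointing up.
    return { u * 2.0f - 1.0f, 1.0f - v * 2.0f };
}
```

Under this mapping the IDs 0, 1, 2, 3 land on the top-left, top-right, bottom-left and bottom-right corners of clip space, which is the order a four-vertex triangle strip needs to cover the screen.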


Started by matt77hias September 04, 2017 07:13 PM

18 comments, last by Hodgman 7 years, 5 months ago

turanszkij

545

September 05, 2017 01:31 PM

On AMD hardware, it is a simultaneous dispatch of 64 threads, if I am not mistaken. The numbers also add up for that: 1280*720/64 = 14400.

On Nvidia, these are called warps, and a warp is a group of 32 threads.

Wicked Engine

Adam Miles

3,468

September 05, 2017 01:32 PM

1 hour ago, matt77hias said:
What is one wave?

A wave is a group of threads that run in lockstep with one another when running shaders. NVIDIA calls it a "warp"; it's also called a "wavefront".

On NVIDIA hardware this has traditionally been 32 threads all running in lockstep; on AMD GCN hardware this number is 64.

To render a full-screen triangle at 1920x1080 you have to render 2,073,600 pixels. Once those get batched up into 'waves' (groups) of 64, you require 32,400 of them to render the entire screen. Any inefficiency from using two triangles would increase the number of 2x2 quads that get launched, and consequently the number of waves would go up as well. Even at 720p the amount of wasted threads is only 0.2%, and at 1080p it's even lower at 0.14%.
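The arithmetic behind these figures can be sketched in a few lines of C++. The function names are hypothetical. The measured two-triangle wave counts used in the usage note (14,433 at 720p and 32,446 at 1080p) are the ones reported elsewhere in this thread, not new data.

```cpp
#include <cassert>
#include <cstdint>

// Waves needed to cover a render target if every wave is completely full.
// waveSize is 64 on AMD GCN and traditionally 32 on NVIDIA, per the posts above.
constexpr std::uint64_t idealWaves(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t waveSize) {
    return (width * height + waveSize - 1) / waveSize;  // round up
}

// Extra waves launched, as a percentage of the ideal count.
double overheadPercent(std::uint64_t measuredWaves, std::uint64_t ideal) {
    assert(ideal > 0 && measuredWaves >= ideal);
    return 100.0 * static_cast<double>(measuredWaves - ideal) /
           static_cast<double>(ideal);
}
```

With a wave size of 64, idealWaves gives 14,400 for 1280x720 and 32,400 for 1920x1080. Feeding in the measured two-triangle counts gives roughly 0.23% and 0.14% overhead, which lines up with the percentages quoted above.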

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

matt77hias

Author

560

September 05, 2017 06:15 PM

22 hours ago, turanszkij said:
You are probably good with the input layout reflected from the shader, though I am not used to doing that. I have a few layouts which I create by hand and share across different shaders.

Apparently, no input layout needs to be created. You can just use and bind nullptr.

I use the same template class (but with different instantiations) for all shader types except for the vertex shader since I store the input layout (which feels a bit redundant) as well.

🧙

Infinisearch

3,058

September 06, 2017 01:50 AM

@galop1n

Could you please respond to the following post by ajmiles:

14 hours ago, ajmiles said:
That's not my experience of full-screen triangles on GCN.
Using probably the same tools you're talking about, I can see that a single full-screen triangle at 1080p on an Xbox One generates exactly 32,400 waves and the Radeon RGP Profiler measures exactly 14,400 waves for a 720p full-screen triangle on PC.
Two triangles produce 14,433 waves for 720p on PC and 32,446 waves for 1080p on Xbox.
YMMV on other consoles, but it's not an AMD/GCN thing, and it's definitely not true to say it won't work on AMD GPUs on Windows.

I'm very curious what the consensus on this technique is.

-potential energy is easily made kinetic-

Hodgman

52,718

September 06, 2017 02:27 AM

I'm personally using the fullscreen triangle trick. I haven't profiled it against the fullscreen quad (two triangles with a diagonal seam).
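For anyone unfamiliar with the trick: a single oversized triangle is drawn with three vertices and no vertex or index buffer, and the part outside the viewport is clipped away. Below is a CPU-side C++ sketch of one common vertex-ID formulation. The helper names are hypothetical, and this is not necessarily the exact code used in the engines mentioned in this thread.

```cpp
#include <cassert>
#include <cstdint>

struct Float2 { float x, y; };

// Hypothetical helper mirroring a common vertex-ID formulation of the
// full-screen triangle: three vertices, no vertex or index buffer.
Float2 fullscreenTriangleVertex(std::uint32_t vertexId) {
    assert(vertexId < 3);
    const float u = static_cast<float>((vertexId << 1) & 2u);  // 0, 2, 0
    const float v = static_cast<float>(vertexId & 2u);         // 0, 0, 2
    return { u * 2.0f - 1.0f, 1.0f - v * 2.0f };  // (-1,1), (3,1), (-1,-3)
}

// True if p lies inside or on the edge of triangle (a, b, c), for any winding.
bool insideTriangle(Float2 p, Float2 a, Float2 b, Float2 c) {
    auto edge = [](Float2 s, Float2 e, Float2 q) {
        return (e.x - s.x) * (q.y - s.y) - (e.y - s.y) * (q.x - s.x);
    };
    const float e0 = edge(a, b, p);
    const float e1 = edge(b, c, p);
    const float e2 = edge(c, a, p);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
}
```

The triangle's vertices sit at (-1, 1), (3, 1) and (-1, -3) in clip space, so all four corners of the [-1, 1] square fall inside it and there is no diagonal seam across the screen.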

While we're on the topic, NV has a vendor-specific extension that rasterizes every pixel within the 2D AABB of a primitive -- so if you drew a triangle that covered one diagonal half of the screen, its AABB would cover the whole screen, and if this extension is active at the time, your shader would run for every pixel.

Another option is to implement your full-screen passes via compute shaders.

. 22 Racing Series .

galop1n

1,046

September 06, 2017 04:11 AM

Hmm, I would have to double-check. If someone says it works, then I don't see the point of camping on my position. It is possible we set a scissor rectangle to the size of the viewport for convenience, or something like that, and it screws up the triangle.

I would not even dare fix it if that is the case: the gains are unlikely to rise above the level of noise in profiling, and most of our full-screen passes are compute. The latter is not always the best choice, to be fair, as it is a trade-off of L2 cache bandwidth versus the ROP, manual sRGB conversions and a few other considerations, but full-screen passes are rarely alone and we can save a few flushes/invalidations by staying in compute all the way.

matt77hias

Author

560

September 06, 2017 05:57 AM

3 hours ago, Hodgman said:
Another option is to implement your full-screen passes via compute shaders

What would be the benefits of the CS? Is it just skipping the pipeline until PS (VS and RS, specifically)?

🧙

turanszkij

545

September 06, 2017 10:57 AM

4 hours ago, matt77hias said:
What would be the benefits of the CS? Is it just skipping the pipeline until PS (VS and RS, specifically)?

The upside is that you skip the IA->VS->RS->PS->OM pipeline with all the state setup involved and instead you have a simple CS pipeline. In the compute shader, it is you who assigns the workload, so there should be no confusion as to how the threads are dispatched. The CS pipeline can sometimes run in parallel with the graphics pipeline in newer APIs (I've no experience with that yet).

The downside is that if you are running CS on the graphics pipe, you have to switch execution from the rasterizing pipeline to compute, which involves flushing the graphics buffer, then executing your compute, waiting for it to finish and resuming your graphics processing after that.
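To illustrate the point about assigning the workload yourself: for a full-screen compute pass you pick a thread-group size and dispatch enough groups to cover the target. A minimal C++ sketch follows, assuming an 8x8 group (64 threads, an illustrative choice rather than a recommendation from this thread). GroupCount and groupsFor are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct GroupCount { std::uint32_t x, y; };

// Number of thread groups needed so that groups of tileX x tileY threads
// cover every pixel. Edge groups may hang partly outside the target, so the
// shader would need a bounds check for sizes that are not multiples of the tile.
GroupCount groupsFor(std::uint32_t width, std::uint32_t height,
                     std::uint32_t tileX, std::uint32_t tileY) {
    assert(tileX > 0 && tileY > 0);
    return { (width + tileX - 1) / tileX, (height + tileY - 1) / tileY };
}
```

The resulting counts are what you would pass as the first two arguments of ID3D11DeviceContext::Dispatch, with 1 for the third. For 1280x720 this gives 160 x 90 = 14,400 groups, the same figure as the 64-thread wave count discussed earlier in the thread.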

Wicked Engine

Hodgman

52,718

September 06, 2017 11:29 AM

The other advantage is that the CS allows for implementation of different styles of algorithms that aren't possible in a PS.

The CS has access to (up to) 32KiB of local storage within a workgroup, which allows different threads/pixels to share data with each other.

e.g. for a 3x3 pixel blur filter in a PS, every pixel will fetch 9 texels from the source texture, blend them, and output a single result.
In a CS implementation, you might get 64 threads to read a 10x10 area of pixels into the local store (~1.5 texel reads per pixel instead of 9!), and then each thread can fetch the 9 texels that it requires from the local store (which is basically free compared to reading from memory), blend them, and output a single result each / an 8x8 area of results total.
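The read counts in this example can be checked with a CPU-side C++ sketch. This is only an emulation under stated assumptions: the local array stands in for group-shared memory, the loops run serially instead of as 64 threads, clamp-to-edge addressing is assumed, and sums are left unnormalised so integer results stay exact. Image, blurGather and blurTile are hypothetical names.

```cpp
#include <cassert>
#include <vector>

constexpr int kTile  = 8;          // 8x8 outputs per group, i.e. 64 "threads"
constexpr int kApron = kTile + 2;  // 10x10 inputs: the tile plus a 1-texel border

struct Image {
    int width, height;
    std::vector<int> texels;  // row-major
    mutable long reads;       // counts fetches from "memory"

    Image(int w, int h) : width(w), height(h), texels(w * h), reads(0) {}

    int fetch(int x, int y) const {  // clamp-to-edge addressing (an assumption)
        x = x < 0 ? 0 : (x >= width ? width - 1 : x);
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        ++reads;
        return texels[y * width + x];
    }
};

// Pixel-shader style: each output fetches its own 3x3 neighbourhood from memory.
int blurGather(const Image& src, int x, int y) {
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += src.fetch(x + dx, y + dy);
    return sum;
}

// Compute-shader style: load the 10x10 apron once, then read only the local copy.
void blurTile(const Image& src, int tileX, int tileY, int out[kTile][kTile]) {
    int local[kApron][kApron];  // stands in for group-shared memory
    for (int j = 0; j < kApron; ++j)
        for (int i = 0; i < kApron; ++i)
            local[j][i] = src.fetch(tileX * kTile + i - 1, tileY * kTile + j - 1);
    for (int j = 0; j < kTile; ++j)
        for (int i = 0; i < kTile; ++i) {
            int sum = 0;
            for (int dy = 0; dy < 3; ++dy)
                for (int dx = 0; dx < 3; ++dx)
                    sum += local[j + dy][i + dx];
            out[j][i] = sum;
        }
}
```

For one 8x8 tile, blurTile performs 100 memory fetches (about 1.56 per output pixel), while calling blurGather for the same 64 pixels performs 576 (9 per pixel), and both produce identical sums.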

Also, compute shaders don't hard-code the destination / result location. You can write code that works like a normal PS, where each thread outputs to a specific pixel on the screen... or you can write code where each thread writes to a random pixel, or multiple pixels, or writes to a pixel using a dynamic offset (scattered writes).
e.g. the normal way of writing a blur is to "gather" (above, each destination pixel reads from 9 source pixels), but a "scatter" version would have every source pixel write to 9 destination pixels!
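The gather/scatter distinction can be shown with a small serial C++ sketch. The assumptions here are mine, not the original post's: an unnormalised 3x3 box sum, out-of-bounds neighbours skipped, and hypothetical names throughout. On a GPU, concurrent scattered adds to the same destination would additionally need atomics or some other synchronisation, which this serial loop sidesteps.

```cpp
#include <cassert>
#include <vector>

using Grid = std::vector<int>;  // row-major, w * h entries

// Gather: every destination pixel reads its (up to) 9 in-bounds source neighbours.
Grid blurGather(const Grid& src, int w, int h) {
    assert(static_cast<int>(src.size()) == w * h);
    Grid dst(src.size(), 0);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const int sx = x + dx, sy = y + dy;
                    if (sx >= 0 && sx < w && sy >= 0 && sy < h)
                        dst[y * w + x] += src[sy * w + sx];
                }
    return dst;
}

// Scatter: every source pixel adds itself to its (up to) 9 in-bounds destinations.
Grid blurScatter(const Grid& src, int w, int h) {
    assert(static_cast<int>(src.size()) == w * h);
    Grid dst(src.size(), 0);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const int tx = x + dx, ty = y + dy;
                    if (tx >= 0 && tx < w && ty >= 0 && ty < h)
                        dst[ty * w + tx] += src[y * w + x];
                }
    return dst;
}
```

Because the 3x3 neighbourhood is symmetric, both functions produce the same image. Only the direction of the memory traffic differs, and the scatter form is the one a pixel shader cannot express.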

When you first start using CS instead of PS, you get very small improvements (e.g. from skipping the IA->VS->RS costs)... but afterwards, many new doors open for completely different approaches to problems.

. 22 Racing Series .

This topic is closed to new replies.