
Indirect Draw For Particle System In DX12

Started by fighting_falcon93, April 29, 2020 07:17 PM
34 comments, last by NikiTo 4 years, 9 months ago

I'm working on a GPU based particle system in DX12. Each frame I run a compute shader that updates all the particles and appends the indices of the active particles into an index buffer. When it's time to render, I use this index buffer as a lookup table so that each particle can find its own data.

At least that's the plan. The problem is that I don't know how I should handle the indirect drawing. As I see it, I currently face 2 problems:

  1. If I use the UAV counter from the index buffer, then I will make a draw call with 1 vertex per particle. This means I'd need to use the geometry shader to expand this single vertex into a quad. And from what I've heard, the geometry shader is quite slow and preferably avoided.
  2. If I instead use a separate counter that I increase by 4 for each particle that should be rendered, then I can use modulo and division to create a quad in the vertex shader. But in order to prevent a data race between the threads in the compute shader, I will probably need some kind of atomic writes into this buffer, which I have no idea how to do.

So I'd really appreciate some guidance on what the best option here is, and if it's option 2, how I could prevent a data race when writing to the counter buffer. Or maybe there's a much better way to handle this and I'm just overcomplicating it?

You can cascade the push counter in order to make it thousands of times faster.
Use InterlockedAdd on a counter inside LDS, then use InterlockedAdd on a counter in VRAM.
You could use a cascade between threads too, because coding a single InterlockedAdd directly for a group of 1024 threads could kill performance.


Thank you for the reply NikiTo.

NikiTo said:
You can cascade the push counter in order to make it thousands of times faster. Use InterlockedAdd on a counter inside LDS, then use InterlockedAdd on a counter in VRAM.

So if I understood you correctly, I'd use 2 different counters: one for the index buffer itself (which gets increased by 1 and is located in LDS), and the other for the vertex count (which gets increased by 4 and is located in VRAM)?

What I don't understand is how I can manually put something in LDS rather than in VRAM.

NikiTo said:
You could use a cascade between threads too

Aha, you mean that first I would add up a “local” counter between just the 64 threads in each thread group, and then when all these groups are done (should I sync them?), I would use InterlockedAdd on each of those “local” counters?

Example for a workgroup of 1024:

Create a buffer in LDS (google how to do it) for the data. Create another buffer for counters.

Choose an arbitrary worksize, let's say 64 (you need to test various sizes here to see what works faster for your code).

Take the flattened thread ID of the thread and check if it produces a particle or discards it (I assume you discard some particles based on some criteria, otherwise your solution is rudimentary).

If a particle is accepted, increment the counters buffer in LDS at the position of that worksize chunk. And take the old value of the counter (the three-operand instruction gives you the old value in the same atomic operation) and write the pos.xy of the accepted particle at that position in the data buffer.

Then you merge those many small buffers in LDS into a bigger buffer in LDS. Then you merge the LDS buffer into the buffer in VRAM.
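To illustrate, here is a minimal HLSL sketch of that per-chunk push. All names and sizes are illustrative, and CullParticle is a hypothetical function standing in for your accept/discard criteria:

#define CHUNK_SIZE 64
#define NUM_CHUNKS 16 // 1024 / 64

groupshared uint chunkCounters[NUM_CHUNKS];            // one push counter per worksize chunk
groupshared float2 chunkData[NUM_CHUNKS * CHUNK_SIZE]; // pos.xy of the accepted particles

[numthreads(1024, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint flatId : SV_GroupIndex)
{
	if (flatId < NUM_CHUNKS)
		chunkCounters[flatId] = 0; // groupshared memory can not be initialized at declaration
	GroupMemoryBarrierWithGroupSync();

	float2 pos;
	bool accepted = CullParticle(dtid.x, pos); // hypothetical: accept or discard, output pos.xy

	if (accepted)
	{
		uint chunk = flatId / CHUNK_SIZE;
		uint slot;
		InterlockedAdd(chunkCounters[chunk], 1, slot); // the third operand returns the old value in the same atomic operation
		chunkData[chunk * CHUNK_SIZE + slot] = pos;
	}
	GroupMemoryBarrierWithGroupSync();

	// ... then merge the chunks into one bigger LDS buffer, and that one into VRAM
}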

Only the merging part:

You have a single uint counter in VRAM. You initialize it at 0 before you start.

You have a buffer inside VRAM.

You have another uint counter inside LDS. You initialize it at zero at the beginning of the shader.

And you have a buffer inside LDS.

You push data into the buffer in LDS and the counter in LDS grows.

At the end of the shader, you InterlockedAdd(UAVBuff[0], counterInLDS, oldValueOfCounterInVRAM)

Then in the shader, you copy counterInLDS elements of the buffer in LDS into VRAM, starting the writes at position oldValueOfCounterInVRAM.
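As a minimal HLSL sketch of just this merge (all resource and variable names are illustrative, and the pushes into the LDS buffer are elided):

#define GROUP_SIZE 1024

RWStructuredBuffer<uint>   UAVBuff : register(u0); // UAVBuff[0] is the single uint counter in VRAM, cleared to 0 beforehand
RWStructuredBuffer<float2> gOutput : register(u1); // the buffer inside VRAM

groupshared uint   counterInLDS;            // grows as the group pushes into ldsBuffer
groupshared float2 ldsBuffer[GROUP_SIZE];   // the buffer inside LDS
groupshared uint   oldValueOfCounterInVRAM; // write offset returned by the atomic

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint flatId : SV_GroupIndex)
{
	// ... initialize counterInLDS to 0, push data into ldsBuffer ...
	GroupMemoryBarrierWithGroupSync();

	// one thread reserves a range in the VRAM buffer with a single atomic
	if (flatId == 0)
		InterlockedAdd(UAVBuff[0], counterInLDS, oldValueOfCounterInVRAM);
	GroupMemoryBarrierWithGroupSync();

	// all threads cooperatively copy counterInLDS elements of the LDS buffer
	// into VRAM, starting the writes at position oldValueOfCounterInVRAM
	for (uint i = flatId; i < counterInLDS; i += GROUP_SIZE)
		gOutput[oldValueOfCounterInVRAM + i] = ldsBuffer[i];
}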

If you just generate particles, the same amount of particles every time, and you don't discard any, you need no counters, no pushes, not even an indirect draw.

I assume NikiTo's proposal is to fill a small buffer in LDS (group shared memory), also using atomics on LDS to increase the size counter - both are much faster than doing this from each thread to main memory directly.

But you could extend this idea to write full quads if that helps your mentioned problem. (Which i'm not sure about)

Some pseudo code, but i'm not familiar with HLSL:

#define WGS 256
layout (local_size_x = WGS) in; // workgroup size (HLSL would use [numthreads(WGS, 1, 1)])

shared int frustumParticles[WGS * 2]; // small LDS buffer of active particles that survived culling
shared int counter;
shared int videoMemoryIndex;

void main ()
{
	if (localThreadID == 0)
		counter = 0; // shared variables can not be initialized at declaration
	barrier();

	for (int i = 0; i < totalParticles / WGS; i++) // ignoring the range of a single workgroup here for simplicity
	{
		int particleIndex = i * WGS + localThreadID;

		bool active = CullParticle(particleIndex);
		if (active)
		{
			int index = atomicAdd(counter, 1); // old value = a unique slot in the LDS buffer
			frustumParticles[index] = particleIndex;
		}

		barrier();

		// half of the buffer is full? need to write a batch of particles to main video memory...

		if (localThreadID == 0 && counter >= WGS)
		{
			videoMemoryIndex = atomicAdd(bufferCounter, WGS * 4); // only one atomic write to video memory necessary to write 256 particles and 1024 vertices
		}

		barrier();

		if (counter >= WGS)
		{
			particleIndex = frustumParticles[localThreadID];
			int dstIndex = videoMemoryIndex + localThreadID * 4;

			indirectDrawVertexBuffer[dstIndex + 0] = ParticleVertex00(particleIndex);
			indirectDrawVertexBuffer[dstIndex + 1] = ParticleVertex01(particleIndex);
			indirectDrawVertexBuffer[dstIndex + 2] = ParticleVertex11(particleIndex);
			indirectDrawVertexBuffer[dstIndex + 3] = ParticleVertex10(particleIndex);

			frustumParticles[localThreadID] = frustumParticles[localThreadID + WGS]; // scroll back the LDS buffer
		}

		if (localThreadID == 0 && counter >= WGS)
		{
			atomicAdd(IndirectDrawVertexCount, WGS * 4); // grow the vertex count in the indirect draw arguments
			counter -= WGS; // scroll back the LDS counter
		}

		barrier();

	}

	// todo: write the remaining particles if counter > 0
}

The code does not address the fact that the particle count may not be an exact multiple of 256, but you get the idea.

And ofc. you can do such compaction much faster with subgroup functions, and the LDS buffer can be larger than just twice the workgroup size.
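For reference, such compaction with Shader Model 6.0 wave intrinsics could look like this HLSL sketch (the resource names and the CullParticle function are illustrative); it needs only one atomic per wave instead of one per thread:

RWStructuredBuffer<uint> gCounter     : register(u0); // gCounter[0] cleared to 0 each frame
RWStructuredBuffer<uint> gIndexBuffer : register(u1); // compacted list of surviving particle indices

[numthreads(256, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
	uint particleIndex = dtid.x;
	bool active = CullParticle(particleIndex);     // hypothetical culling function

	uint laneOffset = WavePrefixCountBits(active); // number of active lanes before this one
	uint waveTotal  = WaveActiveCountBits(active); // number of active lanes in the whole wave

	uint waveBase;
	if (WaveIsFirstLane())
		InterlockedAdd(gCounter[0], waveTotal, waveBase); // one atomic per wave
	waveBase = WaveReadLaneFirst(waveBase);        // broadcast the reserved offset to all lanes

	if (active)
		gIndexBuffer[waveBase + laneOffset] = particleIndex;
}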

BTW, i assume using GS would be faster on most current hardware than writing 4 vertices and a single index, but i don't know.

I could not read the code of JoeJ, my eyes are wasted with my own code.

But, a note - the LDS buffer can be of any size the LDS limit allows for. Up to 32KB (one DWORD less for the counter itself).

What kills the performance is two threads atomically writing to the same address at the same time. Inevitably there will be such situations, but breaking it into groups minimizes the impact.


@joej AFAIK the GS is fast for generating a small amount of output per amount of input. One vertex expanding into a quad should be pretty fast for the GS, AFAIK.

Thank you both for the replies.

I think that I understand the part about the counters and the group shared memory now, although I need to try to implement it before I can say for certain. Will follow up if any problems occur.

One thing that is still unclear is how I should write the counter in group shared memory back to VRAM. Would it be better to use an if-statement so that only one thread does the write, or would that cause a branch divergence problem? Or would it be better to simply let all threads write the same value to the same address?

Although, I don't think that I can store the indices in group shared memory if it's limited to 32KB. Because if I store the index of each particle that should be drawn, that would only allow me to store about 8000 indices, while the particle system itself might support millions of particles. Or did you mean that I should only store a part of the index buffer, and once a thread group is done, write to multiple addresses at once?

Regarding the performance:

My plan was to have one massive particle buffer with all the particle data, and then one index buffer consisting of only UINTs that describes which particles should be drawn. Each time I run the compute shader, I would rewrite the index buffer.

When it comes to the drawing, I didn't mean to bind any vertex data to the pipeline. I'd simply make the draw call with the right amount of vertices and then calculate the position of each vertex in the vertex shader with the help of SV_VertexID. Each set of 4 vertices would then correspond to 1 index and that index would correspond to 1 particle in the particle data buffer. Am I overcomplicating this or is this a good solution?
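For illustration, such a vertex shader might look like this sketch, where every name is illustrative and the billboard expansion is only a crude placeholder:

struct Particle { float3 pos; float size; };

StructuredBuffer<uint>     gIndexList : register(t0); // active particle indices written by the compute shader
StructuredBuffer<Particle> gParticles : register(t1); // the massive particle data buffer

cbuffer Camera : register(b0) { float4x4 gViewProj; };

struct VSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0; };

VSOut VSMain(uint vid : SV_VertexID)
{
	uint particleSlot = vid / 4; // 4 vertices per particle
	uint corner       = vid % 4; // which corner of the quad

	Particle p = gParticles[gIndexList[particleSlot]];

	// corner -> (-1,-1), (1,-1), (-1,1), (1,1)
	float2 offset = float2((corner & 1) ? 1.0 : -1.0,
	                       (corner & 2) ? 1.0 : -1.0);

	VSOut o;
	o.pos = mul(float4(p.pos, 1.0), gViewProj);
	o.pos.xy += offset * p.size * o.pos.w; // crude clip-space billboard, for illustration only
	o.uv = offset * 0.5 + 0.5;
	return o;
}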

For example, let's say that we're going to render 1 million particles. Would it be quicker to make the draw call with 1 million vertices and then use a geometry shader to expand these vertices into quads, or would it be quicker to make the draw call with 4 million vertices directly?

fighting_falcon93 said:
Would it be better to use an if-statement so that only one thread does the write, or would that cause a branch divergence problem? Or would it be better to simply let all threads write the same value to the same address?

IIRC the former is always faster, but it may depend on the GPU. (No worries about branches in general, especially if the enclosed code is short. Using branches to avoid unnecessary memory access is almost always worth it.)

fighting_falcon93 said:
Or did you mean that I should only store a part of the index buffer, and once a thread group is done, write to multiple addresses at once?

Yes exactly. In my example i write batches of 256 active particles after their actual number has grown larger than 256, then keep buffering until we have again more than 256, etc.
This should work well because all threads write at the same time, so switching to another wave in flight while waiting on this works well to hide memory latency. In contrast, having just a few threads writing occasionally all the time is expected to perform worse.
The main win may even be to have only one atomic to global memory instead of 256.
It depends on the GPU how much benefit there is from such practice. I've seen big benefit on GCN (2 x faster), but only little benefit on Kepler when i first tried this years ago.
(Note: Some GPUs have HW support for append buffers, but the Khronos APIs i'm used to do not expose append buffers at all, so i have no experience with that. Maybe all the extra work is not worth it, but i'd try it out.)

Notice my code is not ideal because the same thread processes 2 different particles: first it culls one, and then it generates vertices from another. It would be better to store the particle position in LDS as well, so the second load from video memory is not necessary.
But this needs more LDS and so could reduce occupancy. It is thus important to use profiling tools that tell you performance, occupancy, register and LDS usage for a given shader. Finally you would need to test on both NV and AMD, and eventually use different settings for different vendors, if you would want max perf.

fighting_falcon93 said:
For example, let's say that we're going to render 1 million particles. Would it be quicker to make the draw call with 1 million vertices and then use a geometry shader to expand these vertices into quads, or would it be quicker to make the draw call with 4 million vertices directly?

Try both on target HW and see what's faster. Unfortunately this is how optimization works. ; ) Even with growing experience, HW always changes, so i can never avoid trying multiple approaches.

You write to the counters in VRAM ONCE per merge. It is done by ONLY ONE thread (one per merge).

I would go for 1mln particles → GS → 4mln vertices.
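For reference, a point-to-quad geometry shader along those lines might look like this sketch (names are illustrative, and the billboard expansion is the same crude placeholder as above):

struct GSIn  { float3 center : POSITION; float size : SIZE; };
struct GSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0; };

cbuffer Camera : register(b0) { float4x4 gViewProj; };

[maxvertexcount(4)]
void GSMain(point GSIn input[1], inout TriangleStream<GSOut> stream)
{
	// triangle-strip corner order: (-1,-1), (-1,1), (1,-1), (1,1)
	const float2 corners[4] = { float2(-1,-1), float2(-1,1), float2(1,-1), float2(1,1) };

	[unroll]
	for (int i = 0; i < 4; i++)
	{
		GSOut o;
		o.pos = mul(float4(input[0].center, 1.0), gViewProj);
		o.pos.xy += corners[i] * input[0].size * o.pos.w; // crude clip-space billboard
		o.uv = corners[i] * 0.5 + 0.5;
		stream.Append(o);
	}
}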

Indices are fine, but maybe you will want to update the position of each particle between frames, so if you are going to write 3 floats to VRAM anyway, why write the index too? You already have a list with the surviving particles. Use that.

First make a simpler version work. Put your hands on it. This way you will master it. We can guide you, but you will learn by doing it.

