@joej to write to the counters, the fewer threads running simultaneously the better. To flush the data to the data buffer, the more threads running simultaneously the better.
I agree on IFs.
8K indices in LDS is probably too radical for most situations. I would go for a group of 1024 and an LDS buffer of 1024 too, maybe. But tests need to be done to find the best ratios.
Thank you both for the replies and I think that I understand now.
I have one more question. This thread got me thinking: do I actually need a separate index buffer at all? Would it maybe be possible to “defrag” the particle buffer each time I run the compute shader, so that all active particles end up on the left of the buffer and all inactive particles on the right?
Basically, each thread would read the data of one particle into a local struct and then update it. When updated, it would use a counter to write the local struct back into the particle buffer, but at a new index. Active particles would use a counter from the left, and inactive particles would use a counter from the right.
Do you think that this would work? What makes me sceptical is that we can't control the order in which thread groups are executed, right? So for example, if particle B gets read and updated before particle A has been read, then particle B's write could overwrite particle A, right?
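To make it concrete, here is roughly what I have in mind (just a sketch; Particle, ParticlesRW, CountersRW, MaxParticles and DeltaTime are made-up names). The in-place write at the end is exactly the part I'm unsure about:

```hlsl
// Sketch only - all resource/constant names here are placeholders.
struct Particle
{
    float3 position;
    float3 velocity;
    float  life;
};

RWStructuredBuffer<Particle> ParticlesRW : register(u0);
// CountersRW[0] counts active particles from the front,
// CountersRW[1] counts inactive particles from the back.
RWStructuredBuffer<uint> CountersRW : register(u1);

cbuffer Constants : register(b0)
{
    uint  MaxParticles;
    float DeltaTime;
};

[numthreads(256, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Read one particle into a local struct and update it.
    Particle p = ParticlesRW[dtid.x];
    p.position += p.velocity * DeltaTime;
    p.life     -= DeltaTime;

    // Pick a new slot: active particles fill up from the left,
    // inactive particles fill up from the right.
    uint newIndex;
    if (p.life > 0.0f)
    {
        InterlockedAdd(CountersRW[0], 1, newIndex);
    }
    else
    {
        uint fromBack;
        InterlockedAdd(CountersRW[1], 1, fromBack);
        newIndex = MaxParticles - 1 - fromBack;
    }

    // This in-place write is the part I'm worried about: another thread may not
    // have read ParticlesRW[newIndex] yet, so this could clobber its data.
    ParticlesRW[newIndex] = p;
}
```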
NikiTo said:
8K indices in LDS is probably too radical for most situations. I would go for a group of 1024 and an LDS buffer of 1024 too, maybe. But tests need to be done to find the best ratios.
You want to use large workgroups only if you really have to, because occupancy becomes worse. Usually 64, 128 or 256 threads are the sweet spots.
Particle processing means high bandwidth and low ALU, so having high occupancy is most important.
IIRC, on GCN you can allocate 8 32-bit values of LDS per thread without an occupancy penalty, but I may remember this wrong.
Good profiling tools report such limits, or suggest how reducing LDS / register usage could give better occupancy. So if you use them, it's not necessary to memorize such HW specs.
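For example, a 256-thread group with the 8 x 32-bit rule of thumb above comes out to 8 KB of LDS for the group (a minimal sketch with made-up names; check the real limits with your profiler, since they differ per GPU):

```hlsl
// 256 threads per group, 8 x 32-bit LDS values per thread:
// 256 * 8 * 4 bytes = 8 KB of LDS for the whole group.
#define GROUP_SIZE            256
#define LDS_UINTS_PER_THREAD  8

groupshared uint lds_Batch[GROUP_SIZE * LDS_UINTS_PER_THREAD];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    // Each thread owns its own 8-uint slice of the LDS batch.
    [unroll]
    for (uint i = 0; i < LDS_UINTS_PER_THREAD; ++i)
        lds_Batch[gtid.x * LDS_UINTS_PER_THREAD + i] = 0;
    GroupMemoryBarrierWithGroupSync();
}
```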
Also, I can't write the group shared counter value back into VRAM until all threads in all thread groups are finished, but how do I know when they are all finished?
The function GroupMemoryBarrierWithGroupSync waits for all threads within a group, but how do we know which group is the last one?
fighting_falcon93 said:
Do you think that this would work? What makes me sceptical is that we can't control the order in which thread groups are executed, right? So for example, if particle B gets read and updated before particle A has been read, then particle B's write could overwrite particle A, right?
This surely works, but it requires twice the memory so you can ping-pong from the previous to the next frame. The algorithm would be similar, just with a second LDS buffer for inactive particles. Then write active particles at the beginning of the (next frame's global) buffer, and write inactive particles at the end with decreasing index.
I think it would work with a single dispatch.
But it costs constant memory bandwidth for all particles - you copy them all every frame - so this loss might be bigger than the win of getting rid of the indirection. Not sure though - it may depend on how many particles are active and rendered on average.
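As a rough sketch of the ping-pong version (placeholder names, and the LDS batching discussed earlier left out for brevity) - the point is that reads come from last frame's buffer, so no write can ever clobber unread data:

```hlsl
// Sketch only - ParticlesPrev / ParticlesNext are swapped every frame.
struct Particle
{
    float3 position;
    float3 velocity;
    float  life;
};

StructuredBuffer<Particle>   ParticlesPrev : register(t0); // last frame, read only
RWStructuredBuffer<Particle> ParticlesNext : register(u0); // this frame, write only
RWStructuredBuffer<uint>     CountersRW    : register(u1); // [0] active, [1] inactive

cbuffer Constants : register(b0)
{
    uint  MaxParticles;
    float DeltaTime;
};

[numthreads(256, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Reading from the previous frame's buffer removes the race of the
    // single-buffer version: nothing here is overwritten while still unread.
    Particle p = ParticlesPrev[dtid.x];
    p.position += p.velocity * DeltaTime;
    p.life     -= DeltaTime;

    uint slot;
    if (p.life > 0.0f)
    {
        InterlockedAdd(CountersRW[0], 1, slot);            // active: from the front
    }
    else
    {
        uint fromBack;
        InterlockedAdd(CountersRW[1], 1, fromBack);
        slot = MaxParticles - 1 - fromBack;                // inactive: from the back
    }
    ParticlesNext[slot] = p;
}
```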
@fighting_falcon93 Read the comment where I explained it to you again. It is all said there and it works.
fighting_falcon93 said:
Also, I can't write the group shared counter value back into VRAM until all threads in all thread groups are finished, but how do I know when they are all finished?
The solution is: each thread group atomically increases the global counter (IndirectDrawVertexCount in my code).
So it is correct at any time, including after all groups are done. The final value is then simply the sum of all the stored batches.
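A sketch of that pattern (not the exact code from before; ParticlesIn, AliveIndices and the register assignments are placeholders):

```hlsl
struct Particle { float3 position; float3 velocity; float life; };

StructuredBuffer<Particle> ParticlesIn             : register(t0);
RWStructuredBuffer<uint>   AliveIndices            : register(u0); // compacted indices
RWStructuredBuffer<uint>   IndirectDrawVertexCount : register(u1); // global counter

groupshared uint lds_Count;   // how many particles this group outputs
groupshared uint lds_Offset;  // this group's start offset in AliveIndices

[numthreads(256, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    if (gtid.x == 0)
        lds_Count = 0;
    GroupMemoryBarrierWithGroupSync();

    // Each live particle reserves a slot inside the group's batch.
    bool alive = ParticlesIn[dtid.x].life > 0.0f;
    uint localSlot = 0;
    if (alive)
        InterlockedAdd(lds_Count, 1, localSlot);
    GroupMemoryBarrierWithGroupSync();

    // One thread per group adds the whole batch to the global counter.
    // Because the add is atomic, the global value is always the sum of the
    // batches flushed so far, so it is already correct when the dispatch
    // ends - no group ever needs to know whether it is the last one.
    if (gtid.x == 0)
        InterlockedAdd(IndirectDrawVertexCount[0], lds_Count, lds_Offset);
    GroupMemoryBarrierWithGroupSync();

    // Threads then scatter their indices into the globally reserved range.
    if (alive)
        AliveIndices[lds_Offset + localSlot] = dtid.x;
}
```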