
Indirect Draw For Particle System In DX12

Started by April 29, 2020 07:17 PM
34 comments, last by NikiTo 4 years, 9 months ago

@fighting_falcon93 What do you need the discarded particles for?!

(my battery is dying)

NikiTo said:
@JoeJ Also, no need for two buffers. A single buffer and two counters. The discarded particles are just pushed in from the opposite side. And it never overflows.

It is a very simple in-place sort algorithm, which always requires two buffers even if single threaded? (Otherwise you overwrite data before it has been processed.)

You can use one buffer twice as large ofc.

BTW, i use these separated lists at the beginning and end of a buffer a lot. To iterate all the data, simply do:

for (int i = backCount; i < frontCount + bufferSize; i++)
{
    int index = i & (bufferSize - 1);
}

So in our example we would start processing the inactive particles that had been written in decreasing order to the back of the buffer, and then switch to the active particles when i becomes ≥ bufferSize.
Ofc. bufferSize needs to be a power of 2.
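
To make the wrap-around concrete, here is a small worked example with assumed values (bufferSize = 8, a front list in slots 0..2 and a back list in slots 6..7):

const int bufferSize = 8;  // must be a power of 2
int backCount  = 6;        // first occupied slot of the back list
int frontCount = 3;        // number of elements in the front list

for (int i = backCount; i < frontCount + bufferSize; i++)
{
    int index = i & (bufferSize - 1);
    // i = 6, 7, 8, 9, 10  ->  index = 6, 7, 0, 1, 2
    // i.e. the back list first, then the front list, in one pass
}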


JoeJ said:
This surely works, but it requires twice the memory so you can ping pong from prev to next frame.

No, I meant with a single RWStructuredBuffer. I'm trying to avoid having 2 buffers since it would consume a lot of memory when the particle buffers need to be large.

Basically I meant:

uint readIndex;
InterlockedAdd(g_readIndex, 1, readIndex);

Particle particle;
particle = g_particleBuffer[readIndex];

particle.value1 = ...
particle.value2 = ...
particle.value3 = ...

uint writeIndex;
InterlockedAdd(g_writeIndex, 1, writeIndex);

g_particleBuffer[writeIndex] = particle;

But I'm assuming this wouldn't work since one thread might overwrite a particle that hasn't yet been read into a local variable by another thread, right? Or would that be fixable with some sort of sync barrier?

If it would work, would it actually be faster, or would it be better to just use an index buffer?

JoeJ said:
Each thread group atomically increases the global counter.

Hmm, let's see if I've understood this correctly:

groupshared uint g_count;

The above will create a uint counter that will be shared not only between threads in a group, but also between all groups in a dispatch.

(For simplicity I made it one counter now, but this can be expanded into several counters, that are then summed together).

And each time a group is finished, the first thread in that group will write the current value of g_count into the VRAM counter?

Have I understood this correctly?

NikiTo said:
What do you need the discarded particles for?!

There are 2 types of particles that I'm not interested in drawing: the dead ones and the culled ones. However, the culled ones still need to be updated, so they still need to be moved back into the buffer.

@JoeJ Definitely, one single buffer and two counters do the job. I coded it for my older shaders and it works; it passed super expensive testing where i visualize all the writes. Definitely, one buffer is enough.

@fighting_falcon93 You are right. You need to keep the culled but alive particles, because they might not be culled the next frame. My bad.

NikiTo said:
Definitely, one single buffer and two counters do the job.

But how does it work when you write the particles back into the buffer but at another index?

For example, say you update the particle at index 0 and determine that this particle is alive but culled. Then you write the particle back into the buffer, but at index 0 from the right side (i.e. the last element in the buffer). But what if that element holds another particle that hasn't been updated yet?

JoeJ said:
It is a very simple in place sort algorithm, which always requires two buffers even if single threaded?

That was bullshit - many in-place sort algos work by swapping pairs, so they don't need twice the memory.

But aside from that, i'm confused what you guys have in mind. I myself think of having a big array like [1,2,3,4] and a parallel task to sort it odd first, even second: [3,1,2,4] or [1,3,2,4] - both correct. An unordered parallel task can do this only with a second memory for the result.
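
A minimal sketch of what i mean by "a second memory for the result" - buffer names, the Particle layout and the update are just placeholders, and the bounds check is omitted:

struct Particle { float3 position; float3 velocity; float life; }; // assumed layout

RWStructuredBuffer<Particle> g_particlesIn  : register(u0); // last frame's particles
RWStructuredBuffer<Particle> g_particlesOut : register(u1); // compacted result for this frame
RWByteAddressBuffer          g_counters     : register(u2); // byte 0 = alive count

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    Particle p = g_particlesIn[dtid.x];  // safe: no thread ever writes g_particlesIn
    p.position += p.velocity;            // placeholder for the real update
    p.life -= 1.0f;

    if (p.life > 0.0f)                   // keep only the surviving particles
    {
        uint slot;
        g_counters.InterlockedAdd(0, 1, slot); // allocate a slot in the output
        g_particlesOut[slot] = p;              // unordered compaction, order does not matter
    }
}

// next frame the two buffers are swapped (ping pong)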

fighting_falcon93 said:
groupshared uint g_count; The above will create a uint counter that will be shared not only between threads in a group, but also between all groups in a dispatch.

No. Group shared is local on-chip memory per CU, and each thread group reserves a part of it that nothing else can see.
For global counters, you need to use a global buffer and atomics to global memory. (In my code example all undeclared memory is assumed to be global buffers.)
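
E.g. a per-group counter in LDS that gets flushed once per group to a global counter in VRAM would look roughly like this (just a sketch, all names made up):

groupshared uint lds_count;                       // visible only within ONE thread group
RWByteAddressBuffer g_globalCount : register(u0); // a global buffer, visible to the whole dispatch

[numthreads(64, 1, 1)]
void CSMain(uint flatThreadID : SV_GroupIndex)
{
    if (flatThreadID == 0)
        lds_count = 0;
    GroupMemoryBarrierWithGroupSync();

    // each thread that keeps its particle bumps the per-group counter
    InterlockedAdd(lds_count, 1);

    GroupMemoryBarrierWithGroupSync();

    // one single atomic to global memory per group, done by the first thread only
    if (flatThreadID == 0)
    {
        uint original;
        g_globalCount.InterlockedAdd(0, lds_count, original);
    }
}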


@joej Correct, in this case we don't care about the order of appearance. You missed us there.

I just wrote that chunk of code, and i feel bad deleting it now. So i will post it anyway, although it is not needed anymore, because the misunderstanding is now cleared up.

// visibleCounter, invisibleCounter and sameSingleLDS are assumed to be groupshared
if (flatThreadID == 0) {
    visibleCounter = 0;
    invisibleCounter = 1023;
}
GroupMemoryBarrierWithGroupSync(); // make sure the counters are initialized before any thread adds to them

if (particleIsDead == false) {
    int oldCounterValue;
    if (particleIsVisible == true) {
        InterlockedAdd(visibleCounter, 1, oldCounterValue);
    } else {
        InterlockedAdd(invisibleCounter, -1, oldCounterValue);
    }
    sameSingleLDS[oldCounterValue] = particleIndex;
}

Never used negative InterlockedAdd, but i need it for simplicity in this example. Never tested, but should work. The idea is what matters.

NikiTo said:
I just wrote that chunk of code, and i feel bad deleting it now. So i will post it anyway, although it is not needed anymore, because the misunderstanding is now cleared up.

You are cheating : )

Because you assume a small number of particles that fit into a single workgroup and its LDS, and also the number of cores to be equal to the size of the data, you do actually use twice the memory - just the second half is registers temporarily holding all the data, so overwriting the same buffer is ok. ; )

On the CPU side, i could use only a few registers to reproduce the same situation. So it is not cheating.
In this case, registers are not storage.

By the way, this must be slow. Sometimes when i first draft a shader, I am too lazy to cascade a totalSum, so i just use one InterlockedAdd and it is super mega slow. But i tolerate it while i draft the shader. Later i add cascading.

JoeJ said:
and also the number of cores to be equal to the size of the data


The same code with only 64 threads and an 8K buffer. Still works. No need for the buffer to be the same size as the group size.

if (flatThreadID == 0) {
    visibleCounter = 0;
    invisibleCounter = (8 * 1024) - 1;
}
GroupMemoryBarrierWithGroupSync(); // make sure the counters are initialized before any thread adds to them

if (flatThreadID < 64) {               //<<<<<<<<<<<<<<<<<<<<<
    for (int i = 0; i < 128; i++) {    //<<<<<<<<<<<<<<<<<<<<<
        // particleIndex and the dead/visible flags would be re-evaluated per iteration here
        if (particleIsDead == false) {
            int oldCounterValue;
            if (particleIsVisible == true) {
                InterlockedAdd(visibleCounter, 1, oldCounterValue);
            } else {
                InterlockedAdd(invisibleCounter, -1, oldCounterValue);
            }
            sameSingleLDS[oldCounterValue] = particleIndex;
        }
    }                                  //<<<<<<<<<<<<<<<<<<<<<
}                                      //<<<<<<<<<<<<<<<<<<<<<

Just as an example. The group declaration could limit it to (64,1,1).
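
For reference, that declaration would look like this (the entry point name is arbitrary):

[numthreads(64, 1, 1)]                 // 64 threads per group
void CSMain(uint flatThreadID : SV_GroupIndex)
{
    // each thread then loops over 128 particles, as in the snippet above
}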

NikiTo said:
The same code with only 64 threads and an 8K buffer. Still works. No need for the buffer to be the same size as the group size.

Disagree, but probably another misconception. To be clear, you'd need to show from where you read your data and how you index it. My assumption is 8K particles in an 8K buffer but nothing else. (Whether the buffer is in main memory or LDS does not matter for the example.)

