
Indirect Draw For Particle System In DX12

Started by April 29, 2020 07:17 PM
34 comments, last by NikiTo 4 years, 9 months ago

@fighting_falcon93 What do you need the discarded particles for?!

(my battery is dying)

NikiTo said:
@JoeJ Also, no need for two buffers. A single buffer and two counters. The discarded particles are just pushed in from the opposite side. And it never overflows.

It is a very simple in-place sort algorithm, which always requires two buffers even if single threaded? (Otherwise you overwrite data before it has been processed.)

You can use one buffer twice as large ofc.

BTW, i use these separated lists at the beginning and end of a buffer a lot. To iterate all the data, simply do:

for (int i = backCount; i < frontCount + bufferSize; i++)
{
    int index = i & (bufferSize - 1);
}

So in our example we would start processing the inactive particles that had been written in decreasing order to the back of the buffer, and then switch to the active particles when i becomes ≥ bufferSize.
Ofc. bufferSize needs to be a power of 2.
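
To make the wrap-around concrete, here is a small worked example with assumed values (bufferSize = 8, a front list in slots 0..2 and a back list in slots 6..7):

const int bufferSize = 8;  // must be a power of 2
int backCount  = 6;        // first occupied slot of the back list
int frontCount = 3;        // number of elements in the front list

for (int i = backCount; i < frontCount + bufferSize; i++)
{
    int index = i & (bufferSize - 1);
    // i = 6, 7, 8, 9, 10  ->  index = 6, 7, 0, 1, 2
    // i.e. the back list first, then the front list, in one pass
}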


JoeJ said:
This surely works, but it requires twice the memory so you can ping pong from prev to next frame.

No, I meant with a single RWStructuredBuffer. I'm trying to avoid having 2 buffers since it would consume a lot of memory when the particle buffers need to be large.

Basically I meant:

uint readIndex;
InterlockedAdd(g_readIndex, 1, readIndex);

Particle particle;
particle = g_particleBuffer[readIndex];

particle.value1 = ...
particle.value2 = ...
particle.value3 = ...

uint writeIndex;
InterlockedAdd(g_writeIndex, 1, writeIndex);

g_particleBuffer[writeIndex] = particle;

But I'm assuming this wouldn't work since one thread might overwrite a particle that hasn't yet been read into a local variable by another thread, right? Or would that be fixable with some sort of sync barrier?

If it would work, would it actually be faster, or would it be better to just use an index buffer?

JoeJ said:
Each thread group atomically increases the global counter.

Hmm, let's see if I've understood this correctly:

groupshared uint g_count;

The above will create a uint counter that will be shared not only between threads in a group, but also between all groups in a dispatch.

(For simplicity I made it one counter now, but this can be expanded into several counters, that are then summed together).

And each time a group is finished, the first thread in that group will write the current value of g_count into the VRAM counter?

Have I understood this correctly?

NikiTo said:
What do you need the discarded particles for?!

There are 2 types of particles that I'm not interested in drawing: the dead ones and the culled ones. However, the culled ones still need to be updated, so they still need to be moved back into the buffer.

@JoeJ Definitely, one single buffer and two counters do the job. I coded it for my older shaders and it works; it passed super expensive testing where i visualize all the writes. Definitely, one buffer is enough.

@fighting_falcon93 You are right. You need to keep the culled but alive particles, because they might not be culled the next frame. My bad.

NikiTo said:
Definitely, one single buffer and two counters do the job.

But how does it work when you write the particles back into the buffer but at another index?

For example, say you update the particle at index 0 and determine that this particle is alive but culled. Then you write the particle back into the buffer, but at index 0 from the right side (i.e. the last element in the buffer). But what if that element holds another particle that hasn't been updated yet?

JoeJ said:
It is a very simple in place sort algorithm, which always requires two buffers even if single threaded?

That was bullshit - many in-place sort algos work by swapping pairs, so they don't need twice the memory.

But aside from that, i'm confused what you guys have in mind. I myself think of having a big array like [1,2,3,4] and a parallel task to sort it odd first, even second: [3,1,2,4] or [1,3,2,4] - both correct. An unordered parallel task can do this only with a second memory for the result.
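
A minimal sketch of what i mean by "a second memory for the result" - buffer names, the Particle layout and the update are just placeholders, and the bounds check is omitted:

struct Particle { float3 position; float3 velocity; float life; }; // assumed layout

RWStructuredBuffer<Particle> g_particlesIn  : register(u0); // last frame's particles
RWStructuredBuffer<Particle> g_particlesOut : register(u1); // compacted result for this frame
RWByteAddressBuffer          g_counters     : register(u2); // byte 0 = alive count

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    Particle p = g_particlesIn[dtid.x];  // safe: no thread ever writes g_particlesIn
    p.position += p.velocity;            // placeholder for the real update
    p.life -= 1.0f;

    if (p.life > 0.0f)                   // keep only the surviving particles
    {
        uint slot;
        g_counters.InterlockedAdd(0, 1, slot); // allocate a slot in the output
        g_particlesOut[slot] = p;              // unordered compaction, order does not matter
    }
}

// next frame the two buffers are swapped (ping pong)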

fighting_falcon93 said:
groupshared uint g_count; The above will create a uint counter that will be shared not only between threads in a group, but also between all groups in a dispatch.

No. Group shared is local on-chip memory per CU, and each thread group reserves a part of it that nothing else can see.
For global counters, you need to use a global buffer and atomics to global memory. (In my code example all undeclared memory is assumed to be global buffers.)
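
E.g. a per-group counter in LDS that gets flushed once per group to a global counter in VRAM would look roughly like this (just a sketch, all names made up):

groupshared uint lds_count;                       // visible only within ONE thread group
RWByteAddressBuffer g_globalCount : register(u0); // a global buffer, visible to the whole dispatch

[numthreads(64, 1, 1)]
void CSMain(uint flatThreadID : SV_GroupIndex)
{
    if (flatThreadID == 0)
        lds_count = 0;
    GroupMemoryBarrierWithGroupSync();

    // each thread that keeps its particle bumps the per-group counter
    InterlockedAdd(lds_count, 1);

    GroupMemoryBarrierWithGroupSync();

    // one single atomic to global memory per group, done by the first thread only
    if (flatThreadID == 0)
    {
        uint original;
        g_globalCount.InterlockedAdd(0, lds_count, original);
    }
}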


@joej Correct, in this case we don't care about the order of appearance. You missed us there.

I just wrote that chunk of code, and i feel bad deleting it now. So i will post it anyway, although it is not needed anymore, because the misunderstanding is now cleared up.

// visibleCounter, invisibleCounter and sameSingleLDS are assumed to be groupshared
if (flatThreadID == 0) {
    visibleCounter = 0;
    invisibleCounter = 1023;
}
GroupMemoryBarrierWithGroupSync(); // make sure the counters are initialized before any thread adds to them

if (particleIsDead == false) {
    int oldCounterValue;
    if (particleIsVisible == true) {
        InterlockedAdd(visibleCounter, 1, oldCounterValue);
    } else {
        InterlockedAdd(invisibleCounter, -1, oldCounterValue);
    }
    sameSingleLDS[oldCounterValue] = particleIndex;
}

Never used negative InterlockedAdd, but i need it for simplicity in this example. Never tested, but should work. The idea is what matters.

NikiTo said:
I just wrote that chunk of code, and i feel bad deleting it now. So i will post it anyway, although it is not needed anymore, because the misunderstanding is now cleared up.

You are cheating : )

Because you assume a small number of particles that fit into a single workgroup and its LDS, and also the number of cores to be equal to the size of the data, you do actually use twice the memory - just the second half is registers temporarily holding all the data, so overwriting the same buffer is ok. ; )

On the CPU side, i could use only a few registers to reproduce the same situation. So it is not cheating.
In this case, registers are not storage.

By the way, this must be slow. Sometimes when i first draft a shader, I am too lazy to cascade a totalSum, so i just use one InterlockedAdd and it is super mega slow. But i tolerate it while i draft the shader. Later i add cascading.

JoeJ said:
and also the number of cores to be equal to the size of the data


The same code with only 64 threads and an 8K buffer. Still works. No need for the buffer to be the same size as the group size.

if (flatThreadID == 0) {
    visibleCounter = 0;
    invisibleCounter = (8 * 1024) - 1;
}
GroupMemoryBarrierWithGroupSync(); // make sure the counters are initialized before any thread adds to them

if (flatThreadID < 64) {               //<<<<<<<<<<<<<<<<<<<<<
    for (int i = 0; i < 128; i++) {    //<<<<<<<<<<<<<<<<<<<<<
        // particleIndex and the dead/visible flags would be re-evaluated per iteration here
        if (particleIsDead == false) {
            int oldCounterValue;
            if (particleIsVisible == true) {
                InterlockedAdd(visibleCounter, 1, oldCounterValue);
            } else {
                InterlockedAdd(invisibleCounter, -1, oldCounterValue);
            }
            sameSingleLDS[oldCounterValue] = particleIndex;
        }
    }                                  //<<<<<<<<<<<<<<<<<<<<<
}                                      //<<<<<<<<<<<<<<<<<<<<<

Just as an example. The group declaration could limit it to (64,1,1).
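
For reference, that declaration would look like this (the entry point name is arbitrary):

[numthreads(64, 1, 1)]                 // 64 threads per group
void CSMain(uint flatThreadID : SV_GroupIndex)
{
    // each thread then loops over 128 particles, as in the snippet above
}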

NikiTo said:
The same code with only 64 threads and an 8K buffer. Still works. No need for the buffer to be the same size as the group size.

Disagree, but probably another misconception. To be clear, you'd need to show from where you read your data and how you index it. My assumption is 8K particles in an 8K buffer but nothing else. (Whether the buffer is in main memory or LDS does not matter for the example.)

