@joej to write to the counters, the fewer threads running simultaneously the better. To flush the data to the data buffer, the more threads running simultaneously the better.
I agree on IFs.
8K indices in LDS is probably too radical for most situations. I would go for a group of 1024 and an LDS buffer of 1024 too, maybe. But tests need to be done to find the best ratios.
Thank you both for the replies and I think that I understand now.
I have one more question. This thread got me thinking: do I actually need a separate index buffer at all? Would it maybe be possible to “defrag” the particle buffer each time I run the compute shader, so that all active particles end up on the left of the buffer and all inactive particles on the right?
Basically, each thread would read the data of one particle into a local struct and then update it. When updated, it would use a counter to write the local struct back into the particle buffer, but at a new index. Active particles would use a counter from the left, and inactive particles would use a counter from the right.
Do you think that this would work? What makes me sceptical is that we can't control the order in which thread groups are executed, right? So for example, if particle B gets read and updated before particle A has been read, then particle B's write could overwrite particle A, right?
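To make it concrete, here is roughly what I have in mind (just a sketch; Particle, ParticlesRW, CountersRW, MaxParticles and DeltaTime are made-up names). The in-place write at the end is exactly the part I'm unsure about:

```hlsl
// Sketch only - all resource/constant names here are placeholders.
struct Particle
{
    float3 position;
    float3 velocity;
    float  life;
};

RWStructuredBuffer<Particle> ParticlesRW : register(u0);
// CountersRW[0] counts active particles from the front,
// CountersRW[1] counts inactive particles from the back.
RWStructuredBuffer<uint> CountersRW : register(u1);

cbuffer Constants : register(b0)
{
    uint  MaxParticles;
    float DeltaTime;
};

[numthreads(256, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Read one particle into a local struct and update it.
    Particle p = ParticlesRW[dtid.x];
    p.position += p.velocity * DeltaTime;
    p.life     -= DeltaTime;

    // Pick a new slot: active particles fill up from the left,
    // inactive particles fill up from the right.
    uint newIndex;
    if (p.life > 0.0f)
    {
        InterlockedAdd(CountersRW[0], 1, newIndex);
    }
    else
    {
        uint fromBack;
        InterlockedAdd(CountersRW[1], 1, fromBack);
        newIndex = MaxParticles - 1 - fromBack;
    }

    // This in-place write is the part I'm worried about: another thread may not
    // have read ParticlesRW[newIndex] yet, so this could clobber its data.
    ParticlesRW[newIndex] = p;
}
```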
NikiTo said:
8K indices in LDS is probably too radical for most situations. I would go for a group of 1024 and an LDS buffer of 1024 too, maybe. But tests need to be done to find the best ratios.
You want to use large workgroups only if you really have to, because occupancy becomes worse. Usually 64, 128 or 256 threads are the sweet spots.
Particle processing means high bandwidth and low ALU, so having high occupancy is most important.
IIRC, on GCN you can allocate 8 32-bit values of LDS per thread without an occupancy penalty, but I may remember this wrong.
Good profiling tools report such limits, or suggest how reducing LDS / register usage could give better occupancy. So if you use them, it's not necessary to memorize such HW specs.
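For example, a 256-thread group with the 8 x 32-bit rule of thumb above comes out to 8 KB of LDS for the group (a minimal sketch with made-up names; check the real limits with your profiler, since they differ per GPU):

```hlsl
// 256 threads per group, 8 x 32-bit LDS values per thread:
// 256 * 8 * 4 bytes = 8 KB of LDS for the whole group.
#define GROUP_SIZE            256
#define LDS_UINTS_PER_THREAD  8

groupshared uint lds_Batch[GROUP_SIZE * LDS_UINTS_PER_THREAD];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID)
{
    // Each thread owns its own 8-uint slice of the LDS batch.
    [unroll]
    for (uint i = 0; i < LDS_UINTS_PER_THREAD; ++i)
        lds_Batch[gtid.x * LDS_UINTS_PER_THREAD + i] = 0;
    GroupMemoryBarrierWithGroupSync();
}
```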
Also, I can't write the group shared counter value back into VRAM until all threads in all thread groups are finished, but how do I know when they are all finished?
The function GroupMemoryBarrierWithGroupSync waits for all threads within a group, but how do we know which group is the last one?
fighting_falcon93 said:
Do you think that this would work? What makes me sceptical is that we can't control the order in which thread groups are executed, right? So for example, if particle B gets read and updated before particle A has been read, then particle B's write could overwrite particle A, right?
This surely works, but it requires twice the memory so you can ping-pong from the previous to the next frame. The algorithm would be similar, just with a second LDS buffer for inactive particles. Then write active particles at the beginning of the (next frame's global) buffer, and write inactive particles at the end with decreasing index.
I think it would work with a single dispatch.
But it costs constant memory bandwidth for all particles - you copy them all every frame - so this loss might be bigger than the win of getting rid of the indirection. Not sure though - it may depend on how many particles are active and rendered on average.
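As a rough sketch of the ping-pong version (placeholder names, and the LDS batching discussed earlier left out for brevity) - the point is that reads come from last frame's buffer, so no write can ever clobber unread data:

```hlsl
// Sketch only - ParticlesPrev / ParticlesNext are swapped every frame.
struct Particle
{
    float3 position;
    float3 velocity;
    float  life;
};

StructuredBuffer<Particle>   ParticlesPrev : register(t0); // last frame, read only
RWStructuredBuffer<Particle> ParticlesNext : register(u0); // this frame, write only
RWStructuredBuffer<uint>     CountersRW    : register(u1); // [0] active, [1] inactive

cbuffer Constants : register(b0)
{
    uint  MaxParticles;
    float DeltaTime;
};

[numthreads(256, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Reading from the previous frame's buffer removes the race of the
    // single-buffer version: nothing here is overwritten while still unread.
    Particle p = ParticlesPrev[dtid.x];
    p.position += p.velocity * DeltaTime;
    p.life     -= DeltaTime;

    uint slot;
    if (p.life > 0.0f)
    {
        InterlockedAdd(CountersRW[0], 1, slot);            // active: from the front
    }
    else
    {
        uint fromBack;
        InterlockedAdd(CountersRW[1], 1, fromBack);
        slot = MaxParticles - 1 - fromBack;                // inactive: from the back
    }
    ParticlesNext[slot] = p;
}
```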
@fighting_falcon93 Read the comment where I explained it to you again. It is all said there and it works.
fighting_falcon93 said:
Also, I can't write the group shared counter value back into VRAM until all threads in all thread groups are finished, but how do I know when they are all finished?
The solution is: each thread group atomically increases the global counter (IndirectDrawVertexCount in my code).
So it is correct at any time, including after all groups are done. The final value is then simply the sum of all the stored batches.
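A sketch of that pattern (not the exact code from before; ParticlesIn, AliveIndices and the register assignments are placeholders):

```hlsl
struct Particle { float3 position; float3 velocity; float life; };

StructuredBuffer<Particle> ParticlesIn             : register(t0);
RWStructuredBuffer<uint>   AliveIndices            : register(u0); // compacted indices
RWStructuredBuffer<uint>   IndirectDrawVertexCount : register(u1); // global counter

groupshared uint lds_Count;   // how many particles this group outputs
groupshared uint lds_Offset;  // this group's start offset in AliveIndices

[numthreads(256, 1, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    if (gtid.x == 0)
        lds_Count = 0;
    GroupMemoryBarrierWithGroupSync();

    // Each live particle reserves a slot inside the group's batch.
    bool alive = ParticlesIn[dtid.x].life > 0.0f;
    uint localSlot = 0;
    if (alive)
        InterlockedAdd(lds_Count, 1, localSlot);
    GroupMemoryBarrierWithGroupSync();

    // One thread per group adds the whole batch to the global counter.
    // Because the add is atomic, the global value is always the sum of the
    // batches flushed so far, so it is already correct when the dispatch
    // ends - no group ever needs to know whether it is the last one.
    if (gtid.x == 0)
        InterlockedAdd(IndirectDrawVertexCount[0], lds_Count, lds_Offset);
    GroupMemoryBarrierWithGroupSync();

    // Threads then scatter their indices into the globally reserved range.
    if (alive)
        AliveIndices[lds_Offset + localSlot] = dtid.x;
}
```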