Indirect Draw For Particle System In DX12

Started by April 29, 2020 07:17 PM
34 comments, last by NikiTo 4 years, 9 months ago

JoeJ said:
To be clear, you'd need to show from where you read your data and how you index it.

particle.xyzw = bufferInVRAM[(i * 64) + flatThreadID];

No problem - all the data is parsed and pushed. You can use 32 threads too.

It is just an example to show you it works. GPUs hate large loops, so it will be super slow, but it works to prove my point.
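For readers following the indexing above, here is a minimal CPU-side sketch of which buffer element each thread of a 64-wide group touches on each loop iteration. It is not the poster's shader; `kGroupSize` and `visitOrder` are hypothetical names, and the sketch assumes the particle count is a multiple of 64.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical group width, matching the 64 in the post's index expression.
constexpr std::size_t kGroupSize = 64;

// Returns the buffer index read by every (iteration, thread) pair, in order,
// mirroring bufferInVRAM[(i * 64) + flatThreadID].
std::vector<std::size_t> visitOrder(std::size_t particleCount) {
    assert(particleCount % kGroupSize == 0); // assumption: whole chunks only
    std::vector<std::size_t> order;
    order.reserve(particleCount);
    const std::size_t iterations = particleCount / kGroupSize;
    for (std::size_t i = 0; i < iterations; ++i) {
        // On the GPU these 64 reads happen in parallel, one per thread.
        for (std::size_t flatThreadID = 0; flatThreadID < kGroupSize; ++flatThreadID) {
            order.push_back(i * kGroupSize + flatThreadID);
        }
    }
    return order;
}
```

In this model, neighbouring threads read neighbouring elements and every element is visited exactly once, but a single group needs particleCount / 64 iterations, which is the large loop the post warns about.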

I get you, but you still use an LDS buffer large enough to store indices for all particles, which becomes the ‘second buffer’ and so you also agree to my claim of doubling the memory requirement.

So, to defrag particles I see no reasonable alternative to having 2 (or one double-sized) buffers.
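The two-buffer approach can be sketched on the CPU as follows. This is only an illustration under stated assumptions, not code from the thread: `Particle`, its `alive` flag, and `compact` are hypothetical names, and the serial `writeIndex` stands in for what would be an atomic counter or a prefix sum on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle layout: a position plus an alive flag.
struct Particle {
    float x, y, z, w;
    bool alive;
};

// Copies the live particles from src to the front of dst and returns how
// many were copied. dst is the "second buffer" that doubles the memory use.
std::size_t compact(const std::vector<Particle>& src, std::vector<Particle>& dst) {
    assert(dst.size() >= src.size()); // dst must be able to hold every particle
    std::size_t writeIndex = 0;       // GPU version: atomic counter or prefix sum
    for (const Particle& p : src) {
        if (p.alive) {
            dst[writeIndex++] = p;
        }
    }
    return writeIndex;
}
```

Each particle is read once and written at most once, so the work is a single pass, at the cost of the extra destination buffer. This serial version keeps the original order; a GPU version using an atomic counter generally would not.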

A swapping pairs algorithm like bitonic sort could avoid this, but would need multiple iterations over all particles and dispatch + barrier for each iteration, so that's surely a big loss for performance.
Maybe it would be worth it if sorting by depth is necessary anyway, which reminds me of this paper: https://de.slideshare.net/DevCentralAMD/holy-smoke-faster-particle-rendering-using-direct-compute-by-gareth-thomas
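To make the cost of the in-place alternative concrete, here is a minimal CPU-side sketch of a bitonic sort that counts its passes. It is an illustration only, not code from the thread or from the linked slides: `bitonicSort` is a hypothetical name, the keys are assumed to be depth values, and the count is assumed to be a power of two. Each pass corresponds to what would be one dispatch plus barrier on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts keys ascending in place using only pair swaps (no second buffer).
// Returns the number of passes over the whole array.
std::size_t bitonicSort(std::vector<float>& keys) {
    const std::size_t n = keys.size();
    assert(n != 0 && (n & (n - 1)) == 0); // assumption: power-of-two size
    std::size_t passes = 0;
    for (std::size_t k = 2; k <= n; k *= 2) {         // size of bitonic runs
        for (std::size_t j = k / 2; j > 0; j /= 2) {  // compare distance
            // One pass: every i is independent, so each could be a GPU thread.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((keys[i] > keys[partner]) == ascending) {
                        std::swap(keys[i], keys[partner]);
                    }
                }
            }
            ++passes;
        }
    }
    return passes;
}
```

For n = 2^m elements the pass count is m(m+1)/2, so about a million particles (m = 20) would take 210 passes over all particles, compared with the single pass of a two-buffer compaction.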

JoeJ said:
I get you, but you still use an LDS buffer large enough to store indices for all particles, which becomes the ‘second buffer’ and so you also agree to my claim of doubling the memory requirement.

No no. The VRAM buffer has 4bln particles.

JoeJ said:
Maybe it would be worth it if sorting by depth is necessary anyway, which reminds me of this paper:

I'm gonna reinvent that well….

NikiTo said:
No no. The VRAM buffer has 4bln particles.

So you claim you have an 8bln-sized buffer, containing 8bln particles, randomly active or inactive, and you can defrag this with a time complexity of O(1) per particle, without a need for other memory (except LDS).

Then I would like to see the full algorithm please. I'd certainly learn from this.

The trip from VRAM to the CU is a huge slowdown. Just compare a shader that reads 4bln pixels and operates on them one by one with a shader that eats chunks of 8K… HUGEEEEE difference!!!

No program is about just sorting. No program is about just adding two arrays. No program is so simple. The real speedup comes from completely re-designing an algorithm to live comfortably inside the GPU.
