JoeJ said:
Let's say we have 100 characters all of the same mesh, but each has a different pose, and we pre-transform all the vertices. Would instancing then still be some potential win, just because the mesh topology is the same? If so, why?
You broke the base premise in the first sentence - the assumption is that we have thousands of soldiers on screen (which is common in large-scale battles, as in the Total War games).
Each of the soldiers is a deformable mesh - but a significant number of them will be in the same pose and configuration. For each such pose it makes sense to compute the skinned geometry only once and draw instances of the soldier. In this case instancing is clearly a potential win - it reduces the total amount of memory needed, the total amount of memory written/read, and the total number of flops spent computing the deformed geometry.
If the geometry is different, for example 2 dogs where one is walking and the other is eating, then instancing is not possible for deformed geometry and RT (unless you use a variant of the solution that I proposed).
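To make the sharing argument concrete, here is a minimal sketch in Python. All names and numbers (plan_skinning, 5000 vertices per mesh, 4 shared animation frames) are made up for illustration, not taken from any engine. It only counts how many vertices need to be skinned when objects with an identical (mesh, pose) pair share one skinned copy.

```python
from collections import defaultdict

def plan_skinning(objects, verts_per_mesh):
    """Group objects by (mesh, pose) so each unique pose is skinned only once.

    objects: iterable of (object_id, mesh_name, pose_key) tuples (hypothetical layout).
    Returns the groups plus skinned vertex counts with and without sharing.
    """
    groups = defaultdict(list)
    for object_id, mesh, pose in objects:
        groups[(mesh, pose)].append(object_id)
    shared_verts = len(groups) * verts_per_mesh   # one skinned copy per unique pose
    naive_verts = len(objects) * verts_per_mesh   # one skinned copy per object
    return groups, shared_verts, naive_verts

# Assumption: 1000 soldiers marching in lockstep over only 4 distinct animation frames.
soldiers = [(i, "soldier", f"march_frame_{i % 4}") for i in range(1000)]
groups, shared_verts, naive_verts = plan_skinning(soldiers, verts_per_mesh=5000)
print(len(groups), shared_verts, naive_verts)  # 4 20000 5000000

# Two dogs in different poses share nothing, so each needs its own skinned copy.
dogs = [(0, "dog", "walking"), (1, "dog", "eating")]
dog_groups, dog_shared, dog_naive = plan_skinning(dogs, verts_per_mesh=5000)
print(len(dog_groups), dog_shared == dog_naive)  # 2 True
```

Each group would then be drawn (or referenced in the acceleration structure) as instances of its single skinned copy.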
JoeJ said:
It's probably not practical, and thus we might never see RT support for tessellation or mesh shaders. Although MS mentioned this as a potential future feature.
For tessellation shaders there might be a way - but there would be limitations on what you can do within the tessellation shader, and some additional inputs might be needed. It's either a dynamic-LOD solution, which needs on-demand BVH building, or a precomputed BVH (which kind of kills the purpose of dynamic tessellation though).
On-demand building requires caching the result and will inevitably lead to stalls. That can be fine for non-real-time rendering, but for interactive/real time … nah.
Mesh shader support is even more complicated. A mesh shader is essentially a compute shader, so you could write its output to a buffer and then use that - which is basically what everyone ray tracing dynamic geometry does now. The big problem is that it works on meshlets. These do not map well to high-quality acceleration structure leaves, which inevitably leads to (sometimes very significant) performance drops.
JoeJ said:
Let's say we generate a small patch of geometry once a ray hits its bounding box.
…
JoeJ said:
Do we need to build a mini BVH as well? Otherwise we have to brute force over all triangles.
It is necessary to build an acceleration structure (on demand in this case), unless we operate on only a handful of triangles (and even then, if they are scattered over a large area, it will cause a massive increase in ray intersection tests). Bad for performance.
This BVH could be cached, BUT that doesn't make it a good solution: other rays that reach the same node during traversal have to wait until the build is finished (i.e. stalls).
JoeJ said:
How do we cache this geometry so it is still available when other rays hit the box later?
There is only one way to cache it - in memory. Yeah… that means adding memory latency to all operations when building it. Relying on the cache alone would only work for the current group of rays. Building an acceleration structure is a complex operation, you don't want to repeat that “per-workgroup”.
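A toy count of the difference, as a sketch only: count_builds and the hit lists are hypothetical, and real traversal is of course nothing like a Python loop. Each inner list holds the patch ids hit by the rays of one workgroup, and every cache miss stands for one on-demand build.

```python
def count_builds(workgroups, shared):
    """Count on-demand BVH builds for patches hit by rays, grouped per workgroup.

    shared=True  models a cache in memory that survives across workgroups.
    shared=False models a cache that only lives as long as one workgroup.
    """
    builds = 0
    memory_cache = set()
    for hits in workgroups:
        # A per-workgroup cache starts empty for every workgroup.
        cache = memory_cache if shared else set()
        for patch in hits:
            if patch not in cache:
                cache.add(patch)  # stands in for the expensive build
                builds += 1
    return builds

# Assumption: three workgroups whose rays hit overlapping patches 0, 1 and 2.
workgroups = [[0, 1, 1], [1, 2], [0, 2, 2]]
print(count_builds(workgroups, shared=True))   # 3
print(count_builds(workgroups, shared=False))  # 6
```

The in-memory variant builds each patch once, but every lookup and build then pays memory latency, which is the trade-off described above.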
JoeJ said:
How can we put the ray on pause until the generation and building is completed?
Stalls… Even if you swapped out the current ray on the threads of a workgroup and traversed a different ray instead (which itself is crazy - it would end up in massive register pressure due to having the traversal stack and other local variables stored multiple times per thread), nothing guarantees that the new ray won't intersect the same node in the BVH.
Therefore you would need to wait.
JoeJ said:
That's all really complicated i think. I guess it's better to pre-transform/build everything, but using LOD to keep memory and workload bounded. This way we solve the caching problem with long termed reuse in mind, and avoid wasting resources to do it for every frame or not at all.
Which is why I still use the approach with additional buffers. Deformable geometry has input geometry and deformation data, and N output buffers (one per possible instance of the deformable geometry). Also, culling won't help you here at all - so you need a buffer for every single object in the scene that uses the given deformable geometry.
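A rough sketch of what that layout costs in memory. The DeformableMesh class, its field names and all sizes below are assumptions for illustration; how the deformation data is actually stored will differ per engine.

```python
from dataclasses import dataclass

@dataclass
class DeformableMesh:
    vertex_count: int
    bytes_per_vertex: int
    deformation_bytes: int   # e.g. bone matrices; treated as one block here (assumption)
    instance_count: int      # N: every object in the scene, visible or not

    def input_bytes(self) -> int:
        # Shared input geometry plus deformation data.
        return self.vertex_count * self.bytes_per_vertex + self.deformation_bytes

    def output_bytes(self) -> int:
        # One deformed output buffer per object, since culling does not help here.
        return self.instance_count * self.vertex_count * self.bytes_per_vertex

    def total_bytes(self) -> int:
        return self.input_bytes() + self.output_bytes()

# Assumption: 5000 vertices at 32 bytes each, 4096 bytes of deformation data, 100 objects.
soldier = DeformableMesh(vertex_count=5000, bytes_per_vertex=32,
                         deformation_bytes=4096, instance_count=100)
print(soldier.input_bytes(), soldier.output_bytes(), soldier.total_bytes())
# 164096 16000000 16164096
```

The output buffers dominate, which is why the object count and LOD matter far more than the size of the input data.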