
Mesh skinning through compute shader - Vertex buffer binding and draw calls

Started by darkengines, January 10, 2024 05:11 PM
7 comments, last by Vilem Otte 10 months ago

Hi,

I want to implement mesh skinning in my custom engine, and I would like to go with the compute shader approach.

As I understand it, for a single mesh the original vertex buffer is taken as input to the CS, which writes the skinned vertices to an additional output SSBO or VBO that is then used as the vertex buffer for subsequent rendering tasks.
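Roughly like this, I assume - a minimal Vulkan-style GLSL sketch of such a skinning pass (linear blend skinning with four bones per vertex; all buffer layouts and names here are just placeholders):

```glsl
#version 450
layout(local_size_x = 64) in;

struct Vertex {
    vec4 position; // xyz = position, w unused (keeps a simple std430 stride)
    vec4 normal;   // xyz = normal,   w unused
};

layout(std430, binding = 0) readonly  buffer SrcVertices  { Vertex srcVerts[]; };
layout(std430, binding = 1) readonly  buffer BoneIndices  { uvec4  boneIds[];  };
layout(std430, binding = 2) readonly  buffer BoneWeights  { vec4   weights[];  };
layout(std430, binding = 3) readonly  buffer BoneMatrices { mat4   bones[];    };
layout(std430, binding = 4) writeonly buffer DstVertices  { Vertex dstVerts[]; };

layout(push_constant) uniform Params { uint vertexCount; } params;

void main() {
    uint i = gl_GlobalInvocationID.x;
    if (i >= params.vertexCount) return;

    uvec4 ids = boneIds[i];
    vec4  w   = weights[i];

    // Linear blend skinning: weighted sum of the four bone transforms.
    mat4 skin = bones[ids.x] * w.x
              + bones[ids.y] * w.y
              + bones[ids.z] * w.z
              + bones[ids.w] * w.w;

    dstVerts[i].position = skin * vec4(srcVerts[i].position.xyz, 1.0);
    dstVerts[i].normal   = vec4(normalize(mat3(skin) * srcVerts[i].normal.xyz), 0.0);
}
```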

Does that mean that one needs one additional vertex buffer range per tuple (Mesh, Instance, AnimationState)?

Does that mean that one cannot take advantage of GPU instancing, since the vertex buffer offset is now different for each tuple?

TLDR

Do I need to allocate vertex space for each tuple (Mesh, Instance, AnimationState) when using a compute shader for mesh skinning?


darkengines said:
Does that mean that one needs one additional vertex buffer range per tuple (Mesh, Instance, AnimationState)?

yes.

darkengines said:
Does that mean that one could not take advantage of the GPU instancing since vertex buffer offset is now different for each tuple?

yes.

But because all data is already on the GPU, and you probably use indirect draw calls generated on the GPU as well, there should be no further performance advantage from using instancing. So I would not worry, but I lack the experience to be sure.

darkengines said:
Do i need to allocate vertex space for each tuple (mesh,instance,animationState) when using compute shader for mesh skinning?

yes.

CS skinning has higher memory cost, but offers some advantages:
We do skinning only once, which might matter if we do multiple passes of the same geometry.
We can do fine grained frustum and occlusion culling on the already transformed geometry.
We can build a BVH for raytracing. (Afaik, pre-transforming vertices is the only way to use RT with skinned meshes.)
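To make that memory cost concrete: in practice you would probably sub-allocate ranges from one big output SSBO rather than create a separate VBO per tuple. A hypothetical sketch of the addressing only (skinning math left out, every name here is made up):

```glsl
#version 450
layout(local_size_x = 64) in;

// All tuples share one source and one destination SSBO; a
// (Mesh, Instance, AnimationState) tuple is just a pair of base offsets.
layout(std430, binding = 0) readonly  buffer Src { vec4 srcPos[]; };
layout(std430, binding = 1) writeonly buffer Dst { vec4 dstPos[]; };

layout(push_constant) uniform Params {
    uint vertexCount; // vertex count of the mesh being skinned
    uint srcBase;     // where the mesh's rest-pose vertices start in Src
    uint dstBase;     // where this tuple's output range starts in Dst
} params;

void main() {
    uint i = gl_GlobalInvocationID.x;
    if (i >= params.vertexCount) return;
    // The actual bone blending goes here; for the layout question only the
    // base offsets matter, so this just copies the rest pose through.
    dstPos[params.dstBase + i] = srcPos[params.srcBase + i];
}
```

Each tuple's draw then uses its dstBase as the baseVertex/vertexOffset of an indexed draw, so the index buffer stays shared between all tuples of the same mesh.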


JoeJ said:
But because all data is already on GPU, and you probably use indirect draw calls generated on GPU as well, there should be no more performance advantage from using instancing. So i would not worry, but i lack experience to be sure.

This is not entirely true. You can still use instancing. With indirect drawing you do have an instance count field in the buffer.

Why would you want to use that? There are quite a few use cases - multiple objects having exactly the same state (animation ID and frame) of skinned geometry. An easy example would be large armies, like the battles in the Total War games.
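To illustrate, a hypothetical command-generation kernel, assuming the engine groups draws by (Mesh, AnimationState) so that all copies in the same pose collapse into one instanced draw (the Group layout and every name here are made up):

```glsl
#version 450
layout(local_size_x = 64) in;

// One indexed indirect command per (Mesh, AnimationState) group. The field
// order matches GL's DrawElementsIndirectCommand and Vulkan's
// VkDrawIndexedIndirectCommand.
struct DrawCmd {
    uint indexCount;
    uint instanceCount; // > 1 whenever several objects share the exact same pose
    uint firstIndex;
    int  baseVertex;    // start of this group's skinned vertex range
    uint baseInstance;  // start of this group's per-instance data (world matrices)
};

struct Group {
    uint indexCount;
    uint firstIndex;
    uint skinnedBaseVertex; // where this group's skinned output was written
    uint firstInstance;
    uint instanceCount;     // how many objects currently share this (mesh, pose)
};

layout(std430, binding = 0) readonly  buffer Groups   { Group   groups[]; };
layout(std430, binding = 1) writeonly buffer DrawCmds { DrawCmd cmds[];   };

layout(push_constant) uniform Params { uint groupCount; } params;

void main() {
    uint g = gl_GlobalInvocationID.x;
    if (g >= params.groupCount) return;
    cmds[g].indexCount    = groups[g].indexCount;
    cmds[g].instanceCount = groups[g].instanceCount; // skin once, draw many
    cmds[g].firstIndex    = groups[g].firstIndex;
    cmds[g].baseVertex    = int(groups[g].skinnedBaseVertex);
    cmds[g].baseInstance  = groups[g].firstInstance;
}
```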

JoeJ said:
We do skinning only once, which might matter if we do multiple passes of the same geometry.

Which is generally done everywhere shadow mapping techniques are used. Not to mention reflection passes, voxelization, etc.

JoeJ said:
We can do fine grained frustum and occlusion culling on the already transformed geometry.

While this sounds good, you still want to cull against bounding geometry only.
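For example, a minimal sketch of testing per-object bounding spheres against the frustum and zeroing the instanceCount word of the corresponding indirect command (assuming the standard five-word indexed indirect command layout; names are mine):

```glsl
#version 450
layout(local_size_x = 64) in;

layout(std430, binding = 0) readonly buffer Bounds {
    vec4 spheres[];        // xyz = world-space center, w = radius
};
layout(std430, binding = 1) buffer DrawCmds {
    uint cmds[];           // raw command words; instanceCount is word 1 of 5
};

layout(push_constant) uniform Params {
    vec4 frustumPlanes[6]; // plane equations with normals pointing inward
    uint objectCount;
} params;

void main() {
    uint i = gl_GlobalInvocationID.x;
    if (i >= params.objectCount) return;

    vec4 s = spheres[i];
    bool visible = true;
    for (int p = 0; p < 6; ++p) {
        // Sphere is culled if it lies fully on the outside of any plane.
        if (dot(params.frustumPlanes[p].xyz, s.xyz) + params.frustumPlanes[p].w < -s.w)
            visible = false;
    }
    cmds[i * 5u + 1u] = visible ? 1u : 0u; // instanceCount = 0 skips the draw
}
```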

JoeJ said:
We can build a BVH for raytracing. (Afaik, pre-transforming vertices is the only way to use RT with skinned meshes.)

That's one of the main reasons (and to actually perform the ray tracing - this works for both compute-based and hardware ray tracing). As for the second point … that's not entirely correct, although as long as you have the memory, you would probably prefer it.

What you could do is determine a bounding box encapsulating the whole animated sequence of a given geometry leaf, then use that to build the BVH. When intersecting the leaf, you would pre-transform the vertex data within that leaf with the bones for the given frame. Is that efficient, does that make sense? That's another question... although there are probably some use cases where it would.
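To sketch the first half of that idea (one thread per leaf folds every animation frame into one conservative box; the rigid one-bone-per-leaf binding is a simplification to keep it short, and all names are hypothetical):

```glsl
#version 450
layout(local_size_x = 64) in;

// Computes, per BVH leaf, an AABB that encapsulates the leaf's vertices over
// every frame of the animation, so a BVH built from these boxes stays valid
// for the whole sequence.
layout(std430, binding = 0) readonly  buffer RestVerts { vec4  restPos[]; };
layout(std430, binding = 1) readonly  buffer Leaves    { uvec4 leaf[];    }; // x = firstVert, y = vertCount, z = boneIndex
layout(std430, binding = 2) readonly  buffer Bones     { mat4  bones[];   }; // boneCount matrices per frame
layout(std430, binding = 3) writeonly buffer OutMin    { vec4  bbMin[];   };
layout(std430, binding = 4) writeonly buffer OutMax    { vec4  bbMax[];   };

layout(push_constant) uniform Params {
    uint leafCount;
    uint frameCount;
    uint boneCount;
} params;

void main() {
    uint l = gl_GlobalInvocationID.x;
    if (l >= params.leafCount) return;

    vec3 lo = vec3( 1e30);
    vec3 hi = vec3(-1e30);
    for (uint f = 0u; f < params.frameCount; ++f) {
        mat4 m = bones[f * params.boneCount + leaf[l].z]; // rigid-leaf simplification
        for (uint v = 0u; v < leaf[l].y; ++v) {
            vec3 p = (m * vec4(restPos[leaf[l].x + v].xyz, 1.0)).xyz;
            lo = min(lo, p);
            hi = max(hi, p);
        }
    }
    bbMin[l] = vec4(lo, 0.0);
    bbMax[l] = vec4(hi, 0.0);
}
```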

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Thank you for your clarifications, now I can start the implementation 🙂.


Vilem Otte said:
This is not entirely true. You can still use instancing. With indirect drawing you do have instance count field in buffer.

Let's say we have 100 characters all of the same mesh, but each has a different pose, and we pre-transform all the vertices.
Would instancing then still be a potential win, just because the mesh topology is the same? If so, why?

Vilem Otte said:
What you could do is that you could determine bounding box encapsulating whole animated sequence of given geometry leaf. Then use that to build BVH. When intersecting the leaf - you would pre-transform vertex data within a given leaf with bones for given frame.

It's probably not practical, and thus we might never see RT support for tessellation or mesh shaders, although MS mentioned this as a potential future feature.

Let's say we generate a small patch of geometry once a ray hits its bounding box.
Do we need to build a mini BVH as well? Otherwise we have to brute force over all triangles.
How do we cache this geometry so it is still available when other rays hit the box later?
How can we put the ray on pause until the generation and building is completed?

That's all really complicated, I think.
I guess it's better to pre-transform/build everything, but use LOD to keep memory and workload bounded.
This way we solve the caching problem with long-term reuse in mind, and avoid the two bad extremes of doing it every frame or not at all.

JoeJ said:
Let's say we have 100 characters all of the same mesh, but each has a different pose, and we pre-transform all the vertices. Would instancing then still be some potential win, just because the mesh topology is the same? If so, why?

You broke the base premise in the first sentence - the assumption is that we have thousands of soldiers on screen (which is common in large-scale battles like those in the Total War games).

Each of the soldiers is a deformable mesh, but there will be a significant number of them in the same pose and configuration. For each of those it makes sense to compute the skinned geometry only once and draw instances of the soldier. In this case instancing is clearly a potential win, as it reduces the total amount of memory needed, the total amount of memory written/read, and the total number of flops spent computing the deformed geometry.

If the geometry is different, e.g. two dogs, one walking and another eating → instancing is not possible for the deformed geometry and RT in this case (unless you'd do a variant of the solution I proposed above).

JoeJ said:
It's probably not practical, and thus we might never see RT support for tessellation or mesh shaders. Although MS mentioned this as a potential future feature.

For tessellation shaders there might be a way - but there would be some limitations on what you can do within the tessellation shader, and some additional inputs might be needed. Either it's a dynamic-LOD solution, which will need on-demand BVH building, or you precompute the BVH (which kind of kills the purpose of dynamic tessellation).

The on-demand building will require caching it and will inevitably lead to stalls. It can be fine for non-real time rendering, but for interactive/real time … nah.

Mesh shader support is even more complicated. While it is a compute shader and you could output into a buffer (and then use it) - which is basically what everyone ray tracing dynamic geometry does now - it has the big problem of working on meshlets. These do not map well to high-quality acceleration structure leaves, which inevitably leads to (sometimes very significant) performance drops.

JoeJ said:
Let's say we generate a small patch of geometry once a ray hits its bounding box. Do we need to build a mini BVH as well? Otherwise we have to brute force over all triangles.

It is necessary to build an acceleration structure (on demand in this case), unless we operate on only a handful of triangles (and even then, if they are scattered, this causes a massive increase in ray intersection tests over a large area - bad for performance).

This BVH could be cached, BUT that doesn't mean it is good, and other rays will need to wait during traversal until the build is finished (i.e. stalls).

JoeJ said:
How do we cache this geometry so it is still available when other rays hit the box later?

There is only one way to cache it - in memory. Yeah… that means adding memory latency to every operation while building it. Using on-chip cache would only work for the current group of rays. Building an acceleration structure is a complex operation; you don't want to repeat that “per workgroup”.

JoeJ said:
How can we put the ray on pause until the generation and building is completed?

Stalls… Even if you would swap the current ray out of the threads in the workgroup and traverse a different ray instead (which itself is crazy - and would end up in massive register pressure due to having the traversal stack and other local variables stored multiple times per thread), nothing guarantees that the new ray won't intersect the same node in the BVH.

Therefore you would need to wait.

JoeJ said:
That's all really complicated i think. I guess it's better to pre-transform/build everything, but using LOD to keep memory and workload bounded. This way we solve the caching problem with long termed reuse in mind, and avoid wasting resources to do it for every frame or not at all.

Which is why I still use the approach with additional buffers. Deformable geometry has input geometry and deformation data, and N output buffers (one per possible instance of the deformable geometry). Also, culling won't help you here at all - so you have to have a buffer per single object with the given deformable geometry in the scene.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com


Vilem Otte said:
You broke the base premise in the first sentence - the assumption is that we have thousands of soldiers on screen (which is common large-scale battles like in Total War games).

I know. It's clear why instancing makes sense for models sharing the same mesh and deformation.
But if I broke the premise, you did the same earlier already, since the question was about a tuple of (Mesh, Instance, AnimationState), and your example then refers to multiple appearances of the same tuple.

In general that's probably a rare special case, and it won't affect decision making in engine development.
But of course it's worth mentioning that such special cases can exist in practice, like foliage, to give another example.

Vilem Otte said:
For tessellation shaders there might be a way

NV already has this, sort of: their DMM micro-meshes, which are triangle subdivision with optional displacement.
They also use it to accelerate binary alpha testing, by classifying each micro triangle as fully transparent, fully opaque, or something in between that requires a texture-based alpha test only where needed. Because the geometry layout is implicit and the displacement is only a scalar, there is no need for a mini BVH per triangle.

It's nice to amplify details, but sadly it does nothing against the larger problem that we cannot reduce the base mesh without needing to rebuild the entire BLAS. It can do some LOD, but only in one direction: moooaaarrr details, so more enthusiasts spend $2K on an oven with a DisplayPort. Wake me up once they come up with real solutions…

Still, regarding detail amplification on the fly, I guess that's the most we can expect from a HW solution.

Vilem Otte said:
Building acceleration structure is complex operation, you don't want to repeat that “per-workgroup”.

I actually do this. :D
Each workgroup builds an acceleration structure on the fly, then traces rays using only that, ignoring the global BVH I also need for other purposes.
Traversing the global BVH would be much too slow for my case - even now with HW acceleration, I guess.
But now it's me talking about special cases : ) In general I agree.

JoeJ said:
But if i broke premise, you did the same earlier already, since the question was about a tuple of (Mesh, Instance, AnimationState), and your example would then refer to multiple appearances of the same tuple.

True - as long as the tuples are unique, you have to (as you said) build every single one of them into a buffer. As for it being a rare special case - practically it doesn't matter, because the command signature for indirect draws contains an instance count anyway.

JoeJ said:
NV already has this, sort of. Their DMM micro meshes, which is triangle subdivision with optional displacement.

The question is what the purpose of this is - because with ray tracing it's not that big a deal to throw high-resolution geometry at it (as long as it's not deformable geometry, you can ideally pre-compute the acceleration structures). Of course, then there is the problem of hardware ray tracing as a whole, which at this point is done completely wrong (especially in regard to acceleration structures).

My bet is that it was not thoroughly thought through and was made quickly as a cash grab by NVIDIA, with RTX as a selling point.

The black box behavior is not just bad, it is blatantly wrong - eventually the specs will change and implementations built on it will have to be redone (plus a big sigh at the driver size that will have to implement this “fixed-function”-like behavior in the future).

This is even more sad from the point of view of a developer who has worked with ray tracing for quite a long time and used NVIDIA's libraries (the Aila & Laine one, for example), which had very solid foundations, were extensible, etc. … and then we get this as a result.

Overall, I could just throw away large parts of the code we have for handling acceleration structures and tell every customer: "Just get the newest NVIDIA high-end card and hope it increases performance a little bit. Do that every generation and eventually you will reach the performance you seek." Sadly, I haven't seen any official statement offering affiliate marketing for NVIDIA though - that would at least give me financial motivation to do this.

Note: Sorry for the rant… It's just the sad state of ray tracing being used as a cash grab, while not solving any of the performance problems we have.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

