Advertisement

Sparse voxel octree traversal performance advantage?

Started by January 03, 2022 02:43 PM
13 comments, last by JoeJ 3 years ago

Very roughly, what kind of performance gains do you see using a sparse voxel octree for ray traversal, vs. just walking through grid cells one-by-one? With a 512x512x512 grid and 600,000 voxels, the CPU is taking 719 milliseconds with the step-through method and 300 milliseconds using the SVO method, which can eliminate big chunks of the scene at once.

At a lower resolution of 256x256x256 the SVO method is only slightly faster (130 vs 150 ms).

I was really expecting to see more like an order-of-magnitude improvement using the SVO. This is a very inexact description, but general speaking what has your experience been with the performance of SVOs vs. a dumb grid?

10x Faster Performance for VR: www.ultraengine.com

I use SVO and DAG for volume compression, but i lack experience with tracing that. Have it for BVH, though.
Probably the problem is the cache misses from the pointer chasing in traversal. A good solution is to share the traversal results, which is easy for range queries but hard for tracing.
For raytracing this could be implemented by ray packets, where a larger number of rays, e.g. bounded by a frustum or cone, process the same data which was expensive to find.
It's then a trade off between work efficiency (rays process redundant data they would't if traced in isolation) and better memory access (less random access, better chance to hide the remaining cache misses).

As the packet is processing, rays may reduce in number after hitting stuff, and after some time we no longer have enough rays to get any win. This can also cause empty SIMD lanes or GPU threads.
To counteract this, a reordering step might be used to form new packets from sets of still active rays. This could run each time after some fixed number of traversal steps.
Its an expensive task on it's own, complexity is quite insane, and again it depends on scene complexity and tracing workload at what point such things become worth it.
It's more common to do no reordering, but only make sure initial ray packets are made of pretty coherent rays.

Finally, it should help if your SVO nodes are ordered in memory by some cache friendly order, like morton order / Z-curve. But this usually happens automatically when building octrees, so i guess you do that already.

Edit: Oh, i forgot: Another way to reduce traversal costs is to use a larger branching factor. So instead a node having 2^3 children, it can have 4^3. This is referred as a ‘Sparse Grid’ from the fluid sim community. The top level can be a small regular grid instead a single root node. Otherwise the implementation remains the same as octree, so it should be easy to try. Due to the large branching factor, i often end up with only two levels of hierarchy even for large domains like 1024^3. Octree would need 9 levels for that? Quite a difference, and compression is still pretty good.

Advertisement

I think a big area of cost savings is the fact that 8 AABBs in a cluster can be evaluated more efficiently than 8 separate arbitrary AABBs.

Given the entry point of the ray, it should be possible to determine which voxels are intersected, and their entry point, using just three plane intersection tests…

10x Faster Performance for VR: www.ultraengine.com

Josh Klint said:
Given the entry point of the ray, it should be possible to determine which voxels are intersected, and their entry point, using just three plane intersection tests…

You also can setup DDA once per ray, then no plane intersections are necessary. Might be still faster on modern HW: https://www.scratchapixel.com/lessons/advanced-rendering/introduction-acceleration-structure/grid
I remember i did this for a software rasterizer for early smart phones. The scanline was a 2D ray, intersecting a quadtree representing a compressed lightmap texture. So i did not had to fetch this for every pixel, and it was faster although higher complexity.

It's also possible to determinate the sorted near to far order of visiting the children of a node just once per ray.

Okay, with the three plane intersections approach I got it down to 220 in my first test above. But I guess I can just sort of walk through the grid and round off the current floating point coordinate to the nearest integer coord. I think that is basically what your technique above is doing.

So tired…. XD

10x Faster Performance for VR: www.ultraengine.com

Josh Klint said:
But I guess I can just sort of walk through the grid and round off the current floating point coordinate to the nearest integer coord. I think that is basically what your technique above is doing.

The motivation of the DDA method is to precalculate intervals for plane intersections for each dimension, which then remain constant. After that, we can find all intersections in order just by doing additions, and - if we need multiple levels of a hierarchy - multiplications (or bit shifts if using integer math).
So you get rid of ‘expensive’ divisions, which my ARM based mobile for example could not even do in hardware. Not sure how's the win on modern hardware in a situation where memory is the bottleneck, but should be still something.

The same idea is used for example to implement the standard slab test to intersect a ray vs. an AABB:

void DistanceRayAABBFrontAndBackface (float &ffd, float& bfd, const vec& rayOrigin, const vec& rayInvDirection, const vec& aaBoxMin, const vec& aaBoxMax) const
		{
			vec t0 = vec(aaBoxMin - rayOrigin).MulPerElem (rayInvDirection); // those vector class functions boil down to a single SIMD instruction
			vec t1 = vec(aaBoxMax - rayOrigin).MulPerElem (rayInvDirection);
			vec tMin = t0.MinPerElem (t1);
			vec tMax = t0.MaxPerElem (t1);
			ffd = tMin.MaxElem(); // front face distance (behind origin if inside box) 
			bfd = tMax.MinElem(); // back face distance	
		}
	
// we do division just once to set up the ray for traversal	
vec rID (1.f / givenRayUnitDirection.x, 1.f / givenRayUnitDirection.y, 1.f / givenRayUnitDirection.z); 

// traversal...
DistanceRayFrontAndBackface (ffd, bfd, givenRayOrigin, rID, aaBoxMin, aaBoxMax);
if ((ffd <= bfd) & (ffd >= 0.0f) & (ffd <= rayLength))
{
	// box was hit
}

You see there is o need for a dot product, because our planes are axis aligned. (I've later found the exact same code in Newton Physics and other sources, seems everybody does it the same way.)
That's nice for BVH where boxes have arbitrary positions. But for Octree the box positions are also snapped to reguler intervals, so DDA is opportunity to optimize further.

Josh Klint said:
So tired…. XD

What do you do with this? Now when we have sooo awesome RT HW acceleration? :D

Advertisement

Efficiency never goes out of style. If we have great hardware that can do ray tracing, the same hardware can use SVOs to do more.

10x Faster Performance for VR: www.ultraengine.com

Josh Klint said:
Efficiency never goes out of style. If we have great hardware that can do ray tracing, the same hardware can use SVOs to do more.

But the HW we have now isn't great. Or to be precise - the API isn't.
Actually i can not even use DXR at all to trace my stuff, because my geometry has fine grained LOD. Somewhat similar to UE5 Nanite.
But the BVH is black boxed. You can not modify it to reflect geometry changes. You can only tell the API to rebuild it entirely from scratch, just because a small patch of surface has changed it's detail.
So, if you aim to solve one of the two main graphics problems, which is LOD, you can no longer use the HW built to solve the second, which is visibility.
Because it's completely impossible to rebuild BVH for the whole scene in realtime each frame. What were they thinking? Make a HW acceleration for ray tracing, prevent any flexibility so hinder progress more than helping it, just to make people pay premium for useless crap?

Fortunately Epic has the same problem, so there is a tiny chance they recognize their mistake and do some improvements. After 10 years, when the PC platform is already a thing of the past due to declaring enthusiast HW and prices the norm.

On consoles the situation is better. BVH isn't blackboxed but accessible. (ofc. that's easy for them because there is only one GPU vendor, but i would happily deal with multiple BVH formats if i could.)
I could just use my own BVH that i already have for both GI and LOD, stream it from disc, convert it to vendor data structures and it would just work. Without a need to ever build any BVH at runtime at all. Just like it should be.
Seemingly PC players don't need such efficient solutions. They have RTX3000$! That's so powerful, and would be bored if you could not throw some redundant work at it, like building damn acceleration structures for static content at runtime. Makes all sense.

(It feels so nice to rant about this over and over again. I'll continue until they admit their shortsightedness and failure :D )

In short: Efficiency did go out of style already since quite some time, like it or not. GPUs won't be able to support custom acceleration structures anytime soon, i'm afraid.
You would need to implement your own raytracing in compute, so bypassing acceleration.
For me that's indeed the most practical option :/

I did not understand the technique here at first, but I think I am coming to the same approach, from a slightly different understanding:
https://www.scratchapixel.com/lessons/advanced-rendering/introduction-acceleration-structure/grid

First I determine the major axis (the longest one) and I create a line equation for the remaining two axes as a function of the major axis, like everyone learned in high school geometry:

Y = mX + b

With that equation, you can determine the height at each point the line crosses cells horizontally, and you can reverse that equation to determine at what horizontal position the ray crosses each cell vertically.

Something like that. I will have to pick this back up tomorrow.

10x Faster Performance for VR: www.ultraengine.com

I'm currently using this function:
https://gamedev.stackexchange.com/a/18459

With 4 million voxels stored in a 2048x2048 texture, performance is slow:

10x Faster Performance for VR: www.ultraengine.com

This topic is closed to new replies.

Advertisement