
SVOGI Implementation Details


What I tried was using a vertex shader to process every voxel in the next volume texture mip level. My volume texture is 256x256x256, so the next mip level is 128x128x128, or 2,097,152 voxels. I created a point mesh with that many vertices and ran a vertex shader with this code:

int resolution = 128;
ivec3 coord;

// Unflatten the vertex index into 3D texel coordinates of the 128^3 destination mip level
coord.x = gl_VertexIndex / (resolution * resolution);
coord.y = gl_VertexIndex / resolution - coord.x * resolution;
coord.z = gl_VertexIndex - (coord.x * resolution * resolution + coord.y * resolution);

// Sample the center of the corresponding 2x2x2 block in the 256^3 source level
// (with a linear filter this yields the average of the block), then write it to the next mip
vec4 color = textureLod(texture3DStorageSampler[textureID], (vec3(coord) + 0.5) / float(resolution), 0.0f);
imageStore(imagearray3D[textureID + 1], coord, color);

The framerate dropped to single digits. I think my next attempt will be to use a fragment shader that downsamples 16 or so layers of the volume texture per pass, so the 128x128x128 mip level would take 8 passes to downsample. I could also render to slices of the volume texture instead of using imageStore().

It looks like you are using a compute shader. Why would that be faster? Is there some optimal way to set up the downsample pass?

10x Faster Performance for VR: www.ultraengine.com

Josh Klint said:
It looks like you are using a compute shader. Why would that be faster? Is there some optimal way to set up the downsample pass?

With a compute shader, the memory you read is shared within the block of mips it generates. Any other shader stage has to read the same data much more often.
A simple average from LDS should be faster even than a texture filter from cached VRAM. The TMU is pipelined (afaik), but LDS can be processed in parallel at full rate.
There is also no need to set up attribute interpolators and triangles, no need to respect draw order, etc. All this rasterization stuff you do not really need here.


JoeJ said:
All this rasterization stuff you do not really need here.

It's not just rasterization - the full pipeline at least goes like this (in short):

Input Data → Vertex Shader → Primitive Assembly → Clipping → Rasterization → Fragment Shader → Output to frame buffer

This is just wasted computation power (a LOT of it - reminds me of the time I lost a bet, wrote my own software rasterizer and pushed it to GitHub, only to realize just how much work there is to do).

Anyway, not to be completely off topic: in the sample code I've posted (you can see the same pattern in the 2-level and 3-level versions), reading from the top:

...
// This declares workgroup-shared memory of 8 float4 values (128 bytes) that is shared among the whole 2x2x2 workgroup of threads
groupshared float4 tmp[8];

void StoreColor(uint idx, float4 color)
{
	tmp[idx] = color;
}

float4 LoadColor(uint idx)
{
	return tmp[idx];
}

...

[numthreads(2, 2, 2)]
void GenerateMipmaps(uint GI : SV_GroupIndex, uint3 DTid : SV_DispatchThreadID)
{
...
	// Each thread in the workgroup does this work. It loads a single voxel at a specific coordinate in the 3D texture
	src0 = srcLevel.SampleLevel(srcSampler, uvw, (float)srcMiplevel);
	
	
	// This value is then stored in groupshared memory at a specific index
	StoreColor(GI, src0);
	
	
	// At this point we wait for ALL threads within the workgroup to finish their memory access. In exact terms:
	// we wait for all 8 threads running this kernel to finish writing into the groupshared tmp memory.
	// Once we continue past this line, tmp holds a copy of the 2x2x2 block of voxels from the 3D texture at this location
	GroupMemoryBarrierWithGroupSync();

	// Once this is done, only the first thread from the workgroup continues, as it needs to average the values
	// in the groupshared tmp memory (which is equivalent to averaging that 2x2x2 block of voxels from the original 3D texture)
	if (GI == 0)
	{
...

In VoxelMipmap2.hlsl (and VoxelMipmap3.hlsl) this is even more significant. I only read from the texture once, and write once per level. All other operations are done within groupshared memory, which in general lives in fast on-chip storage, just like caches do. On GCN and RDNA architectures the LDS is a separate memory (was there a 64 KB limit on GCN? I think there was…), and on some newer NVIDIA GPUs, if I'm not mistaken, the LDS shares physical memory with the L1 cache on the hardware side.

You won't need to go to VRAM (which is slow) apart from the first read and the writes.
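For reference, since the rest of the thread uses GLSL, a rough sketch of the same single-level reduction in GLSL might look something like this (bindings, formats and names here are illustrative, not taken from the actual VoxelMipmap sources):

#version 450

// One 2x2x2 workgroup produces one destination voxel; each thread reads one source voxel.
layout (local_size_x = 2, local_size_y = 2, local_size_z = 2) in;

// Illustrative bindings: srcMip is the higher-resolution level, dstMip the level being written.
layout (binding = 0) uniform sampler3D srcMip;
layout (binding = 1, rgba16f) uniform writeonly image3D dstMip;

shared vec4 tmp[8];

void main()
{
	// Each thread fetches one voxel of its 2x2x2 source block into groupshared memory
	tmp[gl_LocalInvocationIndex] = texelFetch(srcMip, ivec3(gl_GlobalInvocationID), 0);

	// Wait until all 8 threads have written their sample
	memoryBarrierShared();
	barrier();

	// The first thread averages the block and writes a single voxel of the destination mip
	if (gl_LocalInvocationIndex == 0)
	{
		vec4 sum = vec4(0.0);
		for (int i = 0; i < 8; ++i) sum += tmp[i];
		imageStore(dstMip, ivec3(gl_WorkGroupID), sum * 0.125);
	}
}

With one workgroup per destination voxel, a 128^3 destination level would be dispatched as 64x64x64 workgroups (e.g. vkCmdDispatch(cmd, 64, 64, 64) in Vulkan).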

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

That's pretty cool.

I've never used compute shaders before but I've got it nearly working now.

10x Faster Performance for VR: www.ultraengine.com

I've got it working now, and the performance is good. I only use four mipmap levels because anything beyond that is not very useful, so it is actually possible to do all your downsampling in a single pass of a compute shader.
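As a rough illustration of the single-pass idea (a sketch, not my actual shader), a 4x4x4 workgroup can keep the first reduction in shared memory and produce two levels in one dispatch; extending the same pattern to a third and fourth level just adds more barriers and writes:

#version 450

// Illustrative two-level pass: a 4x4x4 workgroup reduces an 8x8x8 block of the source,
// writes level N+1, keeps it in shared memory, then reduces it again for level N+2.
layout (local_size_x = 4, local_size_y = 4, local_size_z = 4) in;

layout (binding = 0) uniform sampler3D srcMip;                    // level N
layout (binding = 1, rgba16f) uniform writeonly image3D dstMip1;  // level N+1
layout (binding = 2, rgba16f) uniform writeonly image3D dstMip2;  // level N+2

shared vec4 tmp[4][4][4];

// Average a 2x2x2 block of the source level starting at 'base'
vec4 Average2x2x2(ivec3 base)
{
	vec4 sum = vec4(0.0);
	for (int z = 0; z < 2; ++z)
	for (int y = 0; y < 2; ++y)
	for (int x = 0; x < 2; ++x)
		sum += texelFetch(srcMip, base + ivec3(x, y, z), 0);
	return sum * 0.125;
}

void main()
{
	ivec3 local = ivec3(gl_LocalInvocationID);
	ivec3 dst1 = ivec3(gl_GlobalInvocationID);

	// First reduction: source -> level N+1, cached in groupshared memory for reuse
	vec4 color = Average2x2x2(dst1 * 2);
	imageStore(dstMip1, dst1, color);
	tmp[local.z][local.y][local.x] = color;

	memoryBarrierShared();
	barrier();

	// Second reduction: level N+1 -> level N+2, one thread per 2x2x2 block of shared memory
	if ((local.x & 1) == 0 && (local.y & 1) == 0 && (local.z & 1) == 0)
	{
		vec4 sum = vec4(0.0);
		for (int z = 0; z < 2; ++z)
		for (int y = 0; y < 2; ++y)
		for (int x = 0; x < 2; ++x)
			sum += tmp[local.z + z][local.y + y][local.x + x];
		imageStore(dstMip2, dst1 / 2, sum * 0.125);
	}
}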

In my reflection code, I use the alpha value of the raycast result to determine how much of the skybox should show up in the PBR reflection. If the alpha value is 0.5, then the sky will be 50% visible in the reflection.
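In shader terms that presumably boils down to something like this (a sketch, assuming the alpha channel accumulates voxel opacity along the cone, so 1 - alpha is the unoccluded fraction; skyboxColor and traced are illustrative names):

// Hypothetical final blend: traced.a == 0 means the cone escaped (all sky),
// traced.a == 1 means the cone was fully occluded by voxels
vec3 specular = mix(skyboxColor, traced.rgb, traced.a);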

Any general tips on how to reduce light leakage and errors? I know this technique will never be 100% perfect, but I'd like to prevent the skybox from showing up inside buildings.

10x Faster Performance for VR: www.ultraengine.com

Josh Klint said:
Any general tips on how to reduce light leakage and errors?

This was the reason I gave up on my voxel GI experiments.
Surfels can prevent leakage robustly: I have a hierarchy of surfels on the surface, which looks like a quadtree. Going higher up the hierarchy, multiple disjoint surfaces merge and issues creep in.
But sticking with the surface example, 4 surfels usually merge into one larger parent surfel. So it's similar to mip maps and allows the same prefiltering.
If you imagine a wall with an area of 2x2 meters, the whole thing may be just one surfel. But because the surfel is flat, it can still approximate the thin wall and prevent leakage.
By contrast, with voxels we may have a 16^3 volume and one slice of solid voxels representing the same wall. A single voxel at the top of the mip chain will have low density, and there is no way to know from which direction light should not pass through.

I tried to improve this by including directional information, so I had a mip chain of normals times surface area encoded in SH2.
But this did not work either, because SH2 cannot represent both the front and back side of a wall when both fall into a single voxel.
To encode front and back we would need SH3, which already has 9 coefficients, but that still fails on complex geometry, e.g. a corner where two walls meet at right angles.
So I gave up on it. The volumetric approach was much slower than the surface approach using surfels anyway, and accuracy was much worse too, so the only advantage of volumes would have been simplicity.

But it's not that I would recommend surfels either. It took me years to make a preprocessing tool to build the surfel hierarchy.
I had to work on the very hard geometry problem of quadrangulation to get a nice quadtree-like topology on arbitrary models.
The tracing is also harder, as it is the same as classical raytracing using a BVH. Not as simple as sphere tracing in a grid.
Though maybe we could use surfels with a world-aligned grid as a compromise. It would be simple, but a bad approximation of geometry, causing bad accuracy and the usual quantization issues.

Maybe it would be an idea to replace my spherical harmonics approach with some quantized format which lacks directional accuracy but prevents leakage successfully. I could imagine something like Valve's ambient cube, distributing a patch of surface to the 3 normal-aligned faces of a voxel cube. A higher mip of such a format could cause over-occlusion but prevent leakage, which would be a success.
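For reference, the ambient cube format stores six values per cell, one per axis-aligned face, and the lookup itself is cheap; a sketch in GLSL (names illustrative):

// Valve-style ambient cube lookup: the squared normal components weight
// the three faces the normal points toward
vec3 SampleAmbientCube(vec3 cube[6], vec3 n)
{
	vec3 nSq = n * n;
	ivec3 isNeg = ivec3(lessThan(n, vec3(0.0)));
	return nSq.x * cube[0 + isNeg.x]
	     + nSq.y * cube[2 + isNeg.y]
	     + nSq.z * cube[4 + isNeg.z];
}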

Otherwise, the only option I see would be to use many rays instead of a single cone, and have high-resolution voxelization even for distant objects. But I guess that's still not practical at the present day, and DXR would just be better in every way.


Btw, the problem is still haunting me to this day. It has just moved to the preprocessing side:

This is the isosurface of a volumetric Sponza model at low resolution. I use volume processing to get rid of hidden surfaces and to avoid other problems coming from partial / incomplete / disjoint game meshes.

Because the floor plate of the model is quite thin, it disappears at higher mips, causing a hole where the floor should be. >:(
Likely there will be some ground below the building, which would solve the issue, but I still need to work on a robust solution regardless.

This truly, truly sucks. Prepare for a never-ending source of voxel frustration… ;)

I've made good progress with this:

Using several cascading volume textures does a good job of keeping the memory usage low. A single 512x512x512 RGBA volume texture uses 512 MB, but if you use four 64x64x64 textures to cover the same area it only consumes 4 MB.
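For reference, assuming an uncompressed 4-bytes-per-voxel format (e.g. RGBA8), the arithmetic works out to:

512 x 512 x 512 x 4 bytes = 536,870,912 bytes ≈ 512 MB
4 x (64 x 64 x 64 x 4 bytes) = 4 x 1,048,576 bytes ≈ 4 MB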

I think the final step will be to make the center of the cascaded volume textures follow the camera around. I will probably need to do an image copy to shift the existing contents of the volume textures over when they change position.
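One way to do that shift (a sketch, not my actual code) is a small compute pass that ping-pongs between two copies of the cascade volume, copying the overlapping region to its new location and clearing whatever scrolled in from outside, which then gets re-voxelized:

#version 450

layout (local_size_x = 4, local_size_y = 4, local_size_z = 4) in;

// Illustrative bindings: read the old cascade contents, write the shifted version
layout (binding = 0, rgba16f) uniform readonly image3D oldVolume;
layout (binding = 1, rgba16f) uniform writeonly image3D newVolume;

// How far the cascade origin moved, in voxels
layout (push_constant) uniform Push { ivec3 shiftInVoxels; } pc;

void main()
{
	ivec3 dst = ivec3(gl_GlobalInvocationID);
	ivec3 src = dst + pc.shiftInVoxels;
	ivec3 size = imageSize(newVolume);

	// Voxels that were already covered before the shift are copied over;
	// anything that scrolled in from outside is cleared and must be re-voxelized
	bool inside = all(greaterThanEqual(src, ivec3(0))) && all(lessThan(src, size));
	imageStore(newVolume, dst, inside ? imageLoad(oldVolume, src) : vec4(0.0));
}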

10x Faster Performance for VR: www.ultraengine.com

I also want to note that I believe Crytek's work with sparse voxel raytracing is probably a generation behind what we're doing now. I played “Kingdom Come” and only saw diffuse GI. Since I have found indirect specular to be impractical with sparse voxel octrees, I don't think their tech supports this. They're probably hamstrung by console support, as we saw back when Leadwerks was the first game engine using deferred lighting.

Using the cascading volumes, I was able to reduce the memory usage to less than 1% of what a single volume texture would use, which is in the same ballpark as, but a bit better than, the numbers I was seeing with sparse voxel octrees (usually 0.5-8% memory usage).

So I think this is a final verdict on the matter of volume textures versus SVO.

10x Faster Performance for VR: www.ultraengine.com

I've got several cascaded stages following the camera around now, but the differences in lighting results between stages are very abrupt:

10x Faster Performance for VR: www.ultraengine.com

