SVOGI Implementation Details

Josh Klint · 2022-09-20T06:25:16

I'm on the last leg of my implementation of sparse voxel octree global illumination and am currently working on specular reflection. The reflections I'm seeing right now come from a single cone sample along the reflection vector. (To do GI, you would take several cone samples around the surface normal.) My question: how do I take the reflection data I have and turn it into a soft, pleasing approximation of specular reflection? All the material I have read skips over this part of the implementation. I understand the technique isn't exact and a bit of artistry is required, but how can I improve this?

Josh Klint

Author

1,472

February 02, 2022 03:13 PM

You still need the sparse octree because it massively reduces memory bandwidth, and during the lighting phase you only have to process the solid voxels.

If that data then gets copied into a volume texture, rendering to all slices of that volume texture is going to be the new bottleneck…

10x Faster Performance for VR: www.ultraengine.com

Vilem Otte

3,390

February 02, 2022 08:52 PM

JoeJ said:
It's kinda depressing you face the usual problems, making volume texture a win over acceleration structures again. :(

I can only follow up on this. For my whole implementation I use standard volume texture(s). You can cascade them, and having cascaded volume textures for LOD is a winner here (which technically is a sort of ‘larger blocks’ taken to an extreme). My SVO implementation was just way too slow. Yes, it allows for very high-resolution results, but cascading the volume does the same thing, better and faster. And because nothing is duplicated, it also beats Crassin's original approach in pretty much all scenarios (at least in my implementation).

I'm currently able to do full-resolution GI and specular reflections alongside the rest of the pipeline for moderately sized scenes, even on hardware that is a few years old; on my current box with a Radeon 6800 it holds a stable 60 fps at 4K resolution. (I had some test videos with Sponza up somewhere. Temporal filtering of the voxel data with some factor yielded the smoothest results, even for animated scenes.)
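The post does not say which temporal filter is used, so the following is only a guess at the simplest form it could take: a per-voxel exponential moving average written as a CPU-side C++ sketch. The function name `temporalFilter` and the flat float buffers are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blend freshly voxelized data into a history buffer, in place.
// factor = 1 keeps only the new frame, factor = 0 keeps only the history.
// ASSUMPTION: an exponential moving average; the thread does not name the filter.
inline void temporalFilter(std::vector<float>& history,
                           const std::vector<float>& current,
                           float factor)
{
    assert(history.size() == current.size());
    assert(factor >= 0.0f && factor <= 1.0f);
    for (std::size_t i = 0; i < history.size(); ++i)
        history[i] += (current[i] - history[i]) * factor; // lerp(history, current, factor)
}
```

A small factor smooths flicker from moving geometry but makes lighting lag behind, so the value is a tuning knob rather than something with one right answer.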

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Josh Klint

Author

1,472

February 03, 2022 06:19 AM

It looks like a choice must be made:

Sparse Voxel Octree

Voxelized on CPU
Low memory usage
GI is done with light propagation
High-resolution reflections, almost like raytracing, but blurry reflections not supported
High latency (animation and motion supported poorly)

Cascaded Volume Textures

Voxelized on the GPU
High memory usage
GI is done with cone step tracing
Blurry reflections supported, sharp reflections not as high-res as SVO
Low latency (animation and motion supported well)

To me, the soft indirect specular reflections are the most impressive aspect of this, so I lean towards the second option. And having sharp reflections of the voxelized scene is not that great.
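For anyone following along, here is a rough CPU-side C++ sketch of why cone tracing through a mipmapped volume gives blurry reflections: the cone gets wider with distance, and the sample is taken from the mip level whose voxel size matches the cone diameter. Everything here (`coneMipLevel`, `traceCone`, the `VoxelSampler` callback, the half-diameter step rule, premultiplied colors) is an illustrative assumption, not code from anyone in this thread. A narrow cone along the reflection vector gives a sharp reflection, a wider one a rougher reflection.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>

struct Float3 { float x, y, z; };
struct Float4 { float r, g, b, a; };

// Hypothetical sampler: premultiplied color and opacity at a position and mip level.
using VoxelSampler = std::function<Float4(Float3, float)>;

// Mip level whose voxel size matches the cone diameter at the given distance.
inline float coneMipLevel(float distance, float tanHalfAngle, float voxelSize)
{
    assert(voxelSize > 0.0f);
    float diameter = std::max(voxelSize, 2.0f * tanHalfAngle * distance);
    return std::log2(diameter / voxelSize);
}

// March one cone and composite the samples front to back.
inline Float4 traceCone(const VoxelSampler& sample, Float3 origin, Float3 dir,
                        float tanHalfAngle, float voxelSize, float maxDistance)
{
    Float4 acc = {0.0f, 0.0f, 0.0f, 0.0f};
    float t = voxelSize; // start one voxel out to avoid sampling the surface itself
    while (t < maxDistance && acc.a < 0.99f)
    {
        float mip = coneMipLevel(t, tanHalfAngle, voxelSize);
        Float3 p = {origin.x + dir.x * t, origin.y + dir.y * t, origin.z + dir.z * t};
        Float4 s = sample(p, mip);
        float w = 1.0f - acc.a; // remaining transparency
        acc.r += w * s.r;
        acc.g += w * s.g;
        acc.b += w * s.b;
        acc.a += w * s.a;
        t += 0.5f * voxelSize * std::exp2(mip); // step by half the cone diameter
    }
    return acc;
}
```

Mapping material roughness to `tanHalfAngle` is where the artistry comes in; the sketch leaves that choice to the caller.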

JoeJ

4,399

February 03, 2022 10:55 AM

I can add some downsides:

Octree:

Difficult to include dynamic objects and characters, so while the sharp reflections are impressive, they would only amplify the bad impression / artifacts of dynamic objects either missing from the reflections or being handled in screen space instead.
Light propagation happening at a lower LOD would miss out on the extra accuracy, so there is no real visual win over cone tracing. LPV can handle participating media, though.

Volume:

Cone tracing means very approximated occlusion, causing light leaks and restrictions on level design. (But i had this same issue when i experimented with light propagation cascades)
Grid alignment can be noticeable even if modeling only low frequencies.
No robust differentiation of surface vs. volume, so high quality GI is not really possible.

I would tend towards the latter maybe, but i would not be happy.

Josh Klint

Author

1,472

February 09, 2022 08:02 PM

I have completed voxelization taking place on the GPU:

Is there some clever trick for efficiently downsampling the volume texture? It seems like going through a volume texture and rendering a pass for each slice would be pretty slow.

JoeJ

4,399

February 09, 2022 09:48 PM

Josh Klint said:
Is there some clever trick for efficiently downsampling the volume texture? It seems like going through a volume texture and rendering a pass for each slice would be pretty slow.

I would use a compute shader, read a small volume block into LDS (groupshared memory), and generate 2-3 (or more) mips from that.
The higher mips mean idle threads, but also lower bandwidth and fewer dispatches.

Edit: This could also do texture compression (which i guess also works with 3D textures).
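To put a number on the "fewer dispatches" point, here is a small C++ sketch that counts how many compute dispatches a full mip chain needs when each dispatch produces a fixed number of levels. It assumes a cubic, power-of-two volume; the function names are made up for this example.

```cpp
#include <cassert>

// Number of mip levels below the base level of a cubic volume.
// ASSUMPTION: baseSize is a power of two.
inline unsigned mipLevelsToGenerate(unsigned baseSize)
{
    assert(baseSize > 0 && (baseSize & (baseSize - 1)) == 0);
    unsigned levels = 0;
    while (baseSize > 1)
    {
        baseSize >>= 1;
        ++levels;
    }
    return levels;
}

// Dispatches needed when one dispatch writes `levelsPerDispatch` mip levels.
inline unsigned dispatchesNeeded(unsigned baseSize, unsigned levelsPerDispatch)
{
    assert(levelsPerDispatch > 0);
    unsigned levels = mipLevelsToGenerate(baseSize);
    return (levels + levelsPerDispatch - 1) / levelsPerDispatch; // round up
}
```

For a 256^3 volume this gives 8 dispatches at one level each, 4 at two levels each and 3 at three levels each.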

Vilem Otte

3,390

February 09, 2022 11:59 PM

What I currently do (no texture compression - for a reason though):

cbuffer InputDimensions : register(b0)
{
	uint3 dimensions;
}

cbuffer InputMiplevels : register(b1)
{
	uint srcMiplevel;
	uint miplevels;
	float texelSize;
}

SamplerState srcSampler : register(s0);
Texture3D<float4> srcLevel : register(t0);
RWTexture3D<float4> mipLevel1 : register(u0);

groupshared float4 tmp[8];

void StoreColor(uint idx, float4 color)
{
	tmp[idx] = color;
}

float4 LoadColor(uint idx)
{
	return tmp[idx];
}

float HasVoxel(float4 color)
{
	return color.a > 0.0f ? 1.0f : 0.0f;
}

// Naive version to generate single mipmap level from the previous one
//
// Runs in 2x2x2 workgroup
[numthreads(2, 2, 2)]
void GenerateMipmaps(uint GI : SV_GroupIndex, uint3 DTid : SV_DispatchThreadID)
{
	// Each thread in the workgroup loads a single voxel and stores it in groupshared memory
	float4 src0;
	float4 src1;
	float4 src2;
	float4 src3;
	float4 src4;
	float4 src5;
	float4 src6;
	float4 src7;
	float3 uvw = (DTid.xyz + 0.5f) * texelSize;
	src0 = srcLevel.SampleLevel(srcSampler, uvw, (float)srcMiplevel);
	StoreColor(GI, src0);
	GroupMemoryBarrierWithGroupSync();

	// For first thread in workgroup only
	if (GI == 0)
	{
		// Load all 8 colors from shared memory
		src1 = LoadColor(GI + 0x01);
		src2 = LoadColor(GI + 0x02);
		src3 = LoadColor(GI + 0x03);
		src4 = LoadColor(GI + 0x04);
		src5 = LoadColor(GI + 0x05);
		src6 = LoadColor(GI + 0x06);
		src7 = LoadColor(GI + 0x07);

		// Perform mipmapping function
		float div = HasVoxel(src0) + HasVoxel(src1) + HasVoxel(src2) + HasVoxel(src3) + HasVoxel(src4) + HasVoxel(src5) + HasVoxel(src6) + HasVoxel(src7);

		if (div == 0.0f)
		{
			src0 = 0.0f;
		}
		else
		{
			src0 = (src0 + src1 + src2 + src3 + src4 + src5 + src6 + src7) / div;
		}

		// Store value
		mipLevel1[DTid / 2] = src0;
	}
}

This is the single-level variant. I also have variants that build 2 and 3 mip levels at once: they work like the code above but use a (4,4,4) or (8,8,8) group, which reduces the number of times the compute shader has to be dispatched. You have to properly select which threads continue at each step (use masks) and calculate from there. This is basically equivalent to how you can do mipmap generation for 2D images.
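The reduction in the shader above can be checked on the CPU. The C++ sketch below mirrors its mipmapping function (sum of the 8 children divided by the number of children with alpha > 0) and then applies it twice to a 4x4x4 block, which is roughly what a two-level variant would compute. This is not the author's multi-level shader; the `x + 4*y + 16*z` block layout and all names are assumptions for illustration.

```cpp
#include <array>
#include <cassert>

struct Float4 { float r, g, b, a; };

// Same rule as the HLSL above: average over occupied children only.
inline Float4 downsampleBlock(const std::array<Float4, 8>& src)
{
    Float4 sum = {0.0f, 0.0f, 0.0f, 0.0f};
    float occupied = 0.0f;
    for (const Float4& c : src)
    {
        assert(c.a >= 0.0f); // opacity is expected to be non-negative
        sum.r += c.r;
        sum.g += c.g;
        sum.b += c.b;
        sum.a += c.a;
        if (c.a > 0.0f)
            occupied += 1.0f;
    }
    if (occupied == 0.0f)
        return {0.0f, 0.0f, 0.0f, 0.0f};
    return {sum.r / occupied, sum.g / occupied, sum.b / occupied, sum.a / occupied};
}

struct TwoLevels
{
    std::array<Float4, 8> mip1; // 2x2x2 result
    Float4 mip2;                // 1x1x1 result
};

// Reduce a 4x4x4 block (index = x + 4*y + 16*z) to 2x2x2 and then to a single voxel.
inline TwoLevels downsampleTwoLevels(const std::array<Float4, 64>& src)
{
    TwoLevels out{};
    for (int b = 0; b < 8; ++b)
    {
        int bx = b & 1, by = (b >> 1) & 1, bz = (b >> 2) & 1;
        std::array<Float4, 8> children;
        for (int c = 0; c < 8; ++c)
        {
            int x = 2 * bx + (c & 1);
            int y = 2 * by + ((c >> 1) & 1);
            int z = 2 * bz + ((c >> 2) & 1);
            children[c] = src[x + 4 * y + 16 * z];
        }
        out.mip1[b] = downsampleBlock(children);
    }
    out.mip2 = downsampleBlock(out.mip1);
    return out;
}
```

Note that dividing by the occupied count keeps a lone voxel at full brightness in every coarser level, which is a property of the rule in the shader, not something added here.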

In terms of performance… Voxelization of Crytek Sponza (realtime), including mipmapping, with multiple dynamic objects and a few skinned meshes takes about 1ms for 256^3 and about 5ms for 512^3. I believe you can do better than me, as I'm writing quite a lot of data into the 3D texture. This timing is on a Radeon 6800; on my previous GPU (Radeon 590) the timing was about the same, which I believe indicates that it is fillrate bound and not compute bound. Resolving GI is MUCH faster on the new 6800.

JoeJ

4,399

February 10, 2022 08:58 AM

Vilem Otte said:
[numthreads(2, 2, 2)]

This would mean only 8 out of 64 or 32 threads do some work, the rest remains idle?
Or does API and compiler automatically pack multiple small workgroups so a whole wavefront is busy?

I think the programmer has to do this on his own, and using ‘too small’ workgroups is a typical beginner mistake.
But some time back I pointed this out to @taby, who had made the same ‘mistake’ of using (1,1,1) workgroups in an isosurface shader. He fixed it, but saw no speedup.
I could not believe it and fixed it myself in his GitHub project, but indeed there was no speedup. Seems the compiler is smarter than I think?

Maybe you can solve this long-standing mystery. I'm definitely not the only one stumbling over this. :D

Vilem Otte

3,390

February 10, 2022 12:54 PM

So - this code practically keeps all 8 threads busy for a bit, and then 7 of them idle until the workgroup finishes. In the “2-levels-in-kernel” case you have 64 threads (4x4x4): all 64 are busy to start with, then only 8 are busy, and then just a single one again. In the “3-levels-in-kernel” case you have 512 threads (8x8x8), but all of them are busy only for a tiny fraction of the time (the first load); after that it is 64 busy, then 8, then 1. Generating mip maps this way, as you clearly see, ends up in quite a lot of idle time.
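The busy-thread counts above follow a simple pattern, which the C++ sketch below reproduces. The utilization figure it computes assumes every phase takes the same amount of time, which is almost certainly not true on real hardware (the first phase is a texture load), so treat it as a rough illustration only. Function names are made up for this example.

```cpp
#include <cassert>
#include <vector>

// Busy threads in each phase of a group that builds `levels` mip levels:
// the initial load, then one reduction step per level.
inline std::vector<unsigned> busyThreadsPerPhase(unsigned levels)
{
    assert(levels >= 1 && levels <= 3);
    std::vector<unsigned> phases;
    unsigned threads = 1u << (3 * levels); // 8, 64 or 512 threads per group
    for (unsigned i = 0; i <= levels; ++i)
    {
        phases.push_back(threads);
        threads /= 8; // each step keeps one thread per 2x2x2 block
    }
    return phases;
}

// Fraction of thread slots doing work.
// ASSUMPTION: all phases have equal duration.
inline double averageUtilization(unsigned levels)
{
    std::vector<unsigned> phases = busyThreadsPerPhase(levels);
    double busy = 0.0;
    for (unsigned p : phases)
        busy += p;
    return busy / (static_cast<double>(phases.size()) * phases.front());
}
```

Under that equal-duration assumption, a 3-level group keeps under a third of its thread slots busy, which matches the "quite a lot of idle time" observation.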

Now a timing comparison of the 3 variants I use:

VoxelMipmap.hlsl - https://pastebin.com/cNwKn93R - 8.70ms

VoxelMipmap2.hlsl - https://pastebin.com/4a8SasbG - 2.90ms

VoxelMipmap3.hlsl - https://pastebin.com/wxxY62Zq - 2.98ms

This ran on Radeon 6800, which with RDNA has 32 threads in wavefront. Sadly, I don't have a way to run it on an NVIDIA GPU to compare. I might be able to measure GCN results (where, if I remember correctly, the VoxelMipmap3.hlsl variant was the fastest).

JoeJ

4,399

February 10, 2022 01:23 PM

Vilem Otte said:
Generating mip maps this way, as you clearly see, ends up in quite a lot of idle time.

That's unavoidable, but you could manually compact 8 x (2,2,2) into a single GCN wavefront to get a potential 8x speedup regardless.
Even though the example is memory bound, I would be sure it's a win (if it weren't for that strange case of taby's shader before, which was OpenGL).
Drivers doing such compaction automatically is highly unlikely, as it would break subgroup functions.
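One way the manual compaction could be indexed, sketched in C++ on the CPU side: a flat thread index in a 64-thread group is split into a block number (which of eight 2x2x2 blocks) and a lane (which voxel inside that block), so that each run of 8 consecutive threads covers one block of a 4x4x4 region. The struct and function names are hypothetical, and other layouts would work equally well.

```cpp
#include <cassert>

struct UInt3 { unsigned x, y, z; };

// Map thread t in [0, 64) to a voxel inside a 4x4x4 region.
// block = t / 8 picks the 2x2x2 block, lane = t % 8 the voxel inside it.
inline UInt3 packedThreadToVoxel(unsigned t)
{
    assert(t < 64);
    unsigned block = t >> 3;
    unsigned lane = t & 7;
    return {2 * (block & 1) + (lane & 1),
            2 * ((block >> 1) & 1) + ((lane >> 1) & 1),
            2 * ((block >> 2) & 1) + ((lane >> 2) & 1)};
}
```

With this layout the thread with `lane == 0` in each run of 8 plays the role of the `GI == 0` thread in the single-block shader, so 8 threads per 64-thread group stay active for the reduction instead of 1 per 8-thread group.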

Vilem Otte said:
RDNA has 32 threads in wavefront.

There is also a mode that joins two groups of 32 into a 64-thread wavefront. I forget what they call it. On PC we never know which choice the driver makes, but it was said that compute shaders mostly still use the 64 mode, while other shaders mostly use 32.
A CS probably uses 32 if the workgroup size is smaller than 64, ofc.

One thing i never tested personally is to compact busy vs. idle threads.
E.g. we have a workgroup of 256 threads, and after some time only 64 keep busy.
If we manage to have all 64 busy threads in sequence 0-63, and the idle threads at indices 64-255, ideally the GPU can skip over the idle wavefronts quickly and there is no need to execute multiple instructions for wavefronts completely masked out.
Not sure if this really works, but i know a guy who tried it on GCN and he said it helped.
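The idea can at least be counted on paper. The C++ sketch below takes a per-thread busy mask and counts how many wavefronts contain at least one busy thread, i.e. how many the GPU would still have to issue instructions for. Whether the hardware really skips fully idle wavefronts cheaply is exactly the open question above; the function name and the mask representation are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count wavefronts that contain at least one busy thread.
inline unsigned activeWavefronts(const std::vector<bool>& busy, unsigned waveSize)
{
    assert(waveSize > 0);
    unsigned count = 0;
    for (std::size_t base = 0; base < busy.size(); base += waveSize)
    {
        bool any = false;
        for (std::size_t i = base; i < base + waveSize && i < busy.size(); ++i)
            any = any || busy[i];
        if (any)
            ++count;
    }
    return count;
}
```

For 256 threads with 64 busy, packing the busy threads into indices 0-63 leaves 1 active wavefront at wave size 64, while spreading them over every 4th index keeps all 4 wavefronts active.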
