
Is using a geometry shader to implement one-pass shadow map generation a good choice?

Started by Royma, June 02, 2018 05:33 PM
2 comments, last by turanszkij 6 years, 8 months ago

The traditional way needs 6 passes for a point light, and cascaded shadow mapping needs one pass per cascade, to generate the shadow maps. Recently I learned a method that uses a geometry shader to generate all the shadow maps in one pass: I bind a render target and a depth-stencil buffer that are both Texture2DArrays in DirectX 11. It looks much better than the traditional way, I think. But after I implemented it, I found that cascaded shadow mapping runs much slower than the traditional way; the FPS dropped from 60 to 35. I don't know why. I guess maybe I should do some culling, or maybe the geometry shader is just not efficient.

I want to know why it runs slower even though I reduced the draw calls from 8 to 1. Should I abandon this method, or is there a way to optimize it so it runs more efficiently than multi-pass rendering?

Here is the GS code:


// Broadcasts every input triangle to all 8 shadow map slices:
// 8 slices * 3 vertices = 24 output vertices at most.
[maxvertexcount(24)]
void main(
    triangle DepthGsIn input[3],
    inout TriangleStream< DepthPsIn > output
)
{
    for (uint k = 0; k < 8; ++k)
    {
        DepthPsIn element;
        element.RTIndex = k;    // SV_RenderTargetArrayIndex: selects the Texture2DArray slice

        for (uint i = 0; i < 3; ++i)
        {
            // Slope-scaled depth bias along the light direction of slice k.
            float2 shadowSlopeBias = calculateShadowSlopeBias(input[i].normal, -g_cameras[k].world[1]);
            float shadowBias = shadowSlopeBias.y * g_cameras[k].shadowMapParameters.x + g_cameras[k].shadowMapParameters.y;

            element.position = input[i].position + shadowBias * g_cameras[k].world[1];
            element.position = mul(element.position, g_cameras[k].viewProjection);
            element.depth = element.position.z / element.position.w;

            output.Append(element);
        }

        output.RestartStrip();
    }
}

 

With graphics programming it's always important to break down your performance into CPU performance and GPU performance. The CPU and GPU run concurrently with each other, so it's typical that one will take longer than the other to complete a frame. If the GPU is taking longer than the CPU we call it being "GPU-bound", and if it's the other way around we call it being "CPU-bound". It's good practice to build your own in-engine tools for measuring CPU performance with a high-resolution timer, and for measuring GPU performance with timestamp queries. External CPU and GPU profiling tools like PIX, Nvidia Nsight, and VTune can also help you gather the necessary information. Alternatively, you can often quickly determine if you're CPU or GPU-bound through simple experimentation.

So let's now look at your situation specifically. The GS technique that you've used can drastically reduce draw calls (by up to 6x for the cubemap case). Draw calls tend to increase your CPU cost, but often won't have much effect on your GPU cost. In other words, it will probably improve your overall frame time if you're CPU-bound, but isn't likely to help if you're GPU-bound. The bad part about the GS technique (and the main reason why it's infrequently used) is that some aspects of the GS can be difficult to implement on a GPU in an efficient way. In particular, having any kind of geometry amplification (which is what you're doing in your GS) can be really slow, since it doesn't play nicely with processing lots of triangles in parallel (it's especially tricky for GPUs since D3D requires that triangles output by the GS get rasterized in-order). AMD in particular has historically had problems with GS performance, mostly because their implementation has to spill all of the triangles to memory before rasterizing. You may have more luck with GS instancing, but I've never used that myself and I don't know if it's actually more optimal for existing GPUs.
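For reference, a GS-instancing version of the shader above might look roughly like this. This is only a sketch: it reuses the DepthGsIn/DepthPsIn structs and the g_cameras array from the original code and drops the bias logic for brevity. The [instance(8)] attribute runs the GS eight times per input triangle, so each invocation emits a single triangle instead of amplifying to 24 vertices.

// Sketch only: GS instancing. The GS is invoked 8 times per input triangle,
// and each invocation writes just one triangle to one array slice.
[instance(8)]
[maxvertexcount(3)]
void main(
    triangle DepthGsIn input[3],
    uint slice : SV_GSInstanceID,       // which of the 8 invocations this is
    inout TriangleStream< DepthPsIn > output
)
{
    DepthPsIn element;
    element.RTIndex = slice;            // route the triangle to array slice 'slice'

    for (uint i = 0; i < 3; ++i)
    {
        element.position = mul(input[i].position, g_cameras[slice].viewProjection);
        element.depth = element.position.z / element.position.w;
        output.Append(element);
    }
}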

AMD and Nvidia have some "extensions" for their recent GPUs that let you do some neat tricks. Nvidia has their "Fast GS" available through NVAPI, which supports using a viewport mask to "broadcast" a triangle to multiple viewports or RT slices. They actually have a sample that uses this for cascaded shadow maps, but unfortunately the code uses OpenGL and not D3D. Meanwhile, AMD has APIs that let you specify a viewport/RT broadcast mask from the CPU and read the index in the shader to transform the vertices differently (which suggests that it's more of an instancing API).

On 6/2/2018 at 6:33 PM, Royma said:

The traditional way needs 6 passes for a point light, and cascaded shadow mapping needs one pass per cascade, to generate the shadow maps. Recently I learned a method that uses a geometry shader to generate all the shadow maps in one pass: I bind a render target and a depth-stencil buffer that are both Texture2DArrays in DirectX 11. It looks much better than the traditional way, I think. But after I implemented it, I found that cascaded shadow mapping runs much slower than the traditional way; the FPS dropped from 60 to 35. I don't know why. I guess maybe I should do some culling, or maybe the geometry shader is just not efficient.

As MJP already pointed out, this will not help with GPU-limited scenes and it can even perform worse. It sounds like you are GPU-limited after all. But you can optimize this with instancing. Instancing is usually a CPU-side optimization to reduce draw calls, but in this case it can be a GPU performance optimization. The idea is that you reduce the number of triangles emitted by the geometry shader (which is usually the slow part) by creating an instance buffer that contains only instances already culled against the shadow camera frustums, along with the index of the frustum each instance belongs to. In the geometry shader, instead of emitting the triangle for every slice it could belong to and relying on hardware clip-space culling, you emit it only once, transformed by the shadow camera matrix selected by the instance data (see the sketch below). This should already be a good win, I suspect (but of course it depends on your scene).
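A rough sketch of that idea follows, assuming the CPU-side culling writes a per-instance cascade index that the vertex shader forwards to the GS in a hypothetical DepthGsIn.cascadeIndex field (that field and the culling setup are placeholders, not part of the original code).

// Sketch only: each instance was culled on the CPU against one shadow frustum
// and carries the index of that frustum. The GS emits the triangle exactly once.
[maxvertexcount(3)]
void main(
    triangle DepthGsIn input[3],        // assumes DepthGsIn now carries cascadeIndex
    inout TriangleStream< DepthPsIn > output
)
{
    uint slice = input[0].cascadeIndex; // same value for all three vertices

    DepthPsIn element;
    element.RTIndex = slice;            // route to the matching array slice

    for (uint i = 0; i < 3; ++i)
    {
        element.position = mul(input[i].position, g_cameras[slice].viewProjection);
        element.depth = element.position.z / element.position.w;
        output.Append(element);
    }
}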

Also, something I am not sure of: on some GPUs there is an optional feature called "Set VP & RT Array Index from any Rasterizer-feeding Shader" (a hardware capability you can query for), which lets you write the render target array index from the vertex shader and bypass the geometry shader completely. I expect that would also be a nice improvement, but it is not a widely supported feature yet.
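If the hardware reports that capability (in D3D11 it is the VPAndRTArrayIndexFromAnyShaderFeedingRasterizer member of D3D11_FEATURE_DATA_D3D11_OPTIONS3), a GS-free sketch could look roughly like this, again with a hypothetical per-instance cascade index and world matrix in the vertex input and the g_cameras array from above:

// Sketch only: with the optional "VP & RT array index from any shader" feature,
// the vertex shader can write SV_RenderTargetArrayIndex itself, so no geometry
// shader needs to be bound at all. DepthVsIn/DepthVsOut are placeholder names.
struct DepthVsIn
{
    float3   position     : POSITION;
    float4x4 world        : INSTANCE_WORLD;      // per-instance world matrix
    uint     cascadeIndex : INSTANCE_CASCADE;    // from the CPU-culled instance buffer
};

struct DepthVsOut
{
    float4 position : SV_Position;
    float  depth    : DEPTH;
    uint   RTIndex  : SV_RenderTargetArrayIndex; // normally GS-only, allowed here by the cap
};

DepthVsOut main(DepthVsIn vin)
{
    DepthVsOut vout;
    float4 worldPos = mul(float4(vin.position, 1.0f), vin.world);
    vout.RTIndex  = vin.cascadeIndex;
    vout.position = mul(worldPos, g_cameras[vin.cascadeIndex].viewProjection);
    vout.depth    = vout.position.z / vout.position.w;
    return vout;
}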

This topic is closed to new replies.
