
local_size_x, etc. in OpenGL compute shader

Started by taby, April 21, 2020 08:09 PM
69 comments, last by NikiTo 4 years, 9 months ago

I have written a compute shader that calculates a quaternion Julia set. The strange thing is that it works best when I state: layout(local_size_x = 1, local_size_y = 1) in;

I tried bumping it up to 32x32, but I get a decrease in performance. This is on AMD and Intel. Is there any rhyme or reason behind this? Also, is there such a thing as inline functions in GLSL?

The shader as a whole looks like this:

#version 430 core

layout(local_size_x = 1, local_size_y = 1) in;
layout(binding = 0, r32f) writeonly uniform image2D output_image;
layout(binding = 1, rgba32f) readonly uniform image2D input_image;
uniform vec4 c;
uniform int max_iterations;
uniform float threshold;

// quaternion helper functions (qcopy, qadd, qmul, qsin) go here

vec4 iter_func(vec4 z)
{
    vec4 A0 = vec4(0, 0, 0, 0);
    vec4 A1 = vec4(0, 0, 0, 0);
    vec4 A2 = vec4(0, 0, 0, 0);
    vec4 S2_0 = vec4(0, 0, 0, 0);
    vec4 S2_1 = vec4(0, 0, 0, 0);
    vec4 S2_2 = vec4(0, 0, 0, 0);
    A0 = qcopy(z);
    A1 = qcopy(z);
    S2_0 = qsin(A0);
    S2_1 = qsin(A1);
    S2_2 = qmul(c, S2_1);
    S2_0 = qadd(S2_0, S2_2);
    A2 = qcopy(S2_0);
    z = qcopy(A2);

    return z;
}

float iterate(vec4 z)
{
    float threshold_sq = threshold*threshold;

    float len_sq = dot(z, z);

    for(int i = 0; i < max_iterations; i++)
    {
        z = iter_func(z);

        if((len_sq = dot(z, z)) >= threshold_sq)
            break;
    }

    return sqrt(len_sq);
}

void main()
{
    const ivec2 pixel_coords = ivec2(gl_GlobalInvocationID.xy);
    vec4 z = imageLoad(input_image, pixel_coords);
    const float magnitude = iterate(z);
    const vec4 output_pixel = vec4(magnitude, 0, 0, 0);
    imageStore(output_image, pixel_coords, output_pixel);
}

Blatant cross post: Good news! Someone with an nVidia GPU ran my app and it works great. So the problem is the AMD driver! How does one even go about talking to AMD about such a thing, I wonder? Thanks again for your help, everyone.


taby said:
The strange thing is that it works best when I state: layout(local_size_x = 1, local_size_y = 1) in;

So 63 out of 64 threads of each 64-wide AMD wavefront do nothing. This should never happen; at that point it would probably be faster on the CPU. There must be something wrong…

taby said:
Also, is there such a thing as inline functions in GLSL?

No. I ended up inlining everything manually myself. Initially, when working on a shader, I often saw better performance with function calls, but after optimizing, inlining always ended up faster.

I propose you try to inline everything and see how it changes performance, including your q*** ops, even if it's tedious.
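As a rough sketch only (assuming the q*** helpers do what their names suggest, so that iter_func computes z = sin(z) + c * sin(z)): folding iter_func into the loop removes the call and all six vec4 temporaries, and evaluates qsin only once per iteration:

float iterate(vec4 z)
{
    float threshold_sq = threshold * threshold;
    float len_sq = dot(z, z);

    for(int i = 0; i < max_iterations; i++)
    {
        // iter_func folded in: z = sin(z) + c * sin(z)
        vec4 s = qsin(z);
        z = qadd(s, qmul(c, s));

        len_sq = dot(z, z);
        if(len_sq >= threshold_sq)
            break;
    }

    return sqrt(len_sq);
}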

My current assumption is: your shader uses far too many registers and has to spill them to video memory, and that is what makes it so slow. When using only one thread, the values can stay in the register file, so it is faster.
There seems to be nothing wrong with your code, but compilers are not as clever as people think, especially on GPU. Changing things to indirectly nudge the compiler toward doing something different often helps.
It could be, for example, that the compiler creates 100 unrolled iterations and wastes registers on each of them. That makes little sense, but I have seen things like this happen.
You need a profiling application to check register usage, LDS allocation, and possibly the assembly code. AMD Radeon GPU Profiler is probably a good choice, or RenderDoc.
Without a profiling app, coding on the GPU is like working blindfolded. It is a must-have.

Can you post the q*** implementation, or a GitHub link?

taby said:
So the problem is the AMD driver! How does one even go about talking to AMD about such a thing, I wonder?

The AMD devgurus forum site I linked yesterday. They usually request a repro case, take a look at it, and fix the bug, if there is one. But you should try harder yourself first before you contact AMD.

I also propose you make max_iterations a compile-time constant. Maybe there are proper unroll pragmas in GLSL by now? If so, try to force the compiler to NOT unroll anything; if not, I think the NV pragma also works on AMD.
(See how unrolling can cause an issue similar to yours: https://stackoverflow.com/questions/18557694/glsl-shader-not-unrolling-loop-when-needed )
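A minimal sketch of both ideas (the value 100 and the define name are placeholders; #pragma optionNV is an NVIDIA extension that other drivers may honor or silently ignore):

#version 430 core

#pragma optionNV(unroll none)  // NVIDIA unroll control; may be ignored elsewhere

#define MAX_ITERATIONS 100     // compile-time bound replacing: uniform int max_iterations;

Then the loop in iterate() becomes for(int i = 0; i < MAX_ITERATIONS; i++), and the compiler knows the trip count up front.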

taby said:
I tried bumping it up to 32x32

That's a workgroup of 1024 threads, which is too large for good occupancy. (Profiling tools also show occupancy.)

Usually the sweet spots are 64, 128, or 256 threads. So try 16 × 16, but I don't think that's the reason for your problem.
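For reference, that change is one line in the shader; the host-side glDispatchCompute group counts have to shrink to match (the width/height names here are assumptions):

layout(local_size_x = 16, local_size_y = 16) in;  // 256 threads per workgroup

// host side, for dimensions that are multiples of 16:
// glDispatchCompute(width / 16, height / 16, 1);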

Thanks for the guidance. I will try the 16*16 sweet spot ASAP.

When compiling the shader, DX12 warns me if I use too many registers. Doesn't OpenGL warn you?


For OpenGL to warn, you would need a core and debug context (≥ OpenGL 4.3). With the current mixture of OpenGL 1 to 4 that will not work, because anything pre-3.x always needs compatibility mode. I do not know whether GLUT offers these possibilities, because its development stopped before they were introduced (I hope I am not lying here).

More effort could be spent on error checking in general. I believe (but do not know) that some of the problems come from the “dog's breakfast”, as @taby named it :-) But hey, OpenGL must be pretty robust if all that still works together ;-)

Seriously, of course there would be a much better chance of proper debugging if all components worked at the same version/level. A debug context can (I mean “should”) be introduced with something like GLFW, where it is just a switch, an availability check, and two function definitions.

NikiTo said:

When compiling the shader, DX12 warns me if I use too many registers. Doesn't OpenGL warn you?

I'm not sure how to retrieve warnings. I do check for errors, and spit out the error buffer to the console.

You must create a debug context and register the debug message callback. This works from the 4.3 core profile onwards. See the Red Book pp. 863ff., learnopengl.com, and https://www.khronos.org/opengl/wiki/Debug_Output
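A minimal C sketch (assuming a 4.3+ debug context is already current and a loader such as glad or GLEW provides the entry points):

#include <stdio.h>

// Signature matches GLDEBUGPROC from the GL headers.
static void APIENTRY debug_callback(GLenum source, GLenum type, GLuint id,
                                    GLenum severity, GLsizei length,
                                    const GLchar* message, const void* user_param)
{
    // Print everything the driver reports, including performance warnings.
    fprintf(stderr, "GL debug: %s\n", message);
}

// After creating the debug context:
glEnable(GL_DEBUG_OUTPUT);
glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS);  // report at the offending call, not later
glDebugMessageCallback(debug_callback, NULL);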

GLFW creates a debug context for you on request. See the GLFW documentation.
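With GLFW it is just a set of window hints before glfwCreateWindow, e.g.:

glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 4);
glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);
glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
glfwWindowHint(GLFW_OPENGL_DEBUG_CONTEXT, GLFW_TRUE);  // request a debug context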

Dang, there we go with GLFW again. For my next project I will use dear imgui and GLFW. For now, I will stick to freeglut one last time. Sniff sniff. :(

This topic is closed to new replies.
