
OpenGL compute shader problem

Started by September 07, 2019 05:40 PM
46 comments, last by taby 5 years, 4 months ago
23 hours ago, JoeJ said:

This means only one thread of a wavefront / warp of 64 / 32 threads will do any work, the majority of your GPU will do nothing.

(Maybe you do it just for testing, but it's worth mentioning: unlike with PS, you are responsible for saturating the threads yourself with CS.)

OK, I changed the compute shader to use a local size of 32, but the performance sucks compared to a local size of 1. Does a local size of 1 signify that the compiler should automatically set the local size? That's the only thing I can think of.

The code is nearly finished:

https://github.com/sjhalayka/qjs_compute_shader

One thing left to do is change the output image from RGBA to RED or ALPHA.

11 hours ago, taby said:

OK, I changed the compute shader to use a local size of 32, but the performance sucks compared to a local size of 1.

I guess you do something wrong - this should never happen.

I assume you now accidentally do the same work 32 times, causing 32x the bandwidth for nothing, or something similar.

11 hours ago, taby said:

Does a local size of 1 signify that the compiler should automatically set the local size?

No, the compiler can't do this.

 

I always recommend the compute shader chapter from the OpenGL SuperBible to learn about CS. It's really great and explains all the important things quickly:

Indexing - Work groups, global vs. local thread / work indices (This can be confusing! Pretty sure it's your problem actually.)

Parallel algorithms (scan operations like prefix sums, summed area tables). 

LDS usage (which turns CS into something completely different from 'dumb' pixel shaders)

 

 

Quick example on indexing: imagine we have 1024 work items.

 

Using workgroup size of 1 as you do, it should work like this:

CU0: work index 0 (only one thread out of 64 does any work)

CU1: work index 1 (only one thread out of 64 does any work)

CU2: work index 2 (only one thread out of 64 does any work)

...

 

But what you want is a workgroup size of at least 64:

CU0: work index 0-63 (all threads busy)

CU1: work index 64-127 (all threads busy)

CU2: work index 128-191 (all threads busy)

...

So you get what i mean, and it is simple. But the APIs, and the option to have 2D/3D indexing as well, make it confusing to get right at first.
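To make the indexing above concrete, here is a minimal C++ sketch that mirrors the 1D case on the CPU. The struct and function names are hypothetical and only for illustration; the one thing taken from GLSL itself is the relationship gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.

```cpp
#include <cstdint>

// Hypothetical CPU-side mirror of the GLSL built-ins for a 1D dispatch.
struct Invocation {
    std::uint32_t work_group_id;        // stands in for gl_WorkGroupID.x
    std::uint32_t local_invocation_id;  // stands in for gl_LocalInvocationID.x
};

// Global work index of one invocation, given the work group size
// (local_size_x in the shader).
constexpr std::uint32_t global_index(Invocation inv, std::uint32_t local_size) {
    return inv.work_group_id * local_size + inv.local_invocation_id;
}

// Number of work groups needed to cover `work_items`, rounded up.
constexpr std::uint32_t group_count(std::uint32_t work_items, std::uint32_t local_size) {
    return (work_items + local_size - 1) / local_size;
}
```

With 1024 work items, a work group size of 1 needs 1024 groups, while a size of 64 needs only 16, and the last thread of the third group (group 2, local index 63) handles work index 191.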



I looked through the Red Book, the Blue Book, and the Orange Book and they do explain what the local size does, however, I am stumped as to why a local size of 1 would be faster (like 10x at least) than a local size of 32. Perhaps it's because my graphics processor is very lacklustre (Ryzen 3) in terms of the number of shader units it has?

I finished the code:

https://github.com/sjhalayka/qjs_compute_shader

Can anyone download and compile the code to see if the same thing happens on a different graphics processor?


The thing of it is that there is a loop in the shader, and the loop is iterated up to 8 times, but sometimes less. So sometimes a local group gets delayed while waiting for the iterations to complete. Setting the local size to 1 somehow gets rid of that delay, or at least reduces it to a minimum.

When increasing the workgroup size to 8*8, do you also decrease the dispatch size accordingly? Like:

glDispatchCompute((GLuint)tex_w / 8, (GLuint)tex_h / 8, 1);

layout(local_size_x = 8, local_size_y = 8) in;

I guess something like this is the problem. I can't compile it myself easily because of missing libs. But Ryzen 3 has a GCN GPU and i'm sure that can't be the issue. (It's been years since i've used OpenGL compute, so i'm not sure the above code is correct - some APIs take the number of workgroups, others the number of threads for the dispatch.)

2 hours ago, taby said:

The thing of it is that there is a loop in the shader, and the loop is iterated up to 8 times, but sometimes less. So sometimes a local group gets delayed while waiting for the iterations to complete. Setting the local size to 1 somehow gets rid of that delay, or at least reduces it to a minimum.

Very unlikely. Even if only one thread per CU has 8 iterations while all others have 1, utilizing all threads should be much faster.
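A rough way to see why: the sketch below is a deliberately simplified cost model, not a measurement. It assumes each work group maps to exactly one wavefront, that all lanes of a wavefront run in lockstep (so a wavefront takes as long as its slowest lane), and that the group size is at most the wavefront width. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified model: sum, over all work groups, of the longest loop in each
// group. `iterations[i]` is the loop count of work item i; `group_size`
// (> 0) is the number of work items packed into one wavefront.
std::size_t wavefront_steps(const std::vector<int>& iterations, std::size_t group_size) {
    std::size_t steps = 0;
    for (std::size_t begin = 0; begin < iterations.size(); begin += group_size) {
        const std::size_t end = std::min(begin + group_size, iterations.size());
        const auto first = iterations.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last = iterations.begin() + static_cast<std::ptrdiff_t>(end);
        // The whole wavefront waits for its slowest lane.
        steps += static_cast<std::size_t>(*std::max_element(first, last));
    }
    return steps;
}
```

For 64 work items where one needs 8 iterations and the rest need 1, this model gives 71 wavefront steps with a group size of 1 (8 + 63) but only 8 with a group size of 64, so the divergence penalty is far smaller than the cost of leaving 63 lanes idle.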

Maybe the entire workload is so small some kind of overhead dominates, which causes strange performance, but i doubt it.

What i sometimes do is increment a global atomic counter from each thread to be sure the number of threads is as intended.
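For such a check it helps to know what the counter should read afterwards. A minimal sketch of that arithmetic follows, assuming a 2D dispatch with the z dimension left at 1; the struct and function names are hypothetical, and the 512 x 512 image size in the note below is an assumed example, not taken from the project.

```cpp
#include <cstdint>

// Hypothetical description of one dispatch (z dimension assumed to be 1).
struct Dispatch {
    std::uint64_t groups_x, groups_y;  // arguments passed to glDispatchCompute
    std::uint64_t local_x, local_y;    // local_size_x / local_size_y in the shader
};

// Value a global atomic counter should hold if every invocation
// increments it exactly once.
constexpr std::uint64_t expected_invocations(const Dispatch& d) {
    return d.groups_x * d.groups_y * d.local_x * d.local_y;
}
```

For a 512 x 512 image, dispatching 512 x 512 groups with a local size of 1 gives 262144 invocations, one per pixel. Raising the local size to 32 without shrinking the dispatch gives 8388608, which is 32 times too many; dispatching 16 x 512 groups instead brings it back to 262144.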

 

Good news. Thank you again for all of your help, man. I changed it up so that the local size is 32, and I reduced the values passed into glDispatchCompute by a factor of 32, and it works... for the most part. There is a problem though, as it puts half of a border into the output:

out.tga

Note that the problem occurs only when the texture size is a non-power of two.

6 hours ago, taby said:

Note that the problem occurs only when the texture size is a non-power of two.

Yeah, that's a typical problem that requires adding range checks. But they are only necessary if the image dimension is not a multiple of the workgroup dimension (so power of two is not exactly the requirement).
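The usual pattern is to round the number of work groups up and let the surplus threads exit early. Below is a minimal CPU-side sketch of that idea in C++, assuming an 8 x 8 work group; the function names are hypothetical and the loops only stand in for what the GPU would run in parallel.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed work group size, matching layout(local_size_x = 8, local_size_y = 8).
constexpr std::uint32_t kLocalSize = 8;

// Work groups needed to cover `pixels` along one axis, rounded up.
constexpr std::uint32_t groups_round_up(std::uint32_t pixels) {
    return (pixels + kLocalSize - 1) / kLocalSize;
}

// Hypothetical CPU stand-in for the dispatch: visits every invocation and
// returns how many pixels of a tex_w x tex_h image were never written.
std::size_t unwritten_pixels(std::uint32_t tex_w, std::uint32_t tex_h, bool round_up) {
    const std::uint32_t groups_x = round_up ? groups_round_up(tex_w) : tex_w / kLocalSize;
    const std::uint32_t groups_y = round_up ? groups_round_up(tex_h) : tex_h / kLocalSize;
    std::vector<bool> written(static_cast<std::size_t>(tex_w) * tex_h, false);

    for (std::uint32_t gy = 0; gy < groups_y; ++gy)
    for (std::uint32_t gx = 0; gx < groups_x; ++gx)
    for (std::uint32_t ly = 0; ly < kLocalSize; ++ly)
    for (std::uint32_t lx = 0; lx < kLocalSize; ++lx) {
        const std::uint32_t x = gx * kLocalSize + lx;  // like gl_GlobalInvocationID.x
        const std::uint32_t y = gy * kLocalSize + ly;  // like gl_GlobalInvocationID.y
        if (x >= tex_w || y >= tex_h) continue;        // range check: skip surplus threads
        written[static_cast<std::size_t>(y) * tex_w + x] = true;
    }

    std::size_t missing = 0;
    for (bool w : written) {
        if (!w) ++missing;
    }
    return missing;
}
```

For a 20 x 12 image, truncating division dispatches 2 x 1 groups and leaves 112 pixels unwritten along the right and bottom edges, while rounding up dispatches 3 x 2 groups and covers everything. In the shader itself, the equivalent check would presumably compare gl_GlobalInvocationID.xy against the image size and return early.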

Glad it works now :)

On 9/11/2019 at 10:48 PM, JoeJ said:

Can't compile myself easily because of missing libs.

Are GLUT and GLEW really that bad, that you don't have them installed? Do you know of any alternatives?

I would look at GLFW - it handles both windows and extensions, and it also supports Vulkan if you consider using it in the future.

But i have never tried it myself yet, and as long as GLUT and GLEW work for you i don't see a need to change. (I was just too lazy to download / set them up for the 10th time... :) )

 

BTW, the move to VK made quite a difference for me, as i work almost only on compute shaders on GPU.

I started with OpenGL, maybe a year after CS were introduced, but performance on NV was really terrible, only AMD was fine.

So i moved to OpenCL1, which was (surprisingly) twice as fast on NV and also 10% faster on AMD back then.

VK then gave me another speedup of almost two, because of static command buffers (missing from GL) and indirect dispatch (missing from CL). 

However, i don't know how much OpenGL performance has improved after all those years, and VK is no fun ;)


What are static command buffers?

What's the point of using glDispatchComputeIndirect?

This topic is closed to new replies.
