
local_size_x, etc. in OpenGL compute shader

Started April 21, 2020 08:09 PM
69 comments, last by NikiTo 4 years, 9 months ago

taby said:
In js_state_machine.cpp there is a statement: string code = eqparser.emit_compute_shader_code(1, 1, fsp.max_iterations);

OK, then I think I know what your mistake is: there is no further tracking of those numbers (1,1), so the compute shader can only be dispatched once per cell or pixel, even if I set the local size to (16,16). So it does more work for nothing. (It's 0.3 vs. 0.8 seconds for me.)
To fix this, you need to dispatch it once per 16 * 16 block of cells or pixels.

Can't remember the API for this, and ofc. any cell indexing needs to be adjusted as well. I'll try to make this work… (I assume you run the shader on N image slices of an N*N*N 3D volume.)

It's glDispatchCompute. I've tried it with 16,16:
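Roughly like this (a minimal sketch, not the project's exact code; 'res' stands for the slice resolution here, and the shader is assumed to declare layout(local_size_x = 16, local_size_y = 16) in;):

    // one workgroup now covers a 16x16 tile, so divide the resolution by the
    // local size and round up, so that border cells are still covered
    const GLuint local_size = 16;
    const GLuint groups = (res + local_size - 1) / local_size;
    glDispatchCompute(groups, groups, 1);
    // make the writes visible before readback (the exact barrier bit depends
    // on how the results are actually read back)
    glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

The shader then also needs a bounds check on gl_GlobalInvocationID, because the rounded-up dispatch can overshoot the image.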

It works, but at resolution 100 using 1,1 is still faster, and at resolution 256 there is no difference.

I think the runtime of the shader is small in comparison to downloading the data, but still, it should not end up slower. I'm puzzled. Looks like a driver issue, or the compiler going bad when more threads are active.
I'll try to make a simpler shader…

 

About the DOF, I think it makes no sense, because we focus on this single object and so want to see it sharply. But maybe some bloom would be nice for the really bright spots.

 


I changed the CS to calculate a sphere:
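Something like this (a reconstruction of the idea, since the snippet itself didn't survive here; the binding, the uniform, and the [-1.5, 1.5] domain are my assumptions):

    // trivial replacement kernel: distance from the origin instead of the
    // Julia iteration, written one 2D slice at a time
    const std::string sphere_cs = R"(
    #version 430
    layout(local_size_x = 16, local_size_y = 16) in;
    layout(binding = 0, r32f) writeonly uniform image2D out_slice;
    uniform float z_coord; // the current slice's z, in the same range as x and y

    void main()
    {
        ivec2 p = ivec2(gl_GlobalInvocationID.xy);
        ivec2 dim = imageSize(out_slice);

        if (p.x >= dim.x || p.y >= dim.y)
            return; // guard against the rounded-up dispatch

        vec2 xy = (vec2(p) / vec2(dim)) * 3.0 - 1.5; // map to [-1.5, 1.5]
        imageStore(out_slice, p, vec4(length(vec3(xy, z_coord))));
    }
    )";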

So it does pretty much nothing.

But runtime is almost the same as using the fractal. This time, 16,16 is 0.01 seconds faster than 1,1 : )

I assume the bottleneck on the GPU side is downloading hundreds of image slices. I bet you could just calculate the fractal on the CPU and it would be faster.

Maybe processing the whole volume and downloading it just once would be quite a bit faster, but AGP should still be the bottleneck.
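For example, along these lines (a sketch under my own assumptions: a GL_R32F 3D texture 'tex' of size res^3, and a shader with an 8x8x8 local size writing through an image3D; none of this is taby's actual code):

    // fill the whole volume in one dispatch...
    glBindImageTexture(0, tex, 0, GL_TRUE, 0, GL_WRITE_ONLY, GL_R32F);
    const GLuint local_size = 8;
    const GLuint groups = (res + local_size - 1) / local_size;
    glDispatchCompute(groups, groups, groups);
    glMemoryBarrier(GL_TEXTURE_UPDATE_BARRIER_BIT); // writes visible to glGetTexImage

    // ...then read it back with a single transfer instead of one per slice
    std::vector<float> volume(size_t(res) * res * res);
    glBindTexture(GL_TEXTURE_3D, tex);
    glGetTexImage(GL_TEXTURE_3D, 0, GL_RED, GL_FLOAT, volume.data());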

This does not explain the slowdown from using more threads, but at least we see it's pointless to optimize CS performance. And without accurate timings of the time spent in the CS, we may accidentally confuse some other driver inefficiency with bad CS performance.
So I would just use 16,16 to avoid being bullied for having a 1,1 project on GitHub :D
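If you want those accurate timings, a standard GL timer query around the dispatch separates the CS time from the download time (a sketch, with 'groups' as before):

    GLuint query;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    glDispatchCompute(groups, groups, 1);
    glEndQuery(GL_TIME_ELAPSED);

    GLuint64 ns = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns); // waits for the GPU
    printf("CS time: %.3f ms\n", ns / 1.0e6);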

There is code at: https://github.com/sjhalayka/julia4d3/raw/master/multithreaded_gpu_2.zip

This code is the latest version of the multi-threaded code. It allows you to use GPU or CPU mode. For sets where res = 500, the generation is like 2-5x faster on the GPU than on the CPU alone. I find this GPU mode to be a little slower than the GPU mode in the single-threaded code that does calculations in bursts (the one you're using now). In other words, the GPU acceleration works very well. I was hoping for like 10-100x speedup, but c'est la vie.

taby said:
I was hoping for like 10-100x speedup, but c'est la vie.

I usually get 50 - 70, depending on HW ofc.

To get there, I think you would need to generate the mesh on the GPU as well, to get rid of the huge downloads. But this means increased complexity, e.g.:

One single CS to generate a 16^3 block of volume, with 1 cell of overlap on each side. Store the density as half floats if possible (or even as 8-bit bytes); then it fits into 2 kB of LDS, otherwise it's 4 kB. The workgroup would be 1024 threads large, but because you do not read much from memory, there is no need to hide memory latency, so the low occupancy should not hurt. (A smaller workgroup would result in calculating more cells twice, because it causes more overlap.)
Then the same CS generates the mesh. You should see if this can fit into the remaining amount of LDS (IIRC a 1024-thread WG can reserve 32 kB). When done, ‘allocate’ the necessary main memory with a single atomic add, and write the mesh in one go as a linear copy from LDS to main RAM. (The memory writes will hurt your performance the most, because at low occupancy the GPU cannot switch to another wavefront while waiting on the memory operations. That's why I propose to create the mesh in LDS first.) A sketch of that allocation step follows below.
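The relevant fragment of such a shader could look like this (my own sketch of the pattern; the bindings and the MAX_TRIS_PER_GROUP budget are made up):

    const std::string alloc_snippet = R"(
    #define MAX_TRIS_PER_GROUP 512 // made up; must keep the array within the LDS budget

    layout(std430, binding = 1) buffer TriangleCounter { uint triangle_count; };
    layout(std430, binding = 2) buffer Triangles { vec4 verts[]; }; // 3 verts per tri

    shared vec4 lds_verts[3 * MAX_TRIS_PER_GROUP]; // mesh is built in LDS first
    shared uint lds_tri_count;                     // zeroed by thread 0 up front
    shared uint base;

    void write_out_mesh() // called once all triangles sit in lds_verts
    {
        barrier();
        if (gl_LocalInvocationIndex == 0u)
            base = atomicAdd(triangle_count, lds_tri_count); // one 'alloc' per group
        barrier();

        // linear copy from LDS to main memory, all 1024 threads cooperating
        for (uint i = gl_LocalInvocationIndex; i < 3u * lds_tri_count; i += 1024u)
            verts[3u * base + i] = lds_verts[i];
    }
    )";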

To keep it simple, generate pure triangles with no edge connectivity. (If you need that, I would do it on the CPU after downloading the triangle mesh.)

Notice you can use a 1D workgroup (1024,1,1); you do not have to use (16,16,16) just because you work on a volume. The former often makes any indexing logic easier to implement.
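For example, the index math could look like this (a sketch, assuming one 16^3 block per workgroup):

    const std::string indexing = R"(
    layout(local_size_x = 1024) in;

    void main()
    {
        // 16^3 = 4096 cells per block, so each of the 1024 threads handles 4
        for (uint cell = gl_LocalInvocationIndex; cell < 4096u; cell += 1024u)
        {
            uvec3 local_cell = uvec3(cell & 15u, (cell >> 4u) & 15u, cell >> 8u);
            uvec3 global_cell = gl_WorkGroupID * 16u + local_cell;
            // ... evaluate the density at global_cell ...
        }
    }
    )";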
I do not really know about the complexity of the Marching Cubes algo, but I have my own isosurface algorithm which would fit into a single shader together with the fractal generation, so I guess this approach would work for you.

BTW, I thought about reasons for the slowdown.

In my own CS project, which is about realtime GI, I recently noticed the new GPU is so bored it stays clocked at 170 MHz. I think this is because I do heavy downloads each frame as well, for display and debug reasons, so the shader cores are mostly idle.
Using only 1 thread per group causes many more wavefronts to be active, and maybe this triggers a higher clock.

Another possible reason is time spent on compiling the shader.

So it does not have to be any kind of driver issue or bug at all.


Yeah, I have a Vega 3 LOL. My mom's adding machine is pretty much faster than a Vega 3.

Thanks for all of your help and time investigating the code.

I updated the code. It now includes a Use GPU checkbox. On my Vega 3 the GPU-based and CPU-based code are about equal in speed. :(

@taby When you change the worksize from 1 to 16x16, do you change the code of the shader too?

In my case, changing any of the 6 dimensions of the workload requires the shader's code to be rewritten.

You could be repeatedly computing the same work 16x16 times. Is this your case?
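That is, if the shader indexes by gl_WorkGroupID alone (my own illustration, not taby's code), all 16x16 invocations of a group repeat the very same cell:

    // wrong: every one of the 16x16 invocations computes the same cell
    // ivec2 p = ivec2(gl_WorkGroupID.xy);

    // right: each invocation gets a cell of its own
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);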

Sorry, I'm really too lazy to read other people's code. I just read the weird symptoms you give here, and wonder.

Making only one thread work is wrong. If I were you, I would start by plainly deleting that. Just put in some worksize that is a multiple of the wavesize to start with, and recode your shader. Stop comparing the timings. Making only one thread work on the GPU is plain WRONG! Delete it and start over fresh.
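For instance (my suggestion of a concrete starting point, not a value from the thread):

    // 8 x 8 = 64 invocations per group: a multiple of NVIDIA's warp size (32)
    // and exactly AMD's GCN/Vega wavefront size (64)
    const char cs_layout[] = "layout(local_size_x = 8, local_size_y = 8) in;";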

This topic is closed to new replies.
