Replacing glCopyImageSubData

Started by November 23, 2022 11:24 PM
22 comments, last by taby 2 years, 1 month ago

I'm currently using glCopyImageSubData to copy from one texture to another. It works fine, but I'm trying to replace it with the CPU round trip shown after it below. The replacement doesn't work: it fails on the copy back to the GPU.

	glCopyImageSubData(glowmap_tex, GL_TEXTURE_2D, 0, 0, 0, 0,
		last_frame_glowmap_tex, GL_TEXTURE_2D, 0, 0, 0, 0,
		win_x, win_y, 1);

	vector<float> output_pixels(win_x* win_y * 4, 1.0f);
	glActiveTexture(GL_TEXTURE4);
	glBindTexture(GL_TEXTURE_2D, glowmap_tex);
	glBindImageTexture(4, glowmap_tex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F); // first argument is an image unit index, not a GL_TEXTUREn enum
	glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_FLOAT, &output_pixels[0]);

	vector<float> last_frame_output_pixels(win_x* win_y * 4, 1.0f);
	glActiveTexture(GL_TEXTURE4);
	glBindTexture(GL_TEXTURE_2D, last_frame_glowmap_tex);
	glBindImageTexture(4, last_frame_glowmap_tex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F); // first argument is an image unit index, not a GL_TEXTUREn enum
	glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_FLOAT, &last_frame_output_pixels[0]);

	vector<float> combined_output_pixels(win_x* win_y * 4, 1.0f);

	for (int x = 0; x < win_x; x++)
	{
		for (int y = 0; y < win_y; y++)
		{
			size_t index = 4 * ((y * win_x) + x);

			combined_output_pixels[index + 0] = output_pixels[index + 0];// +last_frame_output_pixels[imgIdx + 0];
			combined_output_pixels[index + 1] = output_pixels[index + 1];// +last_frame_output_pixels[imgIdx + 1];
			combined_output_pixels[index + 2] = output_pixels[index + 2];// +last_frame_output_pixels[imgIdx + 2];
			combined_output_pixels[index + 3] = output_pixels[index + 3];// +last_frame_output_pixels[imgIdx + 3];
		}
	}
	
	// The following doesn't work, and I don't know why
	glActiveTexture(GL_TEXTURE4);
	glBindTexture(GL_TEXTURE_2D, last_frame_glowmap_tex);
	glBindImageTexture(4, last_frame_glowmap_tex, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA32F);
	glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, win_x, win_y, 0, GL_RGBA, GL_FLOAT, &combined_output_pixels[0]);

Any ideas?

taby said:
Any ideas?

No, but you could work on the proper solution instead of spending time on a workaround.
Uploading to and downloading from the GPU is very expensive, and single-threaded image processing on the CPU is slow too. You want to do the whole thing on the GPU alone.

I've tried all of the other solutions, without luck. This is, of course, only a test, temporary.

OK, I put the shader at https://github.com/sjhalayka/obj_ogl4/blob/aed9d040cb09b2db67a07ebae17a7dafb5a68846/ortho_reflectance.fs.glsl#L44 and the C++ code at https://github.com/sjhalayka/obj_ogl4/blob/aed9d040cb09b2db67a07ebae17a7dafb5a68846/main.cpp#L472

All I need is a little guidance. Do I need to render to a texture?

taby said:
Do I need to render to a texture?

No. The easiest and fastest way should be to use a compute shader, reading and writing texels directly.
A compute shader (CS) has advantages if you do complex image processing. Say we want depth-aware DOF: you can load a tile of the texture into LDS memory (on-chip memory shared by the threads of one work group).
Then all threads can access this cached image data without having to access VRAM at all.
After the complex processing is done, you write back to VRAM. So you access VRAM only twice.
In contrast, a pixel shader would need to sample the texture from VRAM constantly in its inner loop. Those reads are likely cached as well, but in many cases the LDS approach will win.
A downside, however, is that you cannot use the texture filtering hardware on LDS memory, so if you need texture filtering, the LDS approach becomes less attractive.

On the API side there are some caveats:
The texture needs to be uncompressed.
You likely need two textures: one to read and another to write the results. (Otherwise threads would randomly read either the original values or values already changed by other threads.)
The driver must know which textures are read-only or writable for which shaders, so that barriers, synchronization, and resource transitions can be handled.
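
The two-texture caveat can be illustrated on the CPU. This is only a minimal sketch, assuming a hypothetical 1D kernel in which every element reads its left neighbour (the function names are made up for illustration and do not come from the thread). With separate input and output buffers every read sees original data, while the in-place version picks up values that were already overwritten. A CPU loop at least runs in a fixed order. GPU threads do not, so there the result could differ from run to run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: each element becomes the sum of itself and its left
// neighbour (the first element has no left neighbour).

// Two buffers: reads come only from `src`, writes go only to `dst`.
std::vector<float> sum_with_left_two_buffers(const std::vector<float>& src)
{
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = src[i] + (i > 0 ? src[i - 1] : 0.0f);
    return dst;
}

// One buffer: element i reads a neighbour that this loop has already
// overwritten, so the result depends on the processing order.
void sum_with_left_in_place(std::vector<float>& data)
{
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = data[i] + (i > 0 ? data[i - 1] : 0.0f);
}
```

For the input {1, 2, 3, 4} the two-buffer version yields {1, 3, 5, 7}, while the in-place loop yields {1, 3, 6, 10}.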

Maybe this tutorial gives all the details needed: https://learnopengl.com/Guest-Articles/2022/Compute-Shaders/Introduction


OK, first off: you're a certifiable genius at graphics programming!

I see your solution now:

  • Pass two textures into the compute shader
  • Accumulate them in the shader, or whatever
  • Write to temporary texture at the end of the shader
  • Copy finalized temporary texture to last frame's texture using glCopyImageSubData

Does this sound reasonable?

And here I was concocting a spell using the arcane FBO and all that stuff. LOL

P.S. I have a compute shader example on my GitHub too: https://github.com/sjhalayka/qjs_compute_shader

taby said:
I see your solution now:
  • Pass two textures into the compute shader
  • Accumulate them in the shader, or whatever
  • Write to temporary texture at the end of the shader
  • Copy finalized temporary texture to last frame's texture using glCopyImageSubData
Does this sound reasonable?

I think you don't need a temporary texture. Because you have no kernel reading adjacent pixels, each pixel is accessed by exactly one thread. (For the same reason, there is no point in caching to LDS either. A pixel shader could do it too, but it requires drawing a cumbersome triangle that covers the texture rectangle, just to map threads to all texels.)

So this should already work: prevTex[i] = prevTex[i] + curTex[i];

But I'm not sure if we can read and write to the same texture. I guess so, but I have rarely worked with textures, so I don't know.
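
Why a pure per-pixel accumulation has no ordering problem can be shown on the CPU. This is a minimal sketch with plain float arrays standing in for the two textures (the function names are hypothetical). Every element is read and written by exactly one iteration, so the result does not depend on the order in which elements are processed. The sketch says nothing about what the OpenGL API permits for a texture that is bound for reading and writing at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// prev[i] = prev[i] + cur[i], front to back.
void accumulate_in_place(std::vector<float>& prev, const std::vector<float>& cur)
{
    assert(prev.size() == cur.size());
    for (std::size_t i = 0; i < prev.size(); ++i)
        prev[i] += cur[i];
}

// Same operation, back to front. Each element depends only on itself,
// so the processing order cannot change the result.
void accumulate_in_place_reversed(std::vector<float>& prev, const std::vector<float>& cur)
{
    assert(prev.size() == cur.size());
    for (std::size_t i = prev.size(); i-- > 0;)
        prev[i] += cur[i];
}
```

Both functions produce the same output for the same input, which is the property that makes a temporary buffer unnecessary for this particular operation.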

Btw, just saw ‘slow’ code here:

	for (int x = 0; x < win_x; x++)
	{
		for (int y = 0; y < win_y; y++)
		{
			size_t index = 4 * ((y * win_x) + x);

			combined_output_pixels[index + 0] = output_pixels[index + 0];// +last_frame_output_pixels[imgIdx + 0];

Notice this processes the image ‘vertically’: the inner loop walks down a column, so the stride between consecutive memory accesses is big (a whole row).
It should be faster if you swap the loops:

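	for (int y = 0; y < win_y; y++)
	{
		for (int x = 0; x < win_x; x++)
		{
			// Consecutive x values are adjacent in memory (row-major layout)
			size_t index = 4 * ((y * win_x) + x);

			combined_output_pixels[index + 0] = output_pixels[index + 0];// +last_frame_output_pixels[index + 0];
			combined_output_pixels[index + 1] = output_pixels[index + 1];// +last_frame_output_pixels[index + 1];
			combined_output_pixels[index + 2] = output_pixels[index + 2];// +last_frame_output_pixels[index + 2];
			combined_output_pixels[index + 3] = output_pixels[index + 3];// +last_frame_output_pixels[index + 3];
		}
	}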

Now we process horizontally: the stride is one pixel, and the access pattern is ideal.
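
As a self-contained version of the row-major loop, here is a minimal sketch. It assumes that the commented-out addition of the last frame is what is eventually wanted, and the function name and signature are hypothetical. The flat index 4 * (y * width + x) grows by 4 floats with every step of the inner x loop, so memory is walked front to back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds two tightly packed RGBA float images of the same size.
// Outer loop over rows, inner loop over columns: consecutive iterations
// touch consecutive memory.
std::vector<float> combine_rgba_row_major(const std::vector<float>& current,
                                          const std::vector<float>& last_frame,
                                          std::size_t width, std::size_t height)
{
    assert(current.size() == width * height * 4);
    assert(last_frame.size() == current.size());

    std::vector<float> combined(current.size());
    for (std::size_t y = 0; y < height; ++y)
    {
        for (std::size_t x = 0; x < width; ++x)
        {
            const std::size_t index = 4 * (y * width + x);
            for (std::size_t c = 0; c < 4; ++c) // R, G, B, A
                combined[index + c] = current[index + c] + last_frame[index + c];
        }
    }
    return combined;
}
```

In the thread's terms, this would be called with output_pixels, last_frame_output_pixels, win_x and win_y.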

Thank you again for all of the advice. I tried avoiding a temporary texture, but it's not working. I also tried using a temporary texture, with the same result: temp_tex ends up with all 0s.

Here is the C++ code now! So simple, thanks to you.


	glowmap_copier.use_program();

	glActiveTexture(GL_TEXTURE0);
	glBindTexture(GL_TEXTURE_2D, last_frame_glowmap_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "output_image"), 0);

	// activate glow and last frame glow input textures
	glActiveTexture(GL_TEXTURE1);
	glBindTexture(GL_TEXTURE_2D, glowmap_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "inputa_image"), 1);

	glActiveTexture(GL_TEXTURE2);
	glBindTexture(GL_TEXTURE_2D, last_frame_glowmap_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "inputb_image"), 2);

	// call compute shader
	glDispatchCompute((GLuint)win_x, (GLuint)win_y, 1);

	// Order the shader's image writes before later image accesses (a memory barrier, not a CPU-side wait)
	glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

The shader is:

// OpenGL 4.3 introduces compute shaders
#version 430

layout(local_size_x = 1, local_size_y = 1) in;

// Four-channel (RGBA) float output image, followed by the two input images
layout(binding = 0, rgba32f) writeonly uniform image2D output_image;
layout(binding = 1, rgba32f) readonly uniform image2D inputa_image;
layout(binding = 2, rgba32f) readonly uniform image2D inputb_image;


void main()
{
	// Get global coordinates
	const ivec2 pixel_coords = ivec2(gl_GlobalInvocationID.xy);
	const vec4 output_pixel = imageLoad(inputa_image, pixel_coords) + imageLoad(inputb_image, pixel_coords);

	imageStore(output_image, pixel_coords, output_pixel);
}

Maybe it fails because you map units 0 and 2 to the same texture.
But I guess you already tried a real additional temporary texture as well?

So yeah, this is what sucks about GPU programming. We never know why it doesn't work.

Edit: Checking for GL errors on the API side might help.
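
A minimal sketch of such an error check follows. The numeric codes are the values defined by the OpenGL headers and are repeated here only so the sketch compiles without them. The helper names error_name and drain_errors are hypothetical. The error source is passed in as a callable, so in a real program one would pass something like `[] { return glGetError(); }`. OpenGL can queue several errors, which is why the helper keeps asking until it receives GL_NO_ERROR.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Values as defined in the OpenGL headers (GL_NO_ERROR, GL_INVALID_ENUM, ...).
constexpr unsigned int kNoError = 0;
constexpr unsigned int kInvalidEnum = 0x0500;
constexpr unsigned int kInvalidValue = 0x0501;
constexpr unsigned int kInvalidOperation = 0x0502;
constexpr unsigned int kStackOverflow = 0x0503;
constexpr unsigned int kStackUnderflow = 0x0504;
constexpr unsigned int kOutOfMemory = 0x0505;
constexpr unsigned int kInvalidFramebufferOperation = 0x0506;

// Translate an error code into its symbolic name.
const char* error_name(unsigned int code)
{
    switch (code)
    {
    case kNoError:                     return "GL_NO_ERROR";
    case kInvalidEnum:                 return "GL_INVALID_ENUM";
    case kInvalidValue:                return "GL_INVALID_VALUE";
    case kInvalidOperation:            return "GL_INVALID_OPERATION";
    case kStackOverflow:               return "GL_STACK_OVERFLOW";
    case kStackUnderflow:              return "GL_STACK_UNDERFLOW";
    case kOutOfMemory:                 return "GL_OUT_OF_MEMORY";
    case kInvalidFramebufferOperation: return "GL_INVALID_FRAMEBUFFER_OPERATION";
    default:                           return "unknown error code";
    }
}

// Call `get_error` until it reports no error and collect the names seen.
template <typename GetErrorFn>
std::vector<std::string> drain_errors(GetErrorFn get_error)
{
    std::vector<std::string> names;
    for (unsigned int code = get_error(); code != kNoError; code = get_error())
        names.push_back(error_name(code));
    return names;
}
```

Calling the helper right after each suspicious GL call narrows down which call raised the error, since the error queue does not record where an error came from.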

Yeppers, this is what I'm doing now:


	// create output temp texture (allocated with glTexImage2D, no initial data)
	GLuint temp_tex;

	glGenTextures(1, &temp_tex);
	glActiveTexture(GL_TEXTURE0);
	glBindTexture(GL_TEXTURE_2D, temp_tex);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
	glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, win_x, win_y, 0, GL_RGBA, GL_FLOAT, NULL);
	glBindImageTexture(0, temp_tex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);


	glActiveTexture(GL_TEXTURE0);
	glBindTexture(GL_TEXTURE_2D, temp_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "output_image"), 0);

	// activate glow and last frame glow input textures
	glActiveTexture(GL_TEXTURE1);
	glBindTexture(GL_TEXTURE_2D, glowmap_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "inputa_image"), 1);

	glActiveTexture(GL_TEXTURE2);
	glBindTexture(GL_TEXTURE_2D, last_frame_glowmap_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "inputb_image"), 2);

	// call compute shader
	glowmap_copier.use_program();
	glDispatchCompute((GLuint)win_x, (GLuint)win_y, 1);

	// Order the shader's image writes before later image accesses (a memory barrier, not a CPU-side wait)
	glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

	// copy from temp to last frame using glCopyImageSubData
	glCopyImageSubData(temp_tex, GL_TEXTURE_2D, 0, 0, 0, 0,
		last_frame_glowmap_tex, GL_TEXTURE_2D, 0, 0, 0, 0,
		win_x, win_y, 1);

	glDeleteTextures(1, &temp_tex);

and

// OpenGL 4.3 introduces compute shaders
#version 430

layout(local_size_x = 1, local_size_y = 1) in;

layout(binding = 0, rgba32f) writeonly uniform image2D output_image;
layout(binding = 1, rgba32f) readonly uniform image2D inputa_image;
layout(binding = 2, rgba32f) readonly uniform image2D inputb_image;


void main()
{
	// Get global coordinates
	const ivec2 pixel_coords = ivec2(gl_GlobalInvocationID.xy);
	const vec3 output_pixel = imageLoad(inputa_image, pixel_coords).rgb + imageLoad(inputb_image, pixel_coords).rgb;

	imageStore(output_image, pixel_coords, vec4(output_pixel, 1.0));
}

Great.
But you know what I have to say about this:

taby said:
layout(local_size_x = 1, local_size_y = 1) in;

;D
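
The smiley presumably refers to the 1x1 local size: each work group then contains a single invocation, which typically leaves most of a GPU's SIMD lanes idle. With a larger local size the dispatch has to shrink accordingly. Here is a minimal sketch of that arithmetic, where the helper name group_count and the 8x8 local size mentioned below are assumptions, not something from the thread.

```cpp
#include <cassert>

// Number of work groups needed to cover `size` texels along one axis when
// each group holds `local_size` invocations along that axis (rounded up).
unsigned int group_count(unsigned int size, unsigned int local_size)
{
    assert(local_size > 0);
    return (size + local_size - 1) / local_size;
}
```

With local_size_x = 8 and local_size_y = 8 in the shader, the dispatch would become glDispatchCompute(group_count(win_x, 8), group_count(win_y, 8), 1). Because of the rounding up, the shader would also need to skip invocations whose pixel_coords fall outside the image whenever the window size is not a multiple of 8.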
