Aybe One said:
I switched to 3840x2160 to “assess” the power of compute shaders, got an amazing 5 FPS… Profiled, switched to ComputeShader.DispatchIndirect, it's smooth but GPU use is now 95%!!!
You iterate over ALL lines in every single thread (so for every pixel) of your shader:
for (int i = 0; i < LinesCount; i++)
In other words: for each of (all) 8 million pixels, iterate over (all) 4 thousand lines. /:O\
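For illustration, a sketch of what that pattern amounts to (only LinesCount is from your snippet; the buffer names and the DistanceToSegment helper are my assumptions):

StructuredBuffer<float4> Lines;   // assumption: .xy = start, .zw = end, in pixels
RWTexture2D<float4> Result;
int LinesCount;

// Hypothetical point-to-segment distance helper (projection clamp).
float DistanceToSegment(float2 p, float2 a, float2 b)
{
    float2 ab = b - a;
    float t = saturate(dot(p - a, ab) / max(dot(ab, ab), 1e-6));
    return length(p - (a + t * ab));
}

// Brute force: every pixel tests every line.
// ~8 million pixels x ~4 thousand lines = ~32 billion tests per frame.
[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    float4 color = float4(0, 0, 0, 1);
    for (int i = 0; i < LinesCount; i++)
    {
        if (DistanceToSegment((float2)id.xy, Lines[i].xy, Lines[i].zw) < 0.5)
            color = float4(1, 1, 1, 1);
    }
    Result[id.xy] = color;
}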
This is like saying ‘hey, I have 4k little cores - let's make sure each of those cores does the same work as all the others, and as much of it as possible’.
Ofc. it's slow. It's as inefficient as possible. You deserve the bad results to remind you that you should do at least a bit of optimization.
A trivial approach would be:
Notice each thread group draws a tile of 8x8 pixels.
1. Bin the lines to those tiles, so each tile gets its own list of the lines intersecting it.
Technically this can be done with a shader processing one line per thread, increasing an atomic counter per tile in VRAM. That's not the fastest option, but it's simple and should be fast enough.
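A sketch of those two per-line passes (the scatter pass runs after the prefix sum described below; all names are assumptions, and I assume the lines are already clipped to the screen):

// Pass 1: one thread per line, count how many lines touch each tile.
// Binning by the segment's bounding box is conservative (it may count
// tiles a diagonal line only passes near), but it is simple and safe.
StructuredBuffer<float4> Lines;        // .xy = start, .zw = end, in pixels
RWStructuredBuffer<uint> TileCounts;   // one counter per 8x8 tile, cleared to 0
int LinesCount;
int TilesX;                            // horizontal res / 8

[numthreads(64, 1, 1)]
void CountLines(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= (uint)LinesCount) return;
    float4 seg = Lines[id.x];
    uint2 tMin = (uint2)(min(seg.xy, seg.zw) / 8.0);
    uint2 tMax = (uint2)(max(seg.xy, seg.zw) / 8.0);
    for (uint ty = tMin.y; ty <= tMax.y; ty++)
        for (uint tx = tMin.x; tx <= tMax.x; tx++)
            InterlockedAdd(TileCounts[ty * TilesX + tx], 1);
}

// Pass 3 (after the prefix sum): the same traversal, but now each line
// writes its index into the per-tile lists, using the offsets as cursors.
RWStructuredBuffer<uint> TileCursors;  // initialized to a copy of TileOffsets
RWStructuredBuffer<uint> TileLineList; // TotalCount entries

[numthreads(64, 1, 1)]
void ScatterLines(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= (uint)LinesCount) return;
    float4 seg = Lines[id.x];
    uint2 tMin = (uint2)(min(seg.xy, seg.zw) / 8.0);
    uint2 tMax = (uint2)(max(seg.xy, seg.zw) / 8.0);
    for (uint ty = tMin.y; ty <= tMax.y; ty++)
        for (uint tx = tMin.x; tx <= tMax.x; tx++)
        {
            uint slot;
            InterlockedAdd(TileCursors[ty * TilesX + tx], 1, slot);
            TileLineList[slot] = id.x;
        }
}

Both passes share the same traversal, so counting and scattering stay consistent with each other.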
After the counts for all tiles (or ‘bins’) are known, calculate a prefix sum over the counters to know how much memory you need to store all the per-tile lists. Then each tile knows the beginning and end of its local list of lines. (A second per-line pass then writes each line index into its tiles' ranges, as in the scatter sketch above.)
Technically a naive scan would use one thread per tile, so a dispatch of (horizontal res / 8) x (vertical res / 8) threads to process the counters.
If it's only 4K lines, I would do the prefix sum in a single workgroup of maximum size (1024 threads), which should be faster and much simpler than synchronizing multiple workgroups across multiple dispatches.
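A sketch of such a single-workgroup scan: one dispatch of (1, 1, 1), each thread loops over the counter array in chunks of 1024, with a straightforward Hillis-Steele scan per chunk (names are assumptions):

#define GROUP_SIZE 1024

StructuredBuffer<uint> TileCounts;     // per-tile counts from the first pass
RWStructuredBuffer<uint> TileOffsets;  // exclusive prefix sums (list start per tile)
RWStructuredBuffer<uint> TotalCount;   // single uint: total size of all lists
int NumTiles;

groupshared uint temp[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void PrefixSum(uint tid : SV_GroupIndex)
{
    uint runningTotal = 0;
    for (uint base = 0; base < (uint)NumTiles; base += GROUP_SIZE)
    {
        uint i = base + tid;
        uint value = (i < (uint)NumTiles) ? TileCounts[i] : 0;
        temp[tid] = value;
        GroupMemoryBarrierWithGroupSync();

        // Hillis-Steele inclusive scan within the chunk.
        for (uint offset = 1; offset < GROUP_SIZE; offset <<= 1)
        {
            uint add = (tid >= offset) ? temp[tid - offset] : 0;
            GroupMemoryBarrierWithGroupSync();
            temp[tid] += add;
            GroupMemoryBarrierWithGroupSync();
        }

        // Inclusive minus own value = exclusive; add carry of earlier chunks.
        if (i < (uint)NumTiles)
            TileOffsets[i] = runningTotal + temp[tid] - value;
        runningTotal += temp[GROUP_SIZE - 1];   // chunk total, same in all threads
        GroupMemoryBarrierWithGroupSync();      // temp is reused next iteration
    }
    if (tid == 0)
        TotalCount[0] = runningTotal;           // memory needed for all lists
}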
At this point you can also set up the indirect dispatch for the next step, covering only the tiles which are not empty.
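For example (a sketch, names are assumptions):

// One thread per tile: append non-empty tiles to a work list and
// count them in the indirect args.
StructuredBuffer<uint> TileOffsets;   // exclusive prefix sums
StructuredBuffer<uint> TotalCount;    // total number of list entries
RWStructuredBuffer<uint> WorkList;    // indices of non-empty tiles
RWByteAddressBuffer DispatchArgs;     // three uints (x, 1, 1), x cleared to 0
int NumTiles;

[numthreads(64, 1, 1)]
void CompactTiles(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= (uint)NumTiles) return;
    uint start = TileOffsets[id.x];
    uint end = (id.x + 1 < (uint)NumTiles) ? TileOffsets[id.x + 1] : TotalCount[0];
    if (end == start) return;                 // empty tile, no work
    uint slot;
    DispatchArgs.InterlockedAdd(0, 1, slot);  // bump group count X, get a slot
    WorkList[slot] = id.x;
}

On the CPU side the args would be reset to (0, 1, 1) each frame, using a buffer created as ComputeBufferType.IndirectArguments so ComputeShader.DispatchIndirect can consume it.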
Finally, draw the lines. We get one thread group for each tile that is not empty, and we have a list of the lines covering this tile. So we draw the lines, but we need to clip them to the tile so we do not draw pixels multiple times from lines that appear in multiple bins.
Alternatively we could make sure each line appears in only one bin, by treating it as a point (its center, or the first point of the segment). Then there is no need for clipping and one thread just draws one whole line.
I guess this would be a bit faster if all your segments have a similar length.
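A sketch of the draw pass for the first variant, done here as a per-pixel coverage test against the tile's list, which avoids explicit clipping because each pixel is only ever written by its own thread (DistanceToSegment as defined in the first sketch; all other names are assumptions):

// One 8x8 group per non-empty tile, launched via DispatchIndirect.
// Each thread shades one pixel against only this tile's lines.
StructuredBuffer<float4> Lines;
StructuredBuffer<uint> TileOffsets;
StructuredBuffer<uint> TotalCount;
StructuredBuffer<uint> TileLineList;
StructuredBuffer<uint> WorkList;
RWTexture2D<float4> Result;
int TilesX;
int NumTiles;

[numthreads(8, 8, 1)]
void DrawTiles(uint3 gid : SV_GroupID, uint3 gtid : SV_GroupThreadID)
{
    uint tile = WorkList[gid.x];
    uint2 pixel = uint2(tile % TilesX, tile / TilesX) * 8 + gtid.xy;
    uint start = TileOffsets[tile];
    uint end = (tile + 1 < (uint)NumTiles) ? TileOffsets[tile + 1] : TotalCount[0];

    float4 color = float4(0, 0, 0, 1);
    for (uint i = start; i < end; i++)
    {
        float4 seg = Lines[TileLineList[i]];
        if (DistanceToSegment((float2)pixel, seg.xy, seg.zw) < 0.5)
            color = float4(1, 1, 1, 1);
    }
    Result[pixel] = color;
}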
The whole process is very similar to the common technique of binning many lights into a low-res screen-space grid, often done in deferred rendering to handle large numbers of lights.
It's a good exercise to learn some parallel programming basics, but it's still a serious effort just to draw some lines. So if you have no interest in GPU and parallel programming right now, there should still be another way that is fast enough.
I see you only use vertical lines. Then it should be possible to match your compute reference with standard line rendering. Since they are all vertical, fill rules and subpixel conventions should not cause issues or confusion, I would assume.
Another property (which I have ignored in my general proposal above) is the fact that your lines are probably already sorted horizontally, and that for each column of pixels only one pixel will be drawn.
This property should allow for a simpler, more specific optimization, ideally not requiring multiple dispatches for binning (which is always some kind of sorting as well).
That's surely the way to go if conventional triangle rasterization indeed fails.
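For illustration, a minimal sketch of that special case, assuming each line is stored as (column, yStart, yEnd). With only ~4 thousand lines, one thread per line writing its whole column span is a tiny amount of work:

// Vertical-only special case: one thread per line, no binning needed.
// The layout of Lines and all names here are assumptions.
StructuredBuffer<float3> Lines;   // .x = column, .y = y start, .z = y end
RWTexture2D<float4> Result;       // cleared to the background beforehand
int LinesCount;

[numthreads(64, 1, 1)]
void DrawVerticalLines(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= (uint)LinesCount) return;
    float3 seg = Lines[id.x];
    uint x  = (uint)seg.x;
    uint y0 = (uint)min(seg.y, seg.z);
    uint y1 = (uint)max(seg.y, seg.z);
    for (uint y = y0; y <= y1; y++)
        Result[uint2(x, y)] = float4(1, 1, 1, 1);
}

That's a single dispatch of about 64 groups for 4 thousand lines, and each thread only touches the pixels of its own line.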