I'm a programmer with barely any art skills. I've been playing around with software rendering for a couple of years, and realized that using a lower bpp opens up performance opportunities in some of the slowest parts of the pipeline.
But just dropping bits off RGB(A)888(8) would obviously cause some deterioration in quality. Art would need to be created in these formats. From Google, I see RGB565 is used sometimes in mobile development, but I can't really find much on the other formats. So I need to ask game artists: are people ever using them? Is it practical to produce relatively modern-quality textures and pixel art in these pixel formats?
Anybody use RGB565, RGB555, or RGBA4444?
We used 565 two generations of consoles ago
It was common before about the year 2000 for PC games to ask you if you wanted to run in 16- or 32-bit mode -- often that was choosing between an 8888 back buffer and a 565 back buffer.
555 is very similar in quality, but just wastes one bit to make implementation symmetrical between the channels -- some hardware may implement that instead of 565.
4444 is another 16-bit format that would be used when you needed an alpha channel.
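For reference, a quick sketch of how those three packings lay out when truncating straight down from 888/8888 (function names are just illustrative, and I'm assuming the common layout with red in the top bits):

#include <stdint.h>

// Sketch: pack 888(8) down to the 16-bit formats above by straight truncation.
static uint16_t pack565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
static uint16_t pack555(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}
static uint16_t pack4444(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint16_t)(((r >> 4) << 12) | ((g >> 4) << 8) | ((b >> 4) << 4) | (a >> 4));
}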
You don't really see these formats used any more; they're pretty damn horrible. Even at the time, you'd get better quality with an 8-bit palettized texture (8bpp, indexing an external 256-entry 888 colour palette), though not all hardware supported that... These days you could trivially implement 8-bit palettized textures in a pixel shader if you wanted to.
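In a software renderer the palettized path is just an indexed fetch. A minimal sketch, with hypothetical names:

#include <stdint.h>

// Sketch: an 8bpp palettized fetch. 'indices' is the 8-bit texture,
// 'palette' is a 256-entry XRGB8888 table, 'pitch' is the row stride.
static uint32_t sample_pal8(const uint8_t *indices, const uint32_t palette[256],
                            int u, int v, int pitch)
{
    return palette[indices[v * pitch + u]];
}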
Art tools generally support 256-colour mode much more than they do 565/555/4444 colour mode...
Also, we now get the same performance enhancement by using block-compression texture formats instead of low-bit-count formats.
e.g. DXT1 has been the standard texture format for over a decade now -- usually artists author 888 content, and then a tool compresses this into 4x4 pixel blocks, where each block stores two 565 colour values and 16 two-bit interpolation values (i.e. a 4x4 grid of values, where each cell holds either 0%, 33%, 67% or 100%). Each pixel in the 4x4 block gets its colour by blending the two 565 values using that pixel's interpolation value -- resulting in almost 888 quality in just 8bpp. This is all done in hardware such that it's "free" (obviously not in a software renderer though).
This format looks great for general game textures; however, it's not so great for "pixel art" with very strong linework, since it can only store one simple colour gradient per 4x4-pixel block.
Direct3D11/OpenGL4 added a lot more kinds of block-compression formats, so DXT1/3/5 (now called BC1/2/3) aren't necessarily your only choices any more either.
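To make the block layout concrete, here's a rough sketch of decoding one opaque-mode DXT1/BC1 block (names are mine; the format also has a second mode, selected when colour0 <= colour1, that yields (c0+c1)/2 plus transparent black, which I've skipped here):

#include <stdint.h>

// Expand a 565 colour to 888, replicating the high bits into the low bits.
static void expand565(uint16_t c, uint8_t rgb[3])
{
    uint8_t r5 = (c >> 11) & 0x1F, g6 = (c >> 5) & 0x3F, b5 = c & 0x1F;
    rgb[0] = (uint8_t)((r5 << 3) | (r5 >> 2));
    rgb[1] = (uint8_t)((g6 << 2) | (g6 >> 4));
    rgb[2] = (uint8_t)((b5 << 3) | (b5 >> 2));
}

// Decode one 8-byte DXT1/BC1 block into a 4x4 grid of 888 pixels
// (row-major, 3 bytes per pixel), assuming the opaque mode.
static void decode_dxt1_block(const uint8_t block[8], uint8_t out[16][3])
{
    uint16_t c0 = (uint16_t)(block[0] | (block[1] << 8));
    uint16_t c1 = (uint16_t)(block[2] | (block[3] << 8));
    uint8_t palette[4][3];
    expand565(c0, palette[0]);
    expand565(c1, palette[1]);
    for (int ch = 0; ch < 3; ch++) {
        palette[2][ch] = (uint8_t)((2 * palette[0][ch] + palette[1][ch]) / 3); // 33% toward c1
        palette[3][ch] = (uint8_t)((palette[0][ch] + 2 * palette[1][ch]) / 3); // 67% toward c1
    }
    uint32_t bits = block[4] | (block[5] << 8) | (block[6] << 16) | ((uint32_t)block[7] << 24);
    for (int i = 0; i < 16; i++) {
        int idx = (bits >> (2 * i)) & 3; // the 2-bit interpolation value per pixel
        out[i][0] = palette[idx][0];
        out[i][1] = palette[idx][1];
        out[i][2] = palette[idx][2];
    }
}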
We used 565 two generations of consoles ago
Only two? I figured at least 3. :P
You don't really see these formats used any more; they're pretty damn horrible.
I've seen some reasonably good-quality photographs that had been converted to RGB565 with dithering, so I was thinking it would be suitable; I just don't really have anything to back this up. If, for example, most PS2 games used RGB565 textures, then this tells me that they would probably be suitable for my use case. I wouldn't expect my software renderer to match the graphical quality of modern consoles at real-time frame rates anyway, so why have high-quality textures?
These days you could trivially implement 8-bit palettized textures in a pixel shader if you wanted to.
I support 8-bit palettes without pixel shaders, but by its nature this winds up having essentially the same performance as an RGB texture, just with less memory use. It's not giving me more frames per second.
e.g. DXT1 has been the standard texture format for over a decade now -- usually artists author 888 content, and then a tool compresses this into 4x4 pixel blocks, where each block stores two 565 colour values and 16 two-bit interpolation values (i.e. a 4x4 grid of values, where each cell holds either 0%, 33%, 67% or 100%). Each pixel in the 4x4 block gets its colour by blending the two 565 values using that pixel's interpolation value -- resulting in almost 888 quality in just 8bpp. This is all done in hardware such that it's "free" (obviously not in a software renderer though).
This would probably improve bilinear filtering performance, but it still doesn't reduce the number of ops for "imperfect" alpha blending, or for trilinear filtering, the way actually using RGB565 would.
The real issue is that the optimal bilinear filtering case (no colorkey) costs me about 30ns per pixel (with SSE2 optimization in assembly). By comparison, true perspective correction (thanks to carefully arranging operations to get the most out of superscalar execution) only costs me about 10ns per pixel. When I render at 640x480, I get usable performance - about 70fps. But at 1024x768, I'm stuck at about 25fps. These are with bilinear filtering enabled and true perspective correction, rendering on a fixed scene with no alpha blending; so they're really "ideal" stats - I can expect real-world performance to be at least 20% worse.
I could probably do per-scanline concurrent rendering and get better performance on multi-core machines, but I'm trying to avoid doing something like that.
Honestly, my SSE2 implementation of bilinear filtering is only about 5ns faster than my default C implementation (possibly because it requires a function call). Tweaking the default C implementation slightly for RGB565 makes it 3ns faster than the SSE2 implementation - it's not much, but it's a couple of extra frames.
Only two? I figured at least 3.
Programmer counting. Current = 0. Prevgen = -1. Old = -2
Yeah, Wii/PS2 type stuff.
If, for example, most PS2 games used RGB565 textures, then this tells me that they would probably be suitable for my use case
Sounds good then. Give it a go.
If you do implement a 565 back buffer, you'll probably also want to apply dithering right before you go from higher precision down to 565. Naive compression into that format does create some quite awful colour banding. There are particularly awful shades of dark green and purple that are almost the signature of 565 back buffers.
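Something as simple as a 4x4 ordered dither kills most of that banding. A rough sketch, assuming you convert 888 down to 565 at the end of the pipeline:

#include <stdint.h>

// Sketch: 888 -> 565 with a 4x4 Bayer ordered dither. The threshold is
// scaled to each channel's quantization step (8 for the 5-bit channels,
// 4 for the 6-bit one) before truncation.
static const uint8_t bayer4[4][4] = {
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 },
};

static uint16_t to565_dithered(uint8_t r, uint8_t g, uint8_t b, int x, int y)
{
    int t = bayer4[y & 3][x & 3];
    int r5 = (r + (t >> 1)) >> 3; if (r5 > 31) r5 = 31;
    int g6 = (g + (t >> 2)) >> 2; if (g6 > 63) g6 = 63;
    int b5 = (b + (t >> 1)) >> 3; if (b5 > 31) b5 = 31;
    return (uint16_t)((r5 << 11) | (g6 << 5) | b5);
}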
Do you do your shading math in floating point?
For texture sampling, you can probably come up with some fancy SSE to splat the same 565 value across three 32bit SSE lanes, shift right by {0,5,11}, mask by {0x1F, 0x3F, 0x1F}, convert to float, and multiply by {1.0f/0x1F, 1.0f/0x3F, 1.0f/0x1F} to end up with float RGBX in an SSE register.
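Perhaps something like this (untested sketch -- SSE2 has no per-lane variable shift, that needs AVX2's _mm_srlv_epi32, so the three shifts below happen in scalar code while the vector is built; the mask, convert and normalise steps are vectorised):

#include <stdint.h>
#include <emmintrin.h> // SSE2

// Sketch: unpack one 565 texel to normalized floats in an SSE register.
// Lanes, low to high: blue, green, red, unused.
static __m128 unpack565_ps(uint16_t texel)
{
    __m128i v = _mm_set_epi32(0, texel >> 11, texel >> 5, texel);
    v = _mm_and_si128(v, _mm_set_epi32(0, 0x1F, 0x3F, 0x1F));
    __m128 f = _mm_cvtepi32_ps(v);
    return _mm_mul_ps(f, _mm_set_ps(0.0f, 1.0f / 0x1F, 1.0f / 0x3F, 1.0f / 0x1F));
}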
Do you use SSE for individual optimizations like this, or do you use it more generally to process four pixels at a time?
I could probably do per-scanline concurrent rendering and get better performance on multi-core machines, but I'm trying to avoid doing something like that.
It's a good use of extra cores if they're currently sitting idle. Splitting the screen into 4 segments should give a near-perfect 300% boost on modern machines!
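A sketch of the idea with C11 threads (render_band is a hypothetical stand-in for your existing scanline loop, restricted to a range of rows):

#include <threads.h>

extern void render_band(int y0, int y1); // hypothetical: renders rows [y0, y1)

typedef struct { int y0, y1; } band_t;

static int band_worker(void *arg)
{
    const band_t *b = (const band_t *)arg;
    render_band(b->y0, b->y1);
    return 0;
}

void render_threaded(int height, int nthreads) // nthreads <= 16
{
    thrd_t tid[16];
    band_t bands[16];
    for (int i = 0; i < nthreads; i++) {
        bands[i].y0 = height * i / nthreads;       // split into horizontal bands
        bands[i].y1 = height * (i + 1) / nthreads;
        thrd_create(&tid[i], band_worker, &bands[i]);
    }
    for (int i = 0; i < nthreads; i++)
        thrd_join(tid[i], NULL);                   // wait for the whole frame
}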
Do you do your shading math in floating point?
For texture sampling, you can probably come up with some fancy SSE to splat the same 565 value across three 32bit SSE lanes, shift right by {0,5,11}, mask by {0x1F, 0x3F, 0x1F}, convert to float, and multiply by {1.0f/0x1F, 1.0f/0x3F, 1.0f/0x1F} to end up with float RGBX in an SSE register.
I use floats (to get the advantage of SSE ops for transforms) up until I feed triangles into the rendering functions (which actually just implement a scanline-based "span list" to prevent overdraw; a later function does the actual work).
I convert vertices to a fixed point format with 15 bits of precision, which is what is used inside the rasterizer to minimize the number of expensive float->int conversions.
I was using 16 bits of precision for a long time, but then I realized I could perform an inversion with a single 32-bit divide if I use 15 bits of precision. (GCC handles 64-bit/32-bit division by generating the appropriate x86 assembly; Visual C++ is too stupid and uses the alldiv built-in. This was a change to improve performance for Visual C++, but I figure it should also help on a potential future ARM port, which lacks this extended divide instruction.)
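Roughly, the idea is this (a simplified sketch, not my exact code):

#include <stdint.h>

// Sketch: with 15 fractional bits, 1.0 is (1 << 15) and the reciprocal
// numerator (1 << 30) still fits in a signed 32-bit int, so a single
// 32-bit divide suffices. With 16 fractional bits the numerator would be
// (1 << 32), forcing a 64-bit/32-bit divide.
#define FIX_SHIFT 15

static int32_t fix_recip(int32_t z) // z is 15-bit fixed point, z != 0
{
    return (int32_t)((INT32_C(1) << (2 * FIX_SHIFT)) / z);
}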
So unfortunately pixel shaders, while an option, require a function call and operate on 16 pixels at a time (to spread out the function call overhead), and you'd have to convert to float yourself within these functions if you needed that.
Do you use SSE for individual optimizations like this, or do you use it more generally to process four pixels at a time?
Inside the renderer, I currently only use SSE for bilinear filtering. I tried using it to optimize alpha blending, but it's garbage at that. In fact, I was disappointed by its performance improvement over my plain old C implementation too; I expected a lot more than 5ns per pixel (and when I implemented it with just intrinsics, it was much slower on both GCC and Visual C++).
I perform bilinear filtering using the vector integer intrinsics added in SSE2. I also use the logical intrinsics added in SSE2 for performing the colorkey test, when that's necessary. This avoids floating point conversion inside the renderer. There's probably some clever way to optimize the non-SSE portions of the assembly code; I've never been much good at assembly, but I did try to take advantage of superscalar execution where I could.
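The colorkey test amounts to something like this (an illustrative sketch, not my exact code):

#include <stdint.h>
#include <emmintrin.h> // SSE2

// Sketch: test the four bilinear taps against the colorkey at once.
static int any_colorkey(const uint32_t pels[4], uint32_t key)
{
    __m128i p  = _mm_loadu_si128((const __m128i *)pels);
    __m128i k  = _mm_set1_epi32((int)key);
    __m128i eq = _mm_cmpeq_epi32(p, k); // per-lane all-ones where equal
    return _mm_movemask_epi8(eq) != 0;  // non-zero if any tap matched the key
}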
You might like my plain old C implementation for XRGB:
// CHANMASK0 == 0x00ff00ff
// CHANMASK1 == 0xff00ff00
// pels[] = {topleft(uv), topright(uv), bottomleft(uv), bottomright(uv)}
// mods[] = premultiplied area of coverage as an 8-bit binary fraction, each stored in 32 bits
// guaranteed: mods[0]+mods[1]+mods[2]+mods[3] == 256
// this arrangement was for the SSE2 version, but I changed the C version to use it on a whim
// retrb accumulates the red and blue channels; retag accumulates alpha (or x in this case) and green
// each 8-bit field times a mod (<= 256) stays within its 16-bit slot, so two
// channel pairs can share one 32-bit accumulator without overflowing
uint32_t retrb = 0, retag = 0;
retrb += (pels[0] & CHANMASK0) * mods[0];
retag += ((pels[0] >> 8) & CHANMASK0) * mods[0];
retrb += (pels[1] & CHANMASK0) * mods[1];
retag += ((pels[1] >> 8) & CHANMASK0) * mods[1];
retrb += (pels[2] & CHANMASK0) * mods[2];
retag += ((pels[2] >> 8) & CHANMASK0) * mods[2];
retrb += (pels[3] & CHANMASK0) * mods[3];
retag += ((pels[3] >> 8) & CHANMASK0) * mods[3];
*dest = ((retrb & CHANMASK1) >> 8) | (retag & CHANMASK1);