Comments
What sort of rendering pipeline are you actually using? Is this using older-style blitting? If so, then I think your approach is ideal, although the blitting methods themselves are not very hardware accelerated and thus possibly not the best if you're going to have lots of transparency and other elements.
You might consider doing some more specific profiling to see exactly what is causing your FPS drops. This article is concerned with pixel fill rate on the GPU, and that certainly can be a limiting factor. But at the scale of the graphics you're using here it's mainly a limiter on older GPUs, whereas there are some much tighter bottlenecks even on very new GPUs.
The constraints I'm speaking of are material swaps (where textures and such are repeatedly rebound on the GPU during a single frame) and the number of "draw calls", which in many engines mostly just means how many vertex sets are submitted.
For my own work, I've focused mainly on reducing material swaps, and to a lesser extent on reducing pixel fill rate and draw calls.
The main technique for reducing material swaps is to sort the draws within each layer of the scene by material, and then skip the material switch between draw calls whenever the new material is the same as the last.
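A minimal sketch of that idea, with illustrative names of my own (`bind_material` and `draw_sprite` stand in for whatever your engine actually calls):

```python
def submit(draws, bind_material, draw_sprite):
    """draws: list of (layer, material, sprite) tuples.
    Sorting by (layer, material) groups same-material draws together,
    so we only bind a material when it actually changes."""
    draws.sort(key=lambda d: (d[0], d[1]))
    current = None
    binds = 0
    for layer, material, sprite in draws:
        if material != current:      # skip redundant rebinds
            bind_material(material)
            current = material
            binds += 1
        draw_sprite(sprite)
    return binds
```

With four draws over two materials this does two binds instead of four; the win grows with the number of sprites per material.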
For reducing draw calls, my main approach has been to omit completely obscured sprites, as you have here, but also to combine smaller sprites that share a texture into one larger sprite if they are contiguous, which seems to be the opposite of what you are doing here. In other words, your approach looks like it will lead to a lower pixel fill rate requirement but a larger number of draw calls. I am not sure this is a good tradeoff; you'd have to profile it in your specific engine to really see.
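As an illustration only (my naming, not code from any engine), merging horizontally contiguous unit-width tiles that share a texture into wider quads looks like this:

```python
def merge_row(tiles):
    """tiles: list of (x, texture) for unit-width tiles on one row.
    Returns quads as (x_start, x_end, texture); adjacent tiles with
    the same texture collapse into one quad, i.e. one draw call."""
    quads = []
    for x, tex in sorted(tiles):
        if quads and quads[-1][2] == tex and quads[-1][1] == x:
            x0, _, t = quads[-1]
            quads[-1] = (x0, x + 1, t)     # extend the previous quad
        else:
            quads.append((x, x + 1, tex))  # start a new quad
    return quads
```

Four tiles over two textures become three quads here; long same-texture runs (tiled backgrounds, floors) compress far more.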
There are some other indies I know with vastly more sophisticated techniques than what I'm using. They are using geometry combination methods that reduce draw calls even for non-contiguous uses of a given texture, and I believe even across draw depths. And through the use of sprite dictionaries for way more things than I tend to use them for, they're also achieving vastly fewer texture swaps.
I'm really impressed with what they are doing, but like you I think that optimization should only be taken so far, since there are always downsides. In my case I feel like those sprite dictionaries and so forth slow development to an unhelpful degree when there are lots of sprites (one of my games has north of 5000 individual sprite frames), so there's always a tradeoff there.
Doing the sub bounding boxes that you were suggesting as the next step would certainly be possible, but I think you will get diminishing returns from that compared to what you have already done. And, as I think you somewhat implied, I think the work for that next step of the sub bounding boxes will be a lot more than the work that it took to get you to this point.
Anyway -- very good article, and I like the way you were visualizing the pixel fill repetition. Very clever! I would suggest texture sorting for material swap reduction, and then geometry combination for draw call reduction, as the next avenues for getting your fps higher, though.
Cheers!
@phil_t Before implementing this solution I tried one based on the depth test, but it was not convincing: it saved the computation of the fragments' colors, but the depth test itself still had a cost. That said, I have never tried a two-pass rendering as you describe.
@x4000 It is indeed an old-style blitting procedure. At the time of the article there was still a possibility of changing the back end, so I wanted a rendering solution that would still work with another tool. Now the rendering is done using OpenGL, and a better ordering of the materials should indeed improve the procedure. Right now there is no optimization of the textures' content, nor do we minimize texture switches. So thanks for the hints :)
@cdoty I have never seen worse results with this procedure than with the previous one, even though, as you noticed, the new procedure increases the number of vertices. My first reflex was to limit the size of the sub-textures, but in practice it made no difference. Maybe combining it with the improvement proposed by x4000 would lead to different results on this point.
cdoty - it seems unlikely that the author is using hardware acceleration at all (at least directly; if he were, then a depth pre-pass would be better, as others have said), so triangle setup is irrelevant to him -- but I can make an educated guess that the answer to your question is "when they are very, very small". Probably something on the order of 16 pixels or even smaller. The same is probably true of the logic the author implements for his rectangles.
In my experience with similar optimizations, I have always found that the tipping point is very low on the scale.
@j-jorge: Ah, ok! If it's a blitting infrastructure, then your way (and your proposed second step for improvements) is clearly the best method by far. Overdraw is such a huge issue with blitting methods, for sure.
Anyhow, glad to hear the hints might be helpful for your new OpenGL methods. I too started with blitting methods (DirectX7 back in the 2002-2003 era), and moving to the 3D approach was really worthwhile but also challenging at first. There are still some vertex optimizations that I'd like to do but don't know how to do in an efficient way. I need to sit down with it more at some point, but I'd been focused on the low-hanging fruit of the textures and such prior to that.
I tried the same approach recently and found that this technique may only pay off on old video cards, or perhaps in Java or HTML5 (not tested). On DirectX with a mediocre video card it was actually slower. For example, in my case the sprite-splitting algorithm took about 0.250 ms for the whole scene while giving only a 0.050 ms improvement compared to not using it... so a net 0.200 ms worse. It could be great for just some parts of the scene (for example the background), but not for the whole scene (background, units, bullets, interior).
Very nice article, and thanks for linking to the paper on Largest Empty Rectangle. I am wondering if you can also provide some information on how you subdivide ("shatter") the shapes? Do you simply divide the rects in half and continue to divide each overlapped subsection until the smaller rectangles are not hidden (similar to a quadtree)?
@MikeS I process the shapes from front to back while keeping a list L of rectangles representing the parts of the screen not yet covered by an opaque shape. When a new shape is considered, I search for the rectangles in L intersecting the shape; then, for each of them:
- I create a new shape from the part contained in the rectangle,
- I remove the rectangle from L, subtract the shape's opaque rectangle from it to obtain zero to four new rectangles representing the uncovered parts, and insert these new rectangles into L.
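That subtraction step can be sketched as follows (a minimal version of my own; the exact split order is a choice, and this variant favors full-width horizontal strips rather than vertical ones):

```python
def subtract(r, opaque):
    """Rectangles are (x1, y1, x2, y2). Returns the parts of r not
    covered by opaque: zero to four rectangles."""
    x1, y1, x2, y2 = r
    ox1, oy1, ox2, oy2 = opaque
    # No overlap: r is untouched.
    if ox1 >= x2 or ox2 <= x1 or oy1 >= y2 or oy2 <= y1:
        return [r]
    out = []
    if oy1 > y1: out.append((x1, y1, x2, oy1))    # strip above
    if oy2 < y2: out.append((x1, oy2, x2, y2))    # strip below
    my1, my2 = max(y1, oy1), min(y2, oy2)
    if ox1 > x1: out.append((x1, my1, ox1, my2))  # strip left of
    if ox2 < x2: out.append((ox2, my1, x2, my2))  # strip right of
    return out
```

An opaque rectangle strictly inside r yields four strips; one fully covering r yields none.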
There is no quadtree but... ahem... it does seem perfect for L. I don't know why I did not use one, I should have thought about it :/ Also, the way I split the uncovered parts, as shown in the fifth picture, tends to produce vertical stripes that lead to visual artifacts. It should be easy to produce more regular rectangles.
So you are developing a 2D game and suddenly you discover that the rendering procedure is too slow. What would you do? Would you put fewer elements in your levels, or would you render them more efficiently? Since you are here, I suppose you would pick the latter. So here we go: I am going to tell you how to render fewer things while displaying more.
It seems like you could do this with a depth pre-pass too. Render all your sprites (using the largest opaque box) just to the depth buffer, then do the final pass with the real sprites. Have you tried that? I don't know if it would produce as good a performance improvement, but it would be simpler on the CPU side.
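Purely for illustration, here is a toy software model of that two-pass idea (no real GPU code, names and sprite layout are mine): pass 1 writes only depth using each sprite's opaque box, and pass 2 shades only the fragments that survive the depth test.

```python
def render(sprites, w, h):
    """sprites: list of (depth, x1, y1, x2, y2); smaller depth = nearer.
    Returns how many fragments actually get shaded in pass 2."""
    far = float('inf')
    zbuf = [[far] * w for _ in range(h)]
    # Pass 1: depth only, cheap per fragment.
    for depth, x1, y1, x2, y2 in sprites:
        for y in range(y1, y2):
            for x in range(x1, x2):
                if depth < zbuf[y][x]:
                    zbuf[y][x] = depth
    # Pass 2: shade only fragments at the nearest recorded depth.
    shaded = 0
    for depth, x1, y1, x2, y2 in sprites:
        for y in range(y1, y2):
            for x in range(x1, x2):
                if depth <= zbuf[y][x]:
                    shaded += 1
    return shaded
```

Two fully overlapping 4x4 sprites shade 16 fragments instead of the 32 a naive back-to-front pass would, which is exactly the overdraw the pre-pass eliminates.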