Using a scene from the work-in-progress game Plee the Bear, I will first describe how much work the default rendering procedure does. I will then compare it with an improved procedure that is easy to implement. Finally, I will give pointers to an even better procedure, which has been used in Andy's Super Great Park.
Background
In the first versions of Plee the Bear we were not really worried about the speed of the rendering procedure, nor the speed of any other procedure. Keeping in mind that premature optimization is the root of all evil, we had to make things work before making them work fast. That was some years ago. Then the game grew, we began to put a lot of stuff in the levels, and the time finally came to think about speeding things up. That is the subject of this article: how the rendering procedure evolved as the game grew.
The initial procedure was as simple as possible: elements are rendered from the background to the foreground, as is. Having something drawn on the screen was a sufficient result at that time. So, how much work does this procedure do? Let's see how many times each pixel of the screen is written in a given scene. We will use the very beginning of the first act of the forest of Plee the Bear, just when the player can start to control Plee:

And we will apply the following color representation:

Here is the result for the aforementioned scene:

Not surprisingly, with three layers of rain plus the background, each pixel is written at least 4 times, most of them 5 or 6 times, and some up to 9 times. Once the rain is gone, the range goes from 1 to 6 writes:

An interesting thing in these two pictures is that even parts hidden by the middle ground decorations are rendered.
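The article does not say how the per-pixel counts were gathered. One possible way, sketched here under the assumption that every rendered element can be approximated by its axis-aligned bounding box, is to increment a counter for each pixel an element covers. The names `count_writes` and `histogram` are made up for this illustration and are not part of the game engine.

```python
def count_writes(width, height, rects):
    """Return a height x width grid of how many times each pixel is written.

    `rects` holds one (left, top, right, bottom) box per rendered element,
    in pixels, with right and bottom exclusive (an assumed convention).
    """
    counts = [[0] * width for _ in range(height)]
    for left, top, right, bottom in rects:
        # Clip the element to the screen before counting its pixels.
        for y in range(max(top, 0), min(bottom, height)):
            row = counts[y]
            for x in range(max(left, 0), min(right, width)):
                row[x] += 1
    return counts


def histogram(counts):
    """Map each write count to the number of pixels having that count."""
    result = {}
    for row in counts:
        for c in row:
            result[c] = result.get(c, 0) + 1
    return result


# A full-screen background plus one smaller decoration on a 4x2 screen.
demo = count_writes(4, 2, [(0, 0, 4, 2), (1, 0, 3, 1)])
```

Mapping each count to a color then gives a picture like the ones above, and the histogram tells how many pixels fall into each bucket.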
Improving the rendering procedure
The improvement we wanted to introduce then was to avoid rendering elements that will be hidden by other elements. The idea is to maintain a representation of the empty parts of the screen while considering the elements from the foreground toward the background. For each element there are two steps. First, if the element intersects the empty parts of the screen, we split it into sub-elements that cover only those empty parts. Then, if the initial element is opaque, we update the emptiness of the screen. To keep things simple, we represent the parts of the screen with axis-aligned boxes. An element is considered opaque if there is no alpha transparency in its source image and if it is not rotated.
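The two steps above can be sketched as follows. This is a minimal illustration, not the engine's actual code: boxes are assumed to be `(left, top, right, bottom)` tuples with exclusive right and bottom edges, the empty region is kept as a plain list of disjoint boxes, and the function names are hypothetical.

```python
def intersect(a, b):
    """Return the overlap of two boxes, or None if they do not overlap."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if left < right and top < bottom:
        return (left, top, right, bottom)
    return None


def subtract(box, hole):
    """Return the parts of `box` not covered by `hole` (up to four boxes)."""
    inter = intersect(box, hole)
    if inter is None:
        return [box]
    left, top, right, bottom = box
    il, it, ir, ib = inter
    parts = []
    if top < it:
        parts.append((left, top, right, it))      # strip above the hole
    if ib < bottom:
        parts.append((left, ib, right, bottom))   # strip below the hole
    if left < il:
        parts.append((left, it, il, ib))          # strip left of the hole
    if ir < right:
        parts.append((ir, it, right, ib))         # strip right of the hole
    return parts


def visible_parts(screen, elements):
    """Split elements into the sub-boxes that actually need drawing.

    `elements` is a list of (box, opaque) pairs ordered from the foreground
    to the background. The result is a list of (element index, sub-box)
    pairs ordered from the background to the foreground, ready to render.
    """
    empty = [screen]
    to_draw = []
    for index, (box, opaque) in enumerate(elements):
        # Step 1: keep only the pieces that fall in still-empty areas.
        for e in empty:
            piece = intersect(box, e)
            if piece is not None:
                to_draw.append((index, piece))
        # Step 2: an opaque element hides everything behind it.
        if opaque:
            empty = [part for e in empty for part in subtract(e, box)]
    to_draw.reverse()
    return to_draw


# Opaque foreground on the left half, a translucent element, then an
# opaque full-screen background, on a 10x10 screen.
demo = visible_parts(
    (0, 0, 10, 10),
    [((0, 0, 5, 10), True), ((3, 0, 8, 10), False), ((0, 0, 10, 10), True)],
)
```

In this small example the naive procedure writes 200 pixels while the filtered one writes 130, because the background and the translucent element are clipped to the right half of the screen.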

Let's come back to the game to see how many times the pixels are written with this procedure. During the rain:

Pixels are now written 2 to 8 times. Unlike with the original procedure, where every pixel was written at least 4 times, some are drawn only 2 or 3 times. The number of pixels drawn more than 3 times has been greatly reduced. And after the rain:

Here the range becomes 1 to 5 writes per pixel, and most pixels are written once or twice. Unlike with the original algorithm, more pixels are written once than three times.
The benchmark
Finally, for all this work to be useful it must improve performance, that is, increase the number of frames rendered per second. To keep the sequence of rendered items identical across the tests, we use a demo script that runs in the game. Here are the results:

One can see that the new procedure greatly increases the number of frames per second, which is exactly what we wanted.
Can we have more?
Yes! We can do better. You may have noticed in the captures above that some parts of the screen seem to be written several times even though the foreground seems opaque. The main reason is that these foreground sprites have some transparent pixels along one of the edges of their box, so the procedure does not consider any opaque box for them. To improve this, we just have to compute some kind of opaque box inside each sprite. More precisely, we want the largest opaque box of each sprite. Is it easy to compute? Well, it may be, if you reformulate the problem as finding the largest rectangle containing no transparent pixel. You now have an instance of the well-known Largest Empty Rectangle problem, for which you will find good resources, such as the article Computing the Largest Empty Rectangle on One- and Two-Dimensional Processor Arrays by Frank Dehne.
Unlike the previous procedure, this one cannot be executed at run time (unless you accept that levels take several minutes to load). For our games, we inserted the procedure in the level editor, as an optimization step executed when the level is compiled. The game engine then just has to read the computed opaque boxes and apply them in the procedure described earlier.
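To give an idea of what such a precomputation step can look like, here is a minimal sketch of the classic row-by-row histogram technique for finding the largest all-opaque rectangle in an alpha mask. It is not the algorithm from the cited article, nor necessarily what our level editor does. The function name, the mask layout (a list of rows of alpha values) and the rule that only alpha 255 counts as opaque are assumptions made for the example.

```python
def largest_opaque_box(alpha, opaque_value=255):
    """Return the largest fully opaque (left, top, right, bottom) box.

    `alpha` is a list of rows of alpha values; right and bottom are
    exclusive. Returns None when the image has no opaque pixel.
    """
    if not alpha or not alpha[0]:
        return None
    width = len(alpha[0])
    heights = [0] * width  # opaque run length ending at the current row
    best_area, best_box = 0, None
    for y, row in enumerate(alpha):
        for x in range(width):
            heights[x] = heights[x] + 1 if row[x] == opaque_value else 0
        # Largest rectangle under the histogram `heights`, using a stack of
        # (start column, height) pairs with increasing heights.
        stack = []
        for x in range(width + 1):
            h = heights[x] if x < width else 0  # sentinel flushes the stack
            start = x
            while stack and stack[-1][1] >= h:
                s, sh = stack.pop()
                area = sh * (x - s)
                if area > best_area:
                    best_area = area
                    best_box = (s, y - sh + 1, x, y + 1)
                start = s
            stack.append((start, h))
    return best_box


# A 5x4 sprite with a transparent border and one translucent edge pixel.
sprite = [
    [0,   0,   0,   0,   0],
    [0, 255, 255, 255,   0],
    [0, 255, 255, 255, 128],
    [0, 255, 255, 255,   0],
]
box = largest_opaque_box(sprite)
```

Each pixel is visited a constant number of times, so the cost is proportional to the size of the image. The resulting box can then be stored with the sprite and used in place of its full bounding box when updating the empty parts of the screen.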
Conclusion
Optimizing the performance of the code should not be the developer's main job but, sooner or later, it has to be done. It is then important to use the best resources available. The measurements presented in this article come from a simple improvement of the basic procedure, which already brings great results; we then found good articles to improve this procedure even further. One can also think of computing the k largest non-transparent rectangles in order to obtain more opaque boxes to filter the sprites, in which case the question arises of what limit to put on the recursion. One important part not explained in this article is how to compute the sub-sprites that must be rendered. The procedure is not really difficult, but one has to be careful with special cases such as rotated or mirrored sprites.
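As an illustration of the mirrored case only, here is a hedged sketch of how a visible sub-box on the screen might be mapped back to the part of the source image to draw. It assumes an unrotated sprite drawn from a source rectangle `src` into a screen box `dest`, both given as `(left, top, right, bottom)`; the function name and the `mirrored` and `flipped` parameters are hypothetical. Rotated sprites are deliberately left out.

```python
def sub_sprite_clip(dest, src, sub, mirrored=False, flipped=False):
    """Return the source rectangle matching the screen sub-box `sub`.

    `dest` is where the whole sprite lands on screen, `src` the rectangle
    it is read from in the image, and `sub` a box lying inside `dest`.
    """
    dl, dt, dr, db = dest
    sl, st, sr, sb = src
    # Scale factors from screen pixels to source pixels.
    kx = (sr - sl) / (dr - dl)
    ky = (sb - st) / (db - dt)
    # Offsets of the sub-box from the sprite's left and top screen edges.
    off_l, off_r = sub[0] - dl, sub[2] - dl
    off_t, off_b = sub[1] - dt, sub[3] - dt
    if mirrored:
        # A horizontally mirrored sprite reads its source right to left.
        off_l, off_r = (dr - dl) - off_r, (dr - dl) - off_l
    if flipped:
        # A vertically flipped sprite reads its source bottom to top.
        off_t, off_b = (db - dt) - off_b, (db - dt) - off_t
    return (sl + off_l * kx, st + off_t * ky,
            sl + off_r * kx, st + off_b * ky)


# The left quarter of a 40x20 sprite drawn at (100, 50) on screen.
plain = sub_sprite_clip((100, 50, 140, 70), (0, 0, 40, 20), (100, 50, 110, 70))
mirror = sub_sprite_clip((100, 50, 140, 70), (0, 0, 40, 20), (100, 50, 110, 70),
                         mirrored=True)
```

Without mirroring, the left quarter of the screen box maps to the left quarter of the image; with mirroring it maps to the right quarter, which is the kind of detail that is easy to get wrong.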
Article Update Log
10 June 2013: Initial release
It seems like you could do this with a depth pre-pass too. Render all your sprites (using the largest opaque box) just to the depth buffer, then do the final pass with the real sprites. Have you tried that? I don't know if it would produce as good performance improvement but it would be simpler on the CPU side.