The question you're asking has little to do with OpenGL itself. It is also a question with many possible answers. The following is one of them...
First off, there is not really a "render loop". Instead, (graphical) rendering is the very last step in the so-called game loop. Before rendering come the updates for input processing, AI, animation, physics, collision detection and correction, and perhaps others. This coarse layout of the game loop is usually understood as a sequence of so-called sub-systems. In this sense rendering is a sub-system.
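To make the ordering concrete, here is a minimal sketch of one pass of such a game loop. All names (Frame, runFrame, and the sub-system functions) are made up for illustration, and each sub-system is reduced to a stub that only records that it ran; a real one would own its data and do actual work.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-frame context. It only records which sub-systems ran,
// in which order, so that the sequence can be inspected.
struct Frame {
    double dt = 0.0;                 // time step in seconds
    std::vector<std::string> trace;  // names of the sub-systems run so far
};

// Stub sub-systems (assumed names).
void processInput(Frame& f)      { f.trace.push_back("input"); }
void updateAI(Frame& f)          { f.trace.push_back("ai"); }
void updateAnimation(Frame& f)   { f.trace.push_back("animation"); }
void updatePhysics(Frame& f)     { f.trace.push_back("physics"); }
void resolveCollisions(Frame& f) { f.trace.push_back("collision"); }
void render(Frame& f)            { f.trace.push_back("render"); }

// One pass of the game loop: every update runs first, rendering comes last.
Frame runFrame(double dt) {
    assert(dt >= 0.0);
    Frame f;
    f.dt = dt;
    processInput(f);
    updateAI(f);
    updateAnimation(f);
    updatePhysics(f);
    resolveCollisions(f);
    render(f);
    return f;
}
```

Calling runFrame(1.0 / 60.0) once per frame would give the coarse layout described above; the exact set and order of the update steps is a design decision, only "rendering last" is the point here.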
When looking at the various sub-systems, the question is how they all work. Is it possible that all of them have the same requirements? Unlikely! Instead, the different tasks performed during the game loop call for different data structures, each suited to its particular task. This means that a single all-purpose data structure like the classic scene graph is probably not suitable.
I write this because the scene graph approach is often taught in books, and it seems to me to be the case here, too. A scene graph is a single structure and looks promising at first glance, but it tends to become a nightmare the more tasks you try to solve with it. You asked for "high-performance", and such a scene graph does not belong in that toolbox. This does not mean that scene graphs are bad per se; if the scene graph is used for a single purpose then it is as good as any other structure.
Now, with respect to rendering the above thought has several implications. As can be seen from the sequence of sub-system processing, all the game objects must already be placed properly in the world, or else collision detection and correction could not have been done meaningfully. That means that "chaining of transformation matrices" is not a rendering concern at all. Instead, the process of rendering can be seen as follows:
1.) Iterate over all objects in the scene and determine which are visible.
2.) For all objects that pass the visibility test above, put a rendering job into one of perhaps several lists. Several lists may be used to distinguish up front between opaque and non-opaque objects, for example. Such a rendering job should hold enough information to let the low-level rendering later do what it has to do.
2a.) Skinned meshes may be computed just now, i.e. after it has been determined that they are visible.
3.) The lists will then be sorted by some criteria, e.g. considering the costs of resource switching (texture, mesh, shader, whatever) and, in the case of non-opaque objects, especially their order w.r.t. the camera view.
4.) The low-level rendering then iterates over the sorted lists in the given order, uses the data from each rendering job to set up the rendering state (blending mode, binding textures, VBOs, shaders, ..., as much as needed but as little as possible) and invokes the appropriate OpenGL drawing routine.
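The steps above can be sketched roughly as follows. This is only one possible layout under simplifying assumptions: RenderJob, SceneObject, RenderQueues and the three functions are hypothetical names, the visibility test is reduced to a precomputed flag, resources are plain integer handles, and the actual OpenGL calls are only hinted at in comments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical render job: just enough data for the low-level renderer.
struct RenderJob {
    std::uint32_t shader;   // resource handles (assumed to be plain ids)
    std::uint32_t texture;
    std::uint32_t mesh;
    float viewDepth;        // distance from the camera along the view direction
};

// Hypothetical scene object, already placed in the world by earlier sub-systems.
struct SceneObject {
    RenderJob job;
    bool opaque;
    bool visible;           // stand-in for a real test, e.g. frustum culling
};

struct RenderQueues {
    std::vector<RenderJob> opaque;
    std::vector<RenderJob> transparent;
};

// Steps 1 and 2: skip invisible objects, put a job into the matching list.
RenderQueues collectJobs(const std::vector<SceneObject>& scene) {
    RenderQueues q;
    for (const SceneObject& o : scene) {
        if (!o.visible) continue;
        (o.opaque ? q.opaque : q.transparent).push_back(o.job);
    }
    return q;
}

// Step 3: opaque jobs are grouped by resource to reduce state switches,
// non-opaque jobs are ordered back to front w.r.t. the camera.
void sortJobs(RenderQueues& q) {
    std::sort(q.opaque.begin(), q.opaque.end(),
              [](const RenderJob& a, const RenderJob& b) {
                  return std::tie(a.shader, a.texture, a.mesh) <
                         std::tie(b.shader, b.texture, b.mesh);
              });
    std::sort(q.transparent.begin(), q.transparent.end(),
              [](const RenderJob& a, const RenderJob& b) {
                  return a.viewDepth > b.viewDepth;
              });
}

// Step 4: walk a sorted list and change state only when it differs from the
// previous job. Returns the number of shader switches for inspection.
int submit(const std::vector<RenderJob>& jobs) {
    int shaderSwitches = 0;
    const RenderJob* prev = nullptr;
    for (const RenderJob& j : jobs) {
        if (prev == nullptr || prev->shader != j.shader) {
            ++shaderSwitches;  // a real renderer would call glUseProgram here
        }
        // Textures and meshes would be handled the same way, followed by
        // the draw call (e.g. glDrawElements).
        prev = &j;
    }
    return shaderSwitches;
}
```

The sort keys here are just an example; which resource switch is the most expensive depends on the hardware and driver, so the order of shader, texture and mesh in the key is something to measure rather than to assume.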
You can see from the above again that OpenGL itself is not in the foreground, even though we are directly discussing rendering now.
The question of rendering passes comes down to which kind of rendering you want to implement: forward shading, deferred shading, ..., which kind of shadowing algorithm you want to use, and whether you want to support non-opaque objects. Besides this, each rendering pass is more or less the same as described above, but obviously with different set-up and rendering states.
Organization of game objects can be done in various ways. However, from the above it should be clear that different aspects of game objects should be handled differently. A generally good approach is to prefer composition of game objects (instead of god classes or a widely branching inheritance tree).
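As a small illustration of what composition could look like, consider the following sketch. Placement, Body, Visual and GameObject are hypothetical names, and the components are referenced through non-owning pointers on the assumption that the respective sub-system owns them; other layouts (component pools, handles) are equally possible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical components; each sub-system looks only at the ones it needs.
struct Placement { float x, y, z; };          // position in the world
struct Body      { float vx, vy, vz; };       // velocity, used by physics
struct Visual    { unsigned mesh, material; }; // used by rendering

// A game object is composed of parts instead of deriving from a deep
// class hierarchy. Pointers are non-owning in this sketch.
struct GameObject {
    Placement placement;
    Body* body;      // null if the object does not take part in physics
    Visual* visual;  // null if the object is never drawn
};

// Physics touches only objects that have a Body.
void integrate(std::vector<GameObject>& objects, float dt) {
    assert(dt >= 0.0f);
    for (GameObject& o : objects) {
        if (o.body == nullptr) continue;
        o.placement.x += o.body->vx * dt;
        o.placement.y += o.body->vy * dt;
        o.placement.z += o.body->vz * dt;
    }
}

// Rendering considers only objects that have a Visual; their placement
// has already been finalized by the earlier sub-systems.
std::size_t countRenderable(const std::vector<GameObject>& objects) {
    std::size_t n = 0;
    for (const GameObject& o : objects) {
        if (o.visual != nullptr) ++n;
    }
    return n;
}
```

An invisible trigger volume would then simply be an object without a Visual, and a static decoration one without a Body; neither case needs its own class.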
Well, all this is perhaps not what you wanted to read, and I know that it is mostly vague. However, you must understand that a full-fledged solution has many, many aspects, and discussing them in a single post (or even thread) is not possible. This is, by the way, a reason why books tend to suggest the usage of scene graphs. It can also be understood as a hint for beginners to stick with the scene graph approach for now. In the end it's up to you to think about which way you want to go. However, decoupling things makes refactoring easier. Decoupling is at least something you should consider.
Looking forward to your answer ... ;)