I see. But you have all objects, transforms and bounding boxes available, so you could build an octree from all objects that do not move, then traverse it once to form groups.
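To make the idea concrete, here is a rough sketch of what I mean, in Python for brevity. All names (`Obj`, `build_groups`, `group_static`, `max_per_group`) are made up for illustration and are not from any engine. It assumes each object has an axis-aligned bounding box and is assigned to an octant by its box center; every leaf cell becomes one group.

```python
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    lo: tuple           # AABB min corner (x, y, z)
    hi: tuple           # AABB max corner (x, y, z)
    static: bool = True

def center(o):
    return tuple((a + b) * 0.5 for a, b in zip(o.lo, o.hi))

def build_groups(objs, lo, hi, max_per_group=4, depth=0, max_depth=8):
    """Split the cell [lo, hi) into octants; each leaf cell becomes one group."""
    if not objs:
        return []
    if len(objs) <= max_per_group or depth >= max_depth:
        return [objs]
    mid = tuple((a + b) * 0.5 for a, b in zip(lo, hi))
    buckets = {}
    for o in objs:
        c = center(o)
        # Octant key: one boolean per axis (upper half or not).
        key = tuple(c[i] >= mid[i] for i in range(3))
        buckets.setdefault(key, []).append(o)
    groups = []
    for key, members in buckets.items():
        child_lo = tuple(mid[i] if key[i] else lo[i] for i in range(3))
        child_hi = tuple(hi[i] if key[i] else mid[i] for i in range(3))
        groups += build_groups(members, child_lo, child_hi,
                               max_per_group, depth + 1, max_depth)
    return groups

def group_static(objs, lo, hi, **kw):
    """Only objects that do not move take part in the grouping."""
    return build_groups([o for o in objs if o.static], lo, hi, **kw)
```

Note that this only clusters by position, so a group can straddle a wall. That is exactly the imperfect grouping I mention below.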
I guess that's fast enough, but it won't give perfect groupings like 'all objects in a room'. It would just cluster nearby stuff, and that would hurt the occlusion culling method as well. Could be a problem, but the link only says 'culling' and nothing about occlusion. So is it just frustum culling? (Which would suffer a bit too, I guess.)
However, I have used this octree method for a hidden surface removal algorithm, tackling the same problem of interior details being visible through windows. Worked for me. My goal, though, was occlusion culling using software rendering, with manually made low-poly meshes representing the walls of the houses. The low-quality group bounding boxes from the octree often peeked through the walls and got rendered although not visible, but it was still a big win and fast enough for dynamic scenes.
I did not bother to optimize this further by fusing textures and building a UV atlas, but that sounds like no big problem. Maybe the GPU could transform UVs with a UV transform given per object.
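The per-object UV transform could be as simple as an offset and a scale. A minimal sketch, assuming a square grid atlas with equally sized tiles and original UVs inside [0, 1] (tiling UVs outside that range would not survive this). The function names are hypothetical; on the GPU the second function would be one multiply-add in the vertex shader.

```python
def atlas_rect(index, tiles_per_row):
    """Return (offset_u, offset_v, scale) for tile `index` in a square grid atlas."""
    scale = 1.0 / tiles_per_row
    col = index % tiles_per_row
    row = index // tiles_per_row
    return (col * scale, row * scale, scale)

def to_atlas_uv(uv, rect):
    """Map an object's own UV (in [0, 1]) into its tile of the atlas."""
    offset_u, offset_v, scale = rect
    return (offset_u + uv[0] * scale, offset_v + uv[1] * scale)

# Example: the object using tile 5 of a 4x4 atlas.
rect = atlas_rect(5, 4)
print(to_atlas_uv((0.5, 0.5), rect))
```

In practice you would probably also want some padding between tiles to avoid bleeding with filtering and mipmaps, which this sketch ignores.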
(I remember I once tested indexing thousands of transform matrices per vertex from video memory to draw thousands of objects with one draw call. This also worked well, so you could keep groups intact even with animation, and timeslice the grouping work.)
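Roughly what that test did, written as a CPU-side sketch in Python. This is not my original code, just the principle: every vertex carries an index into one big matrix array, and the "vertex shader" looks up its matrix by that index. The names (`draw_batched`, `matrix_ids`) are invented for the example.

```python
def translation(tx, ty, tz):
    """Row-major 4x4 translation matrix."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def transform_point(m, p):
    """Apply a row-major 4x4 matrix to a 3D point (w = 1)."""
    x, y, z = p
    return tuple(m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3]
                 for r in range(3))

def draw_batched(vertices, matrix_ids, matrices):
    """One 'draw call': each vertex fetches its own object's matrix by index."""
    return [transform_point(matrices[i], v)
            for v, i in zip(vertices, matrix_ids)]

# Two objects merged into one vertex buffer; moving an object only
# means updating its entry in `matrices`, the merged buffer stays untouched.
vertices = [(0, 0, 0), (1, 0, 0), (0, 0, 0)]
matrix_ids = [0, 0, 1]
matrices = [translation(10, 0, 0), translation(0, 5, 0)]
print(draw_batched(vertices, matrix_ids, matrices))
```

That is why animation does not force you to break up a group: the geometry stays merged and only the matrix array changes per frame.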
Maybe the worst performance issue would be generating the fused textures. That's a lot of brute-force work for the CPU. A compute shader could do such composition, if available? But I'm not sure it's practical with memory limits. And I'd hope they all use the same compression mode.