You might want to look up rectangulation and rectangular partitioning algorithms. It has actively been researched in other areas, specifically for VLSI and chip manufacturing, so you may need to look up some literature with those terms. But this is a very active area of research (basically anything that has to do with computation geometry is). In the general case, rectangulation can be quite difficult and most of those algorithms are probably not fast enough for your needs, and quite frankly, you probably don't need something that general or optimal. Check out http://www.math.unl.edu/~s-dstolee1/Presentations/Sto08-MinRectPart.pdf
I think you'll find that an optimal solution will be quite difficult to achieve. Without going into a lot of digging and careful reading of the literature, I suspect this problem is NP-complete. http://mathoverflow.net/questions/132603/algorithms-for-covering-a-rectilinear-polygon-using-the-same-multiple-rectangles not exactly the same problem, but very similar to yours.
Am I correct in understanding that your search space is 16 * 16 * 64 = 16384? If that's the number of nodes you have in your A* search space, it's large but still manageable without significant optimization... My game currently has some levels which can have 10000 nodes if I force a very fine nav mesh resolution and with ~80 agents I still have ~300 FPS (can't tell you the A* search time since I haven't instrumented that). It's not completely smooth, but it definitely isn't as bad as 30 ms, so I don't agree with your assertion that "this is of course to be expected". My A* implementation isn't hyper optimized or anything.
I think your best strategy right now is to drop the quest of "minimum" or optimality and go for a reasonable reduction in the search space. Try something dead simple first, like trying to build the largest cube/square and just go in the same direction every time. Perhaps before even that, make sure your A* search is reasonably performant. It doesn't sound like you have a very good implementation to me, but I need a lot more context to be able to know. Are your data structures designed for good cache locality? Are you using a decent priority queue to maintain your open set?
I've found that in my own A* implementation, the biggest gains were in simply making the initialization of the graph faster. My first implementation had this terrible Array of Structs format where each node contained all of its A* state information. For me to initialize the whole graph required individual setting of f, g, and parent states on each node. Very bad for memory performance. I rearranged the data structures to be a Struct of Arrays format where the graph now contains an entire array of f, g, and parent values and now initialization of the graph are very fast memsets() on entire blocks of memory. There are some other clever ways to make the initialization closer to O(1) time but I didn't want to deal with that.
This was a very simple transformation for me to make and yielded enormous performance improvements, so much so that I basically have come to a point where I just don't have to worry about A* performance in my game.