Another question: why is it √2?
Because of the Pythagorean Theorem. When you have a square with sides measuring 1, the diagonal of that square measures √2.
A cube is made up of 6 squares, so the same relation carries over. The only difference is that instead of sides of length 1, you want sides of a certain length in pixels. You multiply that pixel length by √2 to get the actual size of the diagonal in pixels.
When the cube is in its middle frame (exactly at a 45° rotation), the diagonal of its side faces is aligned to the ground. That's precisely its entire horizontal span when looked at from above, so you know the biggest horizontal frame size you need is that.
This is what happens when you look at it from the side (not from above like you want), just to illustrate:
Just, instead of the above values of (1) and (√2), in your actual animation it'll be (side_length_in_pixels) and (side_length_in_pixels * √2), respectively.
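If it helps to see the numbers, here is a rough sketch of that calculation in Python. The function names and the 64-pixel side length are made up for illustration, and it assumes a flat (non-perspective) view of a square face rotating about its center. The span at any angle is side × (|cos θ| + |sin θ|), which peaks at 45° with side × √2.

```python
import math

def span_at_angle(side_px: float, angle_deg: float) -> float:
    """Horizontal span of a square face of side `side_px` rotated by `angle_deg`.

    Assumes a flat (non-perspective) view; the name is hypothetical.
    """
    a = math.radians(angle_deg)
    return side_px * (abs(math.cos(a)) + abs(math.sin(a)))

def max_frame_width(side_px: float) -> int:
    """Widest span over the whole rotation (reached at 45 degrees),
    rounded up to a whole number of pixels."""
    return math.ceil(side_px * math.sqrt(2))

if __name__ == "__main__":
    side = 64  # example side length in pixels (an assumption)
    for angle in (0, 15, 30, 45):
        print(angle, round(span_at_angle(side, angle), 2))
    print("frame width needed:", max_frame_width(side))
```

Rounding up rather than to the nearest pixel is a deliberate choice here, so the widest frame never clips the cube's corners.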
Also, it would be slightly less than the square root of 2 if you were to be true to perspective.
Good point, I forgot to mention that I was referring to an orthogonal (orthographic) projection, so that's why it's exactly √2.
I would also animate it in 3D like you did - let the software do the interpolation and lighting, you just worry about animation.