- Subtract the position of the center of the square from each of the four corners.
- Multiply each resulting vector by the desired rotation matrix. You could do quaternion-vector multiplication: q p q* (* is conjugate), but it's slower than building a 3x3 matrix from the quaternion and multiplying the matrix with each vector.
- Add the position of the center of the square to the rotated vectors to get the rotated point positions.
To show why matrices are faster than quaternions:
// SSE optimized 4x3 matrix * vector4
movss 4(%rdx), %xmm0
pshufd $0, %xmm0, %xmm0
mulps 16(%rsi), %xmm0
movss (%rdx), %xmm1
pshufd $0, %xmm1, %xmm1
mulps (%rsi), %xmm1
addps %xmm0, %xmm1
movss 8(%rdx), %xmm0
pshufd $0, %xmm0, %xmm0
mulps 32(%rsi), %xmm0
addps %xmm1, %xmm0
movaps %xmm0, (%rdi)
Just 12 instructions. Meanwhile a scalar code (no SSE) matrix*vector is 27 instructions. On the other hand, scalar (no SSE) quaternion*vector is 88 instructions. This is with GCC 4.2.
There's no way an SSE-optimized quaternion could beat matrix*vector. Even if you use structure-of-arrays layout, it would be about 88 instructions to do 4 quaternion*vector, which is about 22 instructions per multiply throughput. So, best case SSE quaternion*vector would be almost twice as slow as SSE matrix*vector, and you would have to have the data in the right layout for it to be fast at all.
With AVX, the matrix*vector becomes just 9 instructions due to broadcast instructions, and quaternion is still only 11 instructions per multiply throughput if we use SoA layout.
IMO the only good use of quaternions is for storing and interpolating animation or for networking, where the size of the rotation representation is important and/or you need to slerp.