Hi all,
More than a decade ago, a problem came up on this forum: computing a fast transpose of a 3x3 matrix using SSE. The most sensible implementation stores the matrix internally as a 3x4 matrix, so that each row occupies one aligned vector of four floats (the fourth element is padding). A version, which I believe was the fastest known until now, was presented:
On 6/27/2005 at 9:20 PM, ajas95 said:
// input xyz in xmm5,7,6
// output in xmm0,1,7
movaps xmm0, xmm7 // xmm0 : ?? z1 y1 x1
movaps xmm1, xmm5 // xmm1 : ?? z0 y0 x0
unpcklps xmm0, xmm5 // xmm0 : y1 y0 x1 x0
unpckhps xmm7, xmm5 // xmm7 : ?? ?? z1 z0
movhlps xmm1, xmm0 // xmm1 : ?? z1 y1 y0
shufps xmm7, xmm6, 11100100b // xmm7 : ?? z2 z1 z0
movlhps xmm0, xmm6 // xmm0 : ?? x2 x1 x0
shufps xmm1, xmm6, 01010100b // xmm1 : ?? y2 y1 y0
(P.S. If anyone has a faster way, I'd love to hear it. This uses 5 registers, and so destroys one of the inputs. Still, it's a great problem for people that are into this sort of thing).
I am pleased to report that I have been able to come up with a version which should be faster:
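#include <xmmintrin.h> // SSE intrinsics

// Input rows in A, B, and C (the 4th lane of each is padding). Output in same.
// Lane comments are written high-to-low, as in the assembly above.
inline void transpose(__m128& A, __m128& B, __m128& C) {
    __m128 T0 = _mm_unpacklo_ps(A,B); // T0 : y1 y0 x1 x0
    __m128 T1 = _mm_unpackhi_ps(A,B); // T1 : ?? ?? z1 z0
    A = _mm_movelh_ps(T0,C); // A : y2 x2 x1 x0 (top lane is padding)
    B = _mm_shuffle_ps( T0,C, _MM_SHUFFLE(3,1,3,2) ); // B : ?? y2 y1 y0
    C = _mm_shuffle_ps( T1,C, _MM_SHUFFLE(3,2,1,0) ); // C : ?? z2 z1 z0
}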
This should be 5 instructions instead of ajas95's 8. Of course, to get that level of performance with either version, everything needs to be inlined, or else you spend a lot of time moving floating-point arguments to and from registers.
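The lane bookkeeping is easy to get wrong, so here is a minimal sketch of how one might check the result against a plain scalar transpose. The function is repeated so the block compiles on its own; `transpose_matches_scalar` and `self_test` are names made up for illustration, and the test values are arbitrary.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Same five-shuffle transpose as in the post, repeated so this block stands alone.
inline void transpose(__m128& A, __m128& B, __m128& C) {
    __m128 T0 = _mm_unpacklo_ps(A, B);
    __m128 T1 = _mm_unpackhi_ps(A, B);
    A = _mm_movelh_ps(T0, C);
    B = _mm_shuffle_ps(T0, C, _MM_SHUFFLE(3, 1, 3, 2));
    C = _mm_shuffle_ps(T1, C, _MM_SHUFFLE(3, 2, 1, 0));
}

// Hypothetical helper: returns true if the 3x3 part of the SSE result equals
// the scalar transpose. The fourth lane of each row is padding and is ignored.
inline bool transpose_matches_scalar() {
    // Arbitrary test matrix: element [r][c] = 10*r + c, padding lane = -1.
    float in[3][4] = {{0.f, 1.f, 2.f, -1.f},
                      {10.f, 11.f, 12.f, -1.f},
                      {20.f, 21.f, 22.f, -1.f}};

    // Unaligned loads/stores keep the sketch independent of array alignment.
    __m128 A = _mm_loadu_ps(in[0]);
    __m128 B = _mm_loadu_ps(in[1]);
    __m128 C = _mm_loadu_ps(in[2]);
    transpose(A, B, C);

    float out[3][4];
    _mm_storeu_ps(out[0], A);
    _mm_storeu_ps(out[1], B);
    _mm_storeu_ps(out[2], C);

    // Shuffles only move bits around, so exact float comparison is fine here.
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            if (out[r][c] != in[c][r]) return false;
    return true;
}

// Hypothetical entry point for a debug build.
inline void self_test() { assert(transpose_matches_scalar()); }
```

Calling `self_test()` once at startup in a debug build would be enough to catch a wrong shuffle mask. It says nothing about speed; for that you would still need to look at the generated assembly.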
The other crucial thing is that the instructions be VEX-encoded. VEX allows three-operand instructions like `vunpcklps`, which write to a separate destination, instead of two-operand instructions like `unpcklps`, which overwrite their first source operand and so force an extra register copy whenever that input is needed again. VEX encoding requires AVX or higher (passing e.g. `-mavx` is usually sufficient to get the compiler to generate VEX instructions).
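To make the two-operand point concrete, here is a rough sketch that emulates legacy (non-VEX) encodings with small wrapper functions whose first argument is overwritten. The wrappers (`movaps`, `unpcklps`, and so on) and `transpose_two_operand` are hypothetical names for illustration only. This is one possible schedule, not what any particular compiler is guaranteed to emit: it needs two copies on top of the five shuffles.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical wrappers that mimic legacy two-operand encodings: dst is both
// the first source and the destination, so its previous value is lost.
inline __m128 movaps(const __m128& src)              { return src; }
inline void unpcklps(__m128& dst, const __m128& src) { dst = _mm_unpacklo_ps(dst, src); }
inline void unpckhps(__m128& dst, const __m128& src) { dst = _mm_unpackhi_ps(dst, src); }
inline void movlhps(__m128& dst, const __m128& src)  { dst = _mm_movelh_ps(dst, src); }
template <int Imm>
inline void shufps(__m128& dst, const __m128& src)   { dst = _mm_shuffle_ps(dst, src, Imm); }

// One possible two-operand schedule: 5 shuffles + 2 copies = 7 operations.
// Lane comments are written high-to-low, as in the quoted assembly.
inline void transpose_two_operand(__m128& A, __m128& B, __m128& C) {
    __m128 r0 = movaps(A);  // 1: copy, because unpcklps overwrites its first operand
    unpcklps(r0, B);        // 2: r0 = y1 y0 x1 x0
    unpckhps(A, B);         // 3: A  = ?? ?? z1 z0
    __m128 r1 = movaps(r0); // 4: copy, because r0 feeds two destructive shuffles
    movlhps(r0, C);         // 5: r0 = y2 x2 x1 x0
    shufps<_MM_SHUFFLE(3, 1, 3, 2)>(r1, C);  // 6: r1 = ?? y2 y1 y0
    shufps<_MM_SHUFFLE(3, 2, 1, 0)>(A, C);   // 7: A  = ?? z2 z1 z0
    // Hand the results back under the original names. Once inlined this should
    // only rename registers (an assumption about the optimizer, not a guarantee).
    C = A;
    A = r0;
    B = r1;
}

// Hypothetical check with arbitrary values; padding lanes are ignored.
inline bool two_operand_matches_expected() {
    __m128 A = _mm_setr_ps(0.f, 1.f, 2.f, -1.f);
    __m128 B = _mm_setr_ps(10.f, 11.f, 12.f, -1.f);
    __m128 C = _mm_setr_ps(20.f, 21.f, 22.f, -1.f);
    transpose_two_operand(A, B, C);

    float a[4], b[4], c[4];
    _mm_storeu_ps(a, A);
    _mm_storeu_ps(b, B);
    _mm_storeu_ps(c, C);
    return a[0] == 0.f && a[1] == 10.f && a[2] == 20.f &&
           b[0] == 1.f && b[1] == 11.f && b[2] == 21.f &&
           c[0] == 2.f && c[1] == 12.f && c[2] == 22.f;
}

inline void self_test_two_operand() { assert(two_operand_matches_expected()); }
```

With three-operand VEX forms the two `movaps` steps disappear, since each shuffle can write to a fresh destination and leave both sources intact.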
-G