Towards an Optimal VEX-SSE 3*3*float Matrix Transpose


Started by Geometrian September 28, 2017 09:13 PM

Hi all,

More than a decade ago, a problem came up on this forum: computing a fast transpose of a 3x3 matrix using SSE. The most sensible implementation stores the matrix internally as a 3x4 matrix (that is, each row holds 4 floats in one aligned vector, with the fourth lane unused). The following version, which I believe to be the fastest currently known, was presented:

On 6/27/2005 at 9:20 PM, ajas95 said:
// input xyz in xmm5,7,6
// output in xmm0,1,7
movaps	xmm0,	xmm7		   // xmm0 : ?? z1 y1 x1
movaps	xmm1,	xmm5		   // xmm1 : ?? z0 y0 x0
unpcklps xmm0,	xmm5		   // xmm0 : y1 y0 x1 x0
unpckhps xmm7,	xmm5		   // xmm7 : ?? ?? z1 z0
movhlps	xmm1,	xmm0		   // xmm1 : ?? z1 y1 y0
shufps	xmm7,	xmm6,	11100100b  // xmm7 : ?? z2 z1 z0
movlhps	xmm0,	xmm6		   // xmm0 : ?? x2 x1 x0
shufps	xmm1,	xmm6,	01010100b  // xmm1 : ?? y2 y1 y0
(P.S. If anyone has a faster way, I'd love to hear it. This uses 5 registers, and so destroys one of the inputs. Still, it's a great problem for people that are into this sort of thing).

I am pleased to report that I have been able to come up with a version which should be faster:


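#include <xmmintrin.h>

inline void transpose(__m128& A, __m128& B, __m128& C) {
    //Input rows in __m128& A, B, and C.  Output in same.
    //Lane comments list lanes low to high; a3, b3, c3 are the unused fourth lanes.
    __m128 T0 = _mm_unpacklo_ps(A,B);                    // T0 = a0 b0 a1 b1
    __m128 T1 = _mm_unpackhi_ps(A,B);                    // T1 = a2 b2 a3 b3
    A = _mm_movelh_ps(T0,C);                             // A  = a0 b0 c0 c1
    B = _mm_shuffle_ps( T0,C, _MM_SHUFFLE(3,1,3,2) );    // B  = a1 b1 c1 c3
    C = _mm_shuffle_ps( T1,C, _MM_SHUFFLE(3,2,1,0) );    // C  = a2 b2 c2 c3
}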

This should be 5 instructions instead of ajas95's 8. Of course, to get that level of performance with either version, everything needs to be inlined; otherwise a lot of time is spent moving floating-point arguments to and from the input registers.
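One way to sanity-check the lane shuffles is to compare the result against a plain scalar transpose. The following is only a minimal sketch: `transpose` is repeated from above so the block compiles on its own, while `transpose_matches_scalar` and `check_transpose` are hypothetical helper names introduced here for illustration. It checks only the first three lanes of each row, since the fourth lane is unused.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: __m128 and the _mm_* functions

// The 5-instruction transpose from the post, repeated so this sketch is self-contained.
inline void transpose(__m128& A, __m128& B, __m128& C) {
    __m128 T0 = _mm_unpacklo_ps(A, B);
    __m128 T1 = _mm_unpackhi_ps(A, B);
    A = _mm_movelh_ps(T0, C);
    B = _mm_shuffle_ps(T0, C, _MM_SHUFFLE(3, 1, 3, 2));
    C = _mm_shuffle_ps(T1, C, _MM_SHUFFLE(3, 2, 1, 0));
}

// Hypothetical helper: returns true if the SIMD result equals the scalar transpose.
inline bool transpose_matches_scalar() {
    const float m[3][3] = {{1.0f, 2.0f, 3.0f},
                           {4.0f, 5.0f, 6.0f},
                           {7.0f, 8.0f, 9.0f}};

    // Pack each row into a vector; _mm_set_ps takes its arguments high lane first,
    // so the unused fourth lane comes first and is set to zero here.
    __m128 rows[3];
    for (int r = 0; r < 3; ++r) {
        rows[r] = _mm_set_ps(0.0f, m[r][2], m[r][1], m[r][0]);
    }

    transpose(rows[0], rows[1], rows[2]);

    // After the transpose, row r should hold column r of the input.
    for (int r = 0; r < 3; ++r) {
        float out[4];
        _mm_storeu_ps(out, rows[r]);
        for (int c = 0; c < 3; ++c) {
            if (out[c] != m[c][r]) return false;
        }
    }
    return true;
}

// Hypothetical convenience wrapper for debug builds.
inline void check_transpose() {
    assert(transpose_matches_scalar());
}
```

Calling `check_transpose()` once at startup in a debug build is enough to catch a mistyped shuffle mask; it says nothing about speed, which still has to be judged from the generated assembly.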

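It is also crucial that the instructions be VEX-encoded. VEX allows three-operand instructions like `vunpcklps`, which write to a separate destination, instead of two-operand instructions like `unpcklps`, which overwrite their first source. Without VEX, the compiler has to insert extra register copies to preserve values that are reused (such as `A` and `T0` above), which eats into the savings. VEX is only available with AVX and higher (usually passing e.g. `-mavx` is sufficient to get the compiler to generate VEX instructions).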
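To confirm that a given build really targets AVX, the predefined `__AVX__` macro can be tested; GCC and Clang define it under `-mavx`, and MSVC defines it under `/arch:AVX`. The sketch below assumes nothing beyond that macro; `built_with_avx` is a hypothetical name used only for illustration.

```cpp
// Reports at compile time whether this translation unit targets AVX,
// i.e. whether the compiler is allowed to emit VEX-encoded instructions.
constexpr bool built_with_avx() {
#if defined(__AVX__)
    return true;   // e.g. GCC/Clang with -mavx, MSVC with /arch:AVX
#else
    return false;  // the compiler will fall back to two-operand SSE encodings
#endif
}
```

A `static_assert(built_with_avx(), "...")` next to the transpose would turn a missing flag into a compile error. This only checks the build configuration; inspecting the disassembly is still the way to see the actual instruction count.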

-G

