C# System.Numerics.Vectors slow?

Started by November 23, 2015 09:40 AM
8 comments, last by vanka78bg 9 years, 2 months ago

Hello!

Recently I decided to test the new System.Numerics.Vectors library that comes with .NET 4.6. I downloaded the latest NuGet package (4.1.0) and tested it with Visual Studio 2015. The problem I have encountered is that the library seems to perform much more slowly than I would expect. For instance, here is a sample C# program and a corresponding C++ program for comparison:


using System;
using System.Diagnostics;
using System.Numerics;

static void Main()
{
    var vector = new Vector4();
    var matrix = new Matrix4x4();
    var stopwatch = new Stopwatch();

    stopwatch.Start();

    for (int index = 0; index < 1000000000; index++)
    {
        vector = Vector4.Transform(vector, matrix);
        vector = Vector4.Transform(vector, matrix);
        vector = Vector4.Transform(vector, matrix);
        vector = Vector4.Transform(vector, matrix);
        vector = Vector4.Transform(vector, matrix);
        vector = Vector4.Transform(vector, matrix);
        vector = Vector4.Transform(vector, matrix);
        vector = Vector4.Transform(vector, matrix);
        vector = Vector4.Transform(vector, matrix);
        vector = Vector4.Transform(vector, matrix);
    }

    stopwatch.Stop();

    Console.WriteLine($"Time: {stopwatch.Elapsed}");
}

#include <ctime>
#include <iostream>
#include <DirectXMath.h>

using namespace DirectX;
using namespace std;

int main()
{
    XMVECTOR vector{};
    XMMATRIX matrix{};
    time_t start;
    time_t final;

    time(&start);

    for (int index = 0; index < 1000000000; index++)
    {
        vector = XMVector4Transform(vector, matrix);
        vector = XMVector4Transform(vector, matrix);
        vector = XMVector4Transform(vector, matrix);
        vector = XMVector4Transform(vector, matrix);
        vector = XMVector4Transform(vector, matrix);
        vector = XMVector4Transform(vector, matrix);
        vector = XMVector4Transform(vector, matrix);
        vector = XMVector4Transform(vector, matrix);
        vector = XMVector4Transform(vector, matrix);
        vector = XMVector4Transform(vector, matrix);
    }

    time(&final);

    cout << "Time: " << final - start << endl;

    return 0;
}

This is by no means an accurate benchmark. I am including this sample code just to illustrate what I am talking about. The above C# code completes in more than 2 minutes, while the similar C++ code takes about 50 seconds. I am testing on a relatively fast i7 CPU, with x64 targets in Release mode, to make sure RyuJIT does its SIMD magic; at least Vector.IsHardwareAccelerated returns True during my tests.

I was expecting the C++ code to be faster than the C# one, but 3x faster seems like a lot to me. It is almost as if the C++ compiler emits real SIMD code, while the C# JIT does not do that at all, even though Vector.IsHardwareAccelerated is True. I guess I am doing something wrong, or perhaps benchmarks like this are simply too inaccurate and give misleading results. I have searched online for more accurate benchmarks that could give me some idea what performance to expect from System.Numerics.Vectors versus a pure native C++ implementation, but so far I have failed to find anything.

I need to evaluate whether System.Numerics.Vectors is fast enough for implementing a certain computationally expensive task, consisting mostly of vector and matrix math, or whether I would be better off implementing the algorithm in C++ and calling it from my C# code via P/Invoke. Since I am a newbie at this and not very experienced in C++, I would prefer to stick with C# if possible.

Regards.

Looking at that code, I would be surprised if it is the Numerics class that is slow.

I think you are just creating a whole load of garbage and triggering the GC.

10 * sizeof (Vector4) * 1000000000 is a lot of garbage.

Looking at that code, I would be surprised if it is the Numerics class that is slow.

I think you are just creating a whole load of garbage and triggering the GC.

10 * sizeof (Vector4) * 1000000000 is a lot of garbage.

I am not quite sure that the GC has anything to do with my performance issues. Both Vector4 and Matrix4x4 are implemented as value types (structs), so they should be allocated on the stack. What bothers me is that Vector4.Transform passes its arguments by value, which might lead to some copying on the stack, but I am clueless about how RyuJIT optimizes that in x64 Release mode. Is there any way to check the generated assembly instructions in Release mode with optimizations turned on?

After investigating the generated assembly instructions, it seems System.Numerics.Vectors does not generate SIMD instructions at all. Here is the generated assembly for the C# code:


vector = Vector4.Transform(vector, matrix);

00007FF8734446B5  lea         rax,[rsp+350h]  
00007FF8734446BD  movss       xmm0,dword ptr [rax]  
00007FF8734446C1  movss       xmm1,dword ptr [rax+4]  
00007FF8734446C6  movss       xmm2,dword ptr [rax+8]  
00007FF8734446CB  movss       xmm3,dword ptr [rax+0Ch]  
00007FF8734446D0  movdqu      xmm4,xmmword ptr [rsp+310h]  
00007FF8734446D9  movdqu      xmmword ptr [rsp+290h],xmm4  
00007FF8734446E2  movdqu      xmm4,xmmword ptr [rsp+320h]  
00007FF8734446EB  movdqu      xmmword ptr [rsp+2A0h],xmm4  
00007FF8734446F4  movdqu      xmm4,xmmword ptr [rsp+330h]  
00007FF8734446FD  movdqu      xmmword ptr [rsp+2B0h],xmm4  
00007FF873444706  movdqu      xmm4,xmmword ptr [rsp+340h]  
00007FF87344470F  movdqu      xmmword ptr [rsp+2C0h],xmm4  
00007FF873444718  movaps      xmm4,xmm0  
00007FF87344471B  mulss       xmm4,dword ptr [rsp+290h]  
00007FF873444724  movaps      xmm5,xmm1  
00007FF873444727  mulss       xmm5,dword ptr [rsp+2A0h]  
00007FF873444730  addss       xmm4,xmm5  
00007FF873444734  movaps      xmm5,xmm2  
00007FF873444737  mulss       xmm5,dword ptr [rsp+2B0h]  
00007FF873444740  addss       xmm4,xmm5  
00007FF873444744  movaps      xmm5,xmm3  
00007FF873444747  mulss       xmm5,dword ptr [rsp+2C0h]  
00007FF873444750  addss       xmm4,xmm5  
00007FF873444754  movaps      xmm5,xmm0  
00007FF873444757  mulss       xmm5,dword ptr [rsp+294h]  
00007FF873444760  movaps      xmm6,xmm1  
00007FF873444763  mulss       xmm6,dword ptr [rsp+2A4h]  
00007FF87344476C  addss       xmm5,xmm6  
00007FF873444770  movaps      xmm6,xmm2  
00007FF873444773  mulss       xmm6,dword ptr [rsp+2B4h]  
00007FF87344477C  addss       xmm5,xmm6  
00007FF873444780  movaps      xmm6,xmm3  
00007FF873444783  mulss       xmm6,dword ptr [rsp+2C4h]  
00007FF87344478C  addss       xmm5,xmm6  
00007FF873444790  movaps      xmm6,xmm0  
00007FF873444793  mulss       xmm6,dword ptr [rsp+298h]  
00007FF87344479C  movaps      xmm7,xmm1  
00007FF87344479F  mulss       xmm7,dword ptr [rsp+2A8h]  
00007FF8734447A8  addss       xmm6,xmm7  
00007FF8734447AC  movaps      xmm7,xmm2  
00007FF8734447AF  mulss       xmm7,dword ptr [rsp+2B8h]  
00007FF8734447B8  addss       xmm6,xmm7  
00007FF8734447BC  movaps      xmm7,xmm3  
00007FF8734447BF  mulss       xmm7,dword ptr [rsp+2C8h]  
00007FF8734447C8  addss       xmm6,xmm7  
00007FF8734447CC  mulss       xmm0,dword ptr [rsp+29Ch]  
00007FF8734447D5  mulss       xmm1,dword ptr [rsp+2ACh]  
00007FF8734447DE  addss       xmm0,xmm1  
00007FF8734447E2  movaps      xmm1,xmm2  
00007FF8734447E5  mulss       xmm1,dword ptr [rsp+2BCh]  
00007FF8734447EE  addss       xmm0,xmm1  
00007FF8734447F2  movaps      xmm1,xmm3  
00007FF8734447F5  mulss       xmm1,dword ptr [rsp+2CCh]  
00007FF8734447FE  addss       xmm0,xmm1  
00007FF873444802  movss       xmm1,xmm0  
00007FF873444806  pslldq      xmm1,4  
00007FF87344480B  movss       xmm1,xmm6  
00007FF87344480F  pslldq      xmm1,4  
00007FF873444814  movss       xmm1,xmm5  
00007FF873444818  pslldq      xmm1,4  
00007FF87344481D  movss       xmm1,xmm4  
00007FF873444821  movaps      xmm0,xmm1  
00007FF873444824  movaps      xmmword ptr [rsp+350h],xmm0

And here is the assembly for the C++ code:


vector = XMVector4Transform(vector, matrix);

00007FF629AE108E  movaps      xmm3,xmm2  
00007FF629AE1091  movaps      xmm0,xmm2  
00007FF629AE1094  shufps      xmm3,xmm2,0FFh  
00007FF629AE1098  movaps      xmm1,xmm2  
00007FF629AE109B  shufps      xmm1,xmm2,55h  
00007FF629AE109F  shufps      xmm0,xmm2,0AAh  
00007FF629AE10A3  mulps       xmm1,xmm6  
00007FF629AE10A6  mulps       xmm3,xmm4  
00007FF629AE10A9  shufps      xmm2,xmm2,0  
00007FF629AE10AD  mulps       xmm2,xmm7  
00007FF629AE10B0  mulps       xmm0,xmm5  
00007FF629AE10B3  addps       xmm1,xmm2  
00007FF629AE10B6  addps       xmm3,xmm0  
00007FF629AE10B9  addps       xmm3,xmm1  

I am not an expert in assembly myself, but it seems the C# version uses scalar multiplications, while the C++ one uses packed vector ones. So the C# code is not really vectorized at all, even though Vector4.IsHardwareAccelerated claims otherwise by returning True. Perhaps I am missing something here.

First, due diligence: are you sure you're actually running 64-bit? By default, new projects get that dumb "Prefer 32-bit" option checked, so an AnyCPU build will still run as 32-bit. Also, if you've got the debugger attached, make sure you don't have it set to inhibit optimizations. By default most optimizations are disabled even in Release mode when a debugger is present, for obvious reasons.

Also, be careful about which package you're using. The JIT looks for a very particular assembly to enable intrinsics, so if there is a mismatch in which build you're using it will just not optimize it. Doesn't .NET 4.6 ship with an older System.Numerics.Vectors assembly built in, or is it still recommended to grab one from NuGet?

Mike Popoloski | Journal | SlimDX
It's definitely 64-bit; his disassembler is showing 64-bit instruction addresses.

Definitely looks like it's using scalar instructions. Perhaps Visual Studio's "Suppress JIT optimization" thing is causing this? Have you turned that off yet? There are some strange things I've had to do in the past to view fully optimized JIT disassembly; attach after launching, executing the function in question in a tight loop, etc.

Those look like SIMD registers to me. The xmm# registers are your CPU's SIMD registers. It just seems that the C# compiles to some pretty inefficient output. I see some extraneous shifting and moving. It's funny because it does MOVAPS but then immediately performs MULSS instead of MULPS. It looks like it's pulling out single pieces of your matrix one at a time, placing them into the xmm registers, and then performing MULSS on a single scalar, rather than loading up four at a time and doing MULPS. I have no idea why, though (here come random guesses): maybe the memory alignment is poor, or the transforms are stored transposed in memory.

http://stackoverflow.com/questions/30027707/what-is-the-difference-between-non-packed-and-packed-instruction-in-the-context
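The MULSS/MULPS distinction described above can be illustrated with a minimal SSE intrinsics snippet (not from the thread; just a sketch of the two instructions' semantics):

```cpp
#include <xmmintrin.h>

float out_packed[4];
float out_scalar[4];

void demo()
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 10.0f, 10.0f, 10.0f};

    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    // mulps: one instruction multiplies all four lanes -> {10, 20, 30, 40}
    _mm_storeu_ps(out_packed, _mm_mul_ps(va, vb));

    // mulss: only lane 0 is multiplied; lanes 1-3 pass through unchanged
    // from the first operand -> {10, 2, 3, 4}. This is why the scalar C#
    // output needs four separate multiplies per result component.
    _mm_storeu_ps(out_scalar, _mm_mul_ss(va, vb));
}
```

The C# disassembly above is effectively doing the second variant sixteen times per transform, while the C++ compiler emits four MULPS.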

Those look like SIMD registers to me. The xmm# registers are your CPU's SIMD registers. It just seems that the C# compiles to some pretty inefficient output. I see some extraneous shifting and moving. It's funny because it does MOVAPS but then immediately performs MULSS instead of MULPS. It looks like it's pulling out single pieces of your matrix one at a time, placing them into the xmm registers, and then performing MULSS on a single scalar, rather than loading up four at a time and doing MULPS. I have no idea why, though (here come random guesses): maybe the memory alignment is poor, or the transforms are stored transposed in memory.

http://stackoverflow.com/questions/30027707/what-is-the-difference-between-non-packed-and-packed-instruction-in-the-context

On AMD64 the SSE instruction set is used by default for any floating-point computations instead of the x87 FPU.

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”

The Matrix and Quaternion functions aren't yet recognized by RyuJIT, if I'm not mistaken. Only a subset of functions are marked with the JitIntrinsic attribute. You can find this set by looking at the *_Intrinsics.cs files in the corefx repo: https://github.com/dotnet/corefx/tree/master/src/System.Numerics.Vectors/src/System/Numerics
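If Vector4.Transform itself is not recognized, you can rebuild it from operations that are in the accelerated subset (Vector4 construction, broadcast, *, +). A minimal sketch, assuming System.Numerics.Matrix4x4's row-major layout — FastTransform is a made-up helper name, not a library API:

```csharp
using System.Numerics;

static class FastTransform
{
    // Hypothetical replacement for Vector4.Transform built only from
    // Vector4 arithmetic. Matrix4x4 is row-major and Transform treats the
    // vector as a row vector, so: v' = x*row1 + y*row2 + z*row3 + w*row4.
    public static Vector4 Transform(Vector4 v, Matrix4x4 m)
    {
        var row1 = new Vector4(m.M11, m.M12, m.M13, m.M14);
        var row2 = new Vector4(m.M21, m.M22, m.M23, m.M24);
        var row3 = new Vector4(m.M31, m.M32, m.M33, m.M34);
        var row4 = new Vector4(m.M41, m.M42, m.M43, m.M44);

        // new Vector4(scalar) broadcasts one component across all lanes,
        // so each line below maps to a packed multiply-add.
        return new Vector4(v.X) * row1
             + new Vector4(v.Y) * row2
             + new Vector4(v.Z) * row3
             + new Vector4(v.W) * row4;
    }
}
```

Whether this actually compiles down to packed instructions depends on the JIT recognizing the Vector4 operators as intrinsics, so it is worth re-checking the disassembly after the change.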

There is also the Vector type that exposes intrinsics more suitable for wide processing.
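For the wide-processing case, a small illustrative sketch of Vector<float> over arrays (WideOps is a made-up name; Vector<float>.Count depends on the hardware, e.g. 4 lanes with SSE, 8 with AVX):

```csharp
using System.Numerics;

static class WideOps
{
    // Adds two float arrays Vector<float>.Count lanes at a time; the
    // scalar tail loop handles leftover elements when the length is not
    // a multiple of the lane count.
    public static void Add(float[] a, float[] b, float[] result)
    {
        int width = Vector<float>.Count;
        int i = 0;
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);   // load 'width' lanes
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(result, i);        // packed add, store back
        }
        for (; i < a.Length; i++)
            result[i] = a[i] + b[i];
    }
}
```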

I just ended up creating custom types based on the set of definitely-accelerated functions. It can take some handholding and the occasional workaround, but it is possible to get most of the expected benefits. I've seen >4x speedups in some of my constraint solver prototyping work, and the vast majority of the boost was from SIMD. (I do wish for shuffles, though...)

Thank you for the provided help! This proves to be a great community once again!


The Matrix and Quaternion functions aren't yet recognized by RyuJIT, if I'm not mistaken. Only a subset of functions are marked with the JitIntrinsic attribute.

I did not know that not all operations are supported in RyuJIT yet. It seems .NET SIMD support is a work in progress, and I hope things will improve in the future. Since this is for a personal project I am working on in my spare time, I am not under any pressure. For this reason I'll stick with my C# implementation, while paying attention to operations that are not SIMD-accelerated yet. Due to a lack of real-world experience in C++, I would rather not go down that route, because it would seriously slow down my progress.

This topic is closed to new replies.
