Hi all.
I am trying to use the power of hardware intrinsics, and as a test I wrote one function based on Avx2 instructions to compare against my current Vector implementation, which uses no intrinsics at all.
When I benchmarked the two functions doing the same work, I was surprised that the intrinsics function is actually about 2 times slower. I investigated and found that the calculation itself is ~3.8 times faster, but creating my wrapper structure and returning the result is where most of the time is spent.
Here is my implementation of the intrinsic method:
public static Vector4FHW Subtract(Vector4FHW left, Vector4FHW right)
{
    if (Avx2.IsSupported)
    {
        // Pack the four float fields of each operand into a 128-bit vector.
        var left1 = Vector128.Create(left.X, left.Y, left.Z, left.W);
        var right1 = Vector128.Create(right.X, right.Y, right.Z, right.W);
        var result = Avx2.Subtract(left1, right1);
        // Unpack the result lane by lane to build the wrapper struct.
        var x = result.GetElement(0);
        var y = result.GetElement(1);
        var z = result.GetElement(2);
        var w = result.GetElement(3);
        return new Vector4FHW(x, y, z, w);
    }
    // No fallback path yet: returns a zeroed struct when Avx2 is unavailable.
    return default;
}
And here is the naive implementation from my old Vector type:
public static void Subtract(ref Vector3D left, ref Vector3D right, out Vector3D result)
{
    result = new Vector3D(left.X - right.X, left.Y - right.Y, left.Z - right.Z);
}
I ran a benchmark with BenchmarkDotNet in which I call Subtract 1 000 000 times, and here are my results:
With HW support I get ~3170 us; without it, 970 us.
And my main question: what am I doing wrong that makes creating a C# struct with values take so long compared to my old implementation?