18 hours ago, bzt said:
Correction, there's only approximation exists for square root calculation, and no explicit form (not even in mathematical theory).
Okay, you are technically right, but I think a little bit nitpicky here. What I meant is that there is no built-in fast approximation of the square root that I know of. Also: "std::sqrt is required by the IEEE standard to be exact." (source).
So even if an approximation formula is used internally to get to the result, the result is exact (correctly rounded, in terms of floating-point precision). If you have a look into the Intel intrinsics guide, they also speak of an approximation in the documentation of the fast reciprocal square root (_mm_rsqrt_ss), while they don't for _mm_sqrt_ss. So the term "approximation" refers to the result, not to the method.
18 hours ago, bzt said:
You could've simply used mm_sqrt_ss (the SQRTSS instruction) or even mm_sqrt_ps.
No, I couldn't, because this is the "exact" (see above) and "slow" solution, which should result in the same timings as std::sqrt. I haven't tested it, but I think any smart compiler would use this instruction under the hood if it were faster than the non-vector-register version. Also, compare the latencies of the involved instructions; they lead to the same assessment (see below).
What I was aiming for here was to calculate an approximate square root as fast as possible by sacrificing accuracy (as intended by the OP). There is only a built-in fast reciprocal square root, but no fast square root (at least none that I know of). Dividing by the fast inverse square root gives an "approximate" result for the square root. However, this is only faster than the "exact" square root (_mm_sqrt_ss) if you also use another approximation to calculate the reciprocal, which is what I was doing. To give you numbers from the intrinsics guide, here are some latencies for a Skylake:
_mm_sqrt_ss: 13
_mm_rsqrt_ss + _mm_div_ss: 4 + 11 = 15
_mm_rsqrt_ss + _mm_rcp_ss: 4 + 4 = 8
This pretty much agrees with the results from the benchmark.
18 hours ago, bzt said:
Further optimisation would involve gcc and Clang specific attributes, like __attribute__((leaf)), which would tell the compiler that this is a static function which does not call other functions so the optimizer could eliminate all memory access (variables and stack as well) alltogether and inline the function if that's appropriate.
Well, I don't know how much experience you have with benchmarking this low-level stuff, but nowadays compilers are super smart and aggressive when it comes to optimizations. I haven't looked at the produced assembly, but I am pretty sure the function gets inlined automatically. Without using the benchmark library's "DoNotOptimize" function I couldn't even produce any timings, since the compiler removed the whole operation, probably because it could calculate the result at compile time (or for some other reason). Also, seeing that the timings agree well with the latencies from the intrinsics guide, I think you can assume the compiler eliminated every possible overhead.
In my experience, there is often no point in trying to help the compiler produce more performant code; compilers are so advanced that they can perform most optimizations without any help. I don't want to say that there is no room for optimization in the presented benchmarks, since I am just a passionate hobbyist and not a professional full-time programmer. However, what I said is based on the experience I have gained from writing a lot of benchmarks and trying to optimize my code for raw performance. If your experience differs, feel free to share it.
Greetings