
3 quick ways to calculate the square root in C++

Started October 22, 2019 01:17 AM
37 comments, last by l0calh05t 4 years, 11 months ago
4 hours ago, DerTroll said:

I guess the compilers are inlining the function in the benchmarks and might perform some other optimizations in this context. I don't really know since micro benchmarking such low-level stuff is always a fight against the aggressive compiler xD.

Yes, inlining would remove/reduce both overheads. Just move the definitions to a different file and disable IPO to get rid of that effect.

On 11/9/2019 at 1:53 PM, DerTroll said:

since there is no approximation function for the square root itself (at least that I know)

Correction: only approximations exist for calculating the square root; there is no explicit closed form (not even in mathematical theory).

On 11/9/2019 at 1:53 PM, DerTroll said:

`SquareRootSSE` uses an approximation function to calculate the reciprocal of the reciprocal square root

You could've simply used _mm_sqrt_ss (the SQRTSS instruction) or even _mm_sqrt_ps.
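For illustration, a minimal scalar wrapper around that intrinsic might look like this (the wrapper name is mine, not from the thread):

#include <xmmintrin.h>

// Sketch: wrap the hardware SQRTSS instruction via the _mm_sqrt_ss intrinsic.
float SqrtSSE(float x)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}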

Further optimisation would involve GCC- and Clang-specific attributes, like __attribute__((leaf)), which tells the compiler that the function does not call back into the current compilation unit, so the optimizer could eliminate memory accesses (variables and stack alike) altogether and inline the function where appropriate. (On x86, calling a function normally involves the stack unless it's inlined, though the stack frame with function prologue and epilogue is usually eliminated; on ARM, a leaf function's return address is stored in a register, so sqrt would not touch the MMU or the memory cache at all.)
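For reference, the attribute syntax looks like this (a declaration-only sketch; the function name is made up, and whether the attribute actually enables the described optimizations depends on the compiler):

// GCC/Clang-specific attribute on a function declaration.
__attribute__((leaf)) float fast_sqrt(float x);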

Otherwise, I'd like to say a big thanks for your benchmarks; they are very accurate and extremely helpful. I think the only advantage the OP's solution (John Carmack's inverse square root) has is that it is a portable C solution, as all CPUs produced in the last thirty years or so use IEEE 754. Most notably, both x86 (desktop games) and ARM (games for tablets and mobiles). But you should use an #ifdef on the architecture and call the architecture's hardware-accelerated sqrt() instruction, as that's always faster, and there are only two targets you might consider anyway (four tops, if you want to support 32-bit versions too).
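Roughly like the following sketch (the macro checks and the function name are mine):

#include <cmath>
#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#endif

// Dispatch to the architecture's hardware square root.
float HardwareSqrt(float x)
{
#if defined(__x86_64__) || defined(_M_X64)
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));   // SQRTSS
#elif defined(__aarch64__) && defined(__GNUC__)
    float result;
    __asm__("fsqrt %s0, %s1" : "=w"(result) : "w"(x));  // hardware FSQRT
    return result;
#else
    return std::sqrt(x);                                // portable fallback
#endif
}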

Cheers,
bzt

18 hours ago, bzt said:

Correction, there's only approximation exists for square root calculation, and no explicit form (not even in mathematical theory).

Okay, you are technically right, but I think you're being a little nitpicky here. What I meant is that there is no built-in fast approximation of the square root that I know of. Also: "std::sqrt is required by the IEEE standard to be exact." (source).

So even if an approximation formula is used to get to the result, the result is exact (in terms of floating-point precision). If you have a look at the Intel Intrinsics Guide, they also talk about an approximation in the documentation of the fast reciprocal square root (_mm_rsqrt_ss), while they don't for _mm_sqrt_ss. So the term approximation refers to the result, not to the method.

 

18 hours ago, bzt said:

You could've simply used _mm_sqrt_ss (the SQRTSS instruction) or even _mm_sqrt_ps.

No, I couldn't, because this is the "exact" (see above) and "slow" solution, which should result in the same timings as std::sqrt. I haven't tested it, but I think any smart compiler would use it under the hood if it were faster than the non-vector-register version. Also, compare the latencies of the involved instructions; they will get you to the same assessment (see below).

What I was aiming for here was to calculate the approximate sqrt result as fast as possible by sacrificing accuracy (as intended by the OP). There is only a built-in fast reciprocal square root, but no fast square root (at least that I know of). Dividing by the fast inverse square root gives an "approximate" result for the square root. However, this will only be faster than the "exact" square root (_mm_sqrt_ss) if you also use another approximation to calculate the reciprocal. That is what I was doing. To give you numbers from the intrinsics guide, here are some latencies for Skylake:


_mm_sqrt_ss:               13
_mm_rsqrt_ss + _mm_div_ss: 4 + 11 = 15
_mm_rsqrt_ss + _mm_rcp_ss: 4 + 4  = 8

This pretty much agrees with the results from the benchmark.
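In code, the approach described above looks roughly like this (a sketch; the original SquareRootSSE isn't reproduced in this thread, so the name here is mine):

#include <xmmintrin.h>

// Approximate sqrt(x) as rcp(rsqrt(x)), trading accuracy for latency.
// Both approximations are good to roughly 12 bits of precision.
float FastSqrtApprox(float x)
{
    const __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_rsqrt_ss(v)));
}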

 

18 hours ago, bzt said:

Further optimisation would involve GCC- and Clang-specific attributes, like __attribute__((leaf)), which tells the compiler that the function does not call back into the current compilation unit, so the optimizer could eliminate memory accesses (variables and stack alike) altogether and inline the function where appropriate.

Well, I don't know how much experience you have with benchmarking this low-level stuff, but nowadays compilers are super smart and aggressive when it comes to optimizations. I haven't looked at the produced assembly, but I am pretty sure the function gets inlined automatically. I mean, without using the "DoNotOptimize" function of the benchmark library, I couldn't even produce any timings, since the compiler removes the whole operation, probably because it can calculate the result at compile time (or for whatever other reason). Also, seeing that the timings agree well with the latencies from the intrinsics guide, I guess you can expect that the compiler eliminated every possible overhead.
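For context, the pattern I mean looks like this (a sketch assuming the Google Benchmark library; the benchmarked expression is just an example):

#include <benchmark/benchmark.h>
#include <cmath>

static void BM_StdSqrt(benchmark::State& state)
{
    float x = 2.0f;
    for (auto _ : state)
    {
        // DoNotOptimize keeps the compiler from eliding the computation.
        benchmark::DoNotOptimize(x = std::sqrt(x) + 1.0f);
    }
}
BENCHMARK(BM_StdSqrt);
BENCHMARK_MAIN();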

In my experience, there is often no point in trying to help the compiler produce more performant code. Compilers are so advanced that they can perform most optimizations without any help. I don't want to say that there is no room for optimization in the presented benchmarks, since I am just a passionate hobbyist and not a professional full-time programmer. However, the things I said are based on the experience I gained from writing a lot of benchmarks and trying to optimize my code for raw performance. If your experience differs, feel free to share it :)

Greetings

19 minutes ago, DerTroll said:

So the term approximation refers to the result, not to the method.

19 minutes ago, DerTroll said:

There is only a built-in fast reciprocal square root, but no fast square root (at least that I know of).

Oh, I see. That makes perfect sense. Thank you again for your thorough analysis, benchmarks, and examples. They are very informative and helpful!

20 minutes ago, DerTroll said:

In my experience, there is often no point in trying to help the compiler in order to produce more performant code.

Interesting. My experience is quite the opposite. I usually check the disassembly of frequently used and performance-critical functions, and I always find it easy to figure out a way to improve them by a lot. For example, last time I ran into this:


void somefunc(float a[4], float b[4])
{
    float dot = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
    // ... dot is used here ...
}

Even with "-O3 -ffast-math -msse4", the optimizer produced four individual loads and _mm_mul_ss instructions instead of one _mm_mul_ps, let alone emitting a single SSE4-optimized _mm_dp_ps instruction for this code.
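What I would have expected is something along these lines (a sketch of mine, with the signature simplified to plain pointers):

#include <smmintrin.h>  // SSE4.1, for _mm_dp_ps

float dot4(const float* a, const float* b)
{
    const __m128 va = _mm_loadu_ps(a);  // unaligned loads, see below
    const __m128 vb = _mm_loadu_ps(b);
    // 0xF1: multiply all four lanes, write the sum to the lowest lane.
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
}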

But I'm no ordinary programmer, I'm a black-belt Assembly coder on several architectures for about 30 years now. ;-) It wouldn't be fair to compare my skills with a machine's.

Cheers,
bzt

2 hours ago, bzt said:

But I'm no ordinary programmer, I'm a black-belt Assembly coder on several architectures for about 30 years now. ;-) It wouldn't be fair to compare my skills with a machine's.

Okay, no chance for me to win the smart-ass battle here. I yield! My assembly skills are extremely limited. I know enough to understand most of the stuff the compiler produces and to make some performance assessments, but I am far from being able to optimize pure assembly myself. However, I am quite familiar with the intrinsic functions for writing vectorized code, which brings me to your example with the dot product.

Maybe I was a little too focused on stuff like inlining and removing unnecessary operations when I wrote my answer. Unfortunately, auto-vectorization isn't working so well, I agree. Otherwise, I wouldn't be messing around with SSE and AVX myself.

Greetings

@bzt: It is unsurprising that "somefunc" isn't turned into a SIMD expression, as that would be strictly illegal. Passing float a[4] means that you are passing a pointer(!), which isn't even aligned to a 16-byte boundary. It is still true that most compilers fail at this type of optimization, even if the code is adapted to actually make the optimization legal.

https://godbolt.org/z/-Hq43G is the closest I have gotten without directly using SIMD intrinsics (the __m128 parameters are necessary, as it is the only way to get the values passed in via registers).
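For comparison, a fully intrinsic version would be something like this (my own sketch, not the code behind the link):

#include <xmmintrin.h>

float dot(__m128 a, __m128 b)
{
    const __m128 m    = _mm_mul_ps(a, b);
    // Horizontal sum via two shuffle/add steps (SSE1-compatible).
    const __m128 shuf = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1));
    const __m128 sums = _mm_add_ps(m, shuf);
    const __m128 high = _mm_movehl_ps(sums, sums);
    return _mm_cvtss_f32(_mm_add_ss(sums, high));
}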


On MSVC you'll additionally have to use the __vectorcall calling convention to get the __m128s passed in registers.
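Something like this (an illustrative declaration mirroring the sketch above):

#include <xmmintrin.h>

// __vectorcall is MSVC-specific; it passes __m128 arguments in registers.
float __vectorcall dot(__m128 a, __m128 b);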

l0calh05t said:

@bzt: It is unsurprising that "somefunc" isn't turned into a SIMD expression, as that would be strictly illegal. Passing float a[4] means that you are passing a pointer(!), which isn't even aligned to a 16-byte boundary. It is still true that most compilers fail at this type of optimization, even if the code is adapted to actually make the optimization legal.

https://godbolt.org/z/-Hq43G is the closest I have gotten without directly using SIMD intrinsics (the __m128 parameters are necessary, as it is the only way to get the values passed in via registers).



Hi,

You are mistaken, that's not illegal (why would it be?), and you got it wrong: it WAS turned into SIMD, but with scalar operations (_mm_mul_ss for the floats), and the compiler hasn't used the built-in dot product intrinsic.

Nope, it won't be a pointer. That would be float*. That many floats can easily be passed in registers, even if all elements are passed in separate registers; and at least gcc for the SysV ABI with those optimization flags compiled that code. But if they weren't, then what makes you think that a float[4] array is not 16-byte aligned? Most compilers will align it on a 16-byte boundary automatically, as sizeof(float[4]) is 16 (we are talking about static arrays, possibly on the stack, which is also expected to be 16-byte aligned upon function entry). Also, I could have used alignment attributes, now couldn't I? Please don't jump to conclusions.

Cheers,
bzt

Hello!

Excellent! I got useful information here. Thank you!

bzt said:

Hi,

You are mistaken, that's not illegal (why would it be?), and you got it wrong: it WAS turned into SIMD, but with scalar operations (_mm_mul_ss for the floats), and the compiler hasn't used the built-in dot product intrinsic.

Nope, it won't be a pointer. That would be float*. That many floats can easily be passed in registers, even if all elements are passed in separate registers; and at least gcc for the SysV ABI with those optimization flags compiled that code. But if they weren't, then what makes you think that a float[4] array is not 16-byte aligned? Most compilers will align it on a 16-byte boundary automatically, as sizeof(float[4]) is 16 (we are talking about static arrays, possibly on the stack, which is also expected to be 16-byte aligned upon function entry). Also, I could have used alignment attributes, now couldn't I? Please don't jump to conclusions.

Cheers,
bzt

float x[4], when passed as a function parameter in C or C++, is always actually a pointer by definition/specification (independent of the ABI)! Also, "most compilers will align float[4] to 16 bytes" is not true. A compiler may do so for a stack variable as an optimization, but an array has exactly the same alignment as its elements. And you can't just ignore ABI requirements: on both the Windows and Linux ABIs, a struct containing four floats (whether as scalars or as an array) will NOT be passed in registers. Obviously, none of this applies to inlined functions, because they are in the same compilation unit, or when you have LTO enabled, as ABIs and how parameters are passed are then no longer directly relevant.
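The decay rule is easy to demonstrate (a standalone sketch, not taken from the thread):

#include <type_traits>

// An array parameter is adjusted to a pointer, so these two declarations
// declare the exact same function.
void somefunc(float a[4], float b[4]);
void somefunc(float* a, float* b);

static_assert(std::is_same_v<decltype(&somefunc), void (*)(float*, float*)>,
              "float a[4] decays to float* in a parameter list");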

Also, I wouldn't call mulss SIMD, considering that both the Linux and Windows ABIs on x64 specify passing floats in xmm registers anyway.

This topic is closed to new replies.
