Advertisement

3 quick ways to calculate the square root in c++

Started by October 22, 2019 01:17 AM
37 comments, last by l0calh05t 4 years, 11 months ago

 

 

Hi, this is my first forum and I want to do it: quick way to calculate the square root in c ++ with floating data types. These types of functions are very useful to gain some CPU time, especially when used continuously. I will show you 3 similar functions and indicate the advantages and disadvantages of each of them. The last of these three functions was written by me. If you notice that the writing is a bit out of grammar, it is because I do not speak English and I am using a support tool. My native language is Spanish. Well, let's start:


The First method is very famous was used in the video game quake III arena and you can find a reference in Wikipedia as: :https://en.wikipedia.org/wiki/Fast_inverse_square_root.

 

The Function was optimized for improvements in computing times.


float sqrt1(const float &n) 
{
   static union{int i; float f;} u;
   u.i = 0x5F375A86 - (*(int*)&n >> 1);
   return (int(3) - n * u.f * u.f) * n * u.f * 0.5f;
}

 

-Advantages:

* When Root of 0 is calculated the function returns 0.

* The convergence of the function is acceptable enough for games.

* It generates very good times.

* The Reciprocal of the root can be calculated by removing the second “n” from the third line. According to the property of: 1 / sqrt (n) * n = sqrt (n).

-Disadvantages:

* Convergence decreases when the root to be calculated is very large.

 

The second method is not as famous as the first. But it does the same function calculate the root.



float sqrt2(const float& n)
{
   union {int i; float f;} u;
   u.i = 0x1FB5AD00 + (*(int*)&n >> 1);
   u.f = n / u.f + u.f;
   return n / u.f + u.f * 0.25f;
}

 

-Advantages:

* The convergence of the function is high enough to be used in applications other than games.

-Disadvantages:

* Computing times are much larger.

* The square root of “0” is a number very close to “0” but never “0”.

* The division operation is the bottleneck in this function. because the division operation is more expensive than the other arithmetic operations of Arithmetic Logic Units (ALU).

 

The third method takes certain characteristics of the two previous methods.


float sqrt3(const float& n)
{
   static union {int i; float f;} u;
   u.i = 0x2035AD0C + (*(int*)&n >> 1);
   return n / u.f + u.f * 0.25f;
}

 

Advantages:

* The convergence of the function is greater than that of the first method.

* Generates times equal to or greater than the first method.

Disadvantages:

* The square root of “0” is a number very close to “0” but never “0”.

 

 

The 3 previous methods have something in common.

They are based on the definition of the Newton-Raphson Method. according to the function of the square root > f (x) = x ^ 2 - s.

 

well thanks to you for reading my forum.

well thanks to you for reading my forum.

Nice write up of some classic tricks!

I would strongly advise against using these today, though.

Quote

quick way to calculate the square root in c ++

These are all undefined behavior in C++.

Type punning through a union is allowed in C but not allowed in C++.

(For a historical note, SGI's/Carmack's original implementation didn't use a union and was also UB even in C.)

The only way to safely do type punning in C++ today is to use std::memcpy. In the upcoming C++20 you'll also be able to use std::bit_cast, which is basically just a wrapper around memcpy, though hypothetically compilers will use intrinsics to optimize bit_cast more reliably than their existing memcpy substitution optimizations.

Quote


static union{int i; float f;} u;
 

 

Using a `static` makes these thread unsafe.

If two threads both try to call these functions at the same time then bad things will happen.

Quote

These types of functions are very useful to gain some CPU time

These are all slower and less accurate on all modern CPUs than just using the 20-year-old Intel rsqrtss instruction or the ARM NEON equivalent. 

This is both because a dedicated instruction is faster than a series of data-dependent instructions, and also because moving values between an SSE scalar register and a GPR integer register has unnecessary overhead.

Keep your floats in SSE registers and use scalar SSE intrinsics, which the compiler will do for you for the most part.

Unfortunately there is no rsqrt in either C or C++ to portably expose the behavior of the rsqrtss instruction or equivalents on non-Intel architectures, but writing one that selectively uses the correct intrinsics is easy enough.

 

Sean Middleditch – Game Systems Engineer – Join my team!

Advertisement

Very good information and also useful in its entirety. You are right, the "static" member avoids continuous creation
of the union in exchange for speed but also completely limits the function. On the other hand,
I like being able to see a function written in c ++ on how to calculate
the square root in a more efficient way as you just described.
I would really like to see that function please.

Interesting.

But what's wrong with using std::sqrt() ?

Do modern processors have a hardware accelerated version ?

 

1 hour ago, Green_Baron said:

But what's wrong with using std::sqrt() ?

Do modern processors have a hardware accelerated version ?

 

Check the latencies on the link. This is for vectorized code, but I guess, the results are quite similar for serial instructions. `_mm_sqrt_ps` is the standard square root while `_mm_rsqrt_ps` calculates the reciprocal square root using an approximation formula as described by the TO. It is 3.25 times faster (on a skylake processor) but less accurate. If you perform a lot of square root calculations, this might be beneficial.

However, there are many situations where you can simply avoid calculating the square root. For example, when you want to know which vectors length is bigger, you can simply compare the squared lengths instead of calculating the length using the square root. No square root is faster than a fast square root ?

You should generally avoid fast square roots if the result is used in complex subsequent calculations (physics simulations, linear solvers like LLT or QR). The error can lead to unwanted numerical instabilities, especially if you are only using 32bit floats.  However, if accuracy is not an issue, you might get a performance boost. I am not sure if it really matters since the relative number of square root calculations per frame shouldn't be too high, but that depends on your game/program.

Greetings

 

I see. So its the usual tradeoff between speed and accuracy.

Yeah, i am aware. I don't use sqrt extensively for example in frustum and intersection tests. Was that grammatically correct ? :-)

Advertisement
Spoiler

 

The third method can also be written this way:


float sqrt3(const float& n)
{
   int i = 0x2035AD0C + (*(int*)&n >> 1);

   return n / *(float*)&i + *(float*)&i * 0.25f;
}

the static member was removed from the function as a quote from: SeanMiddleditch.

On 10/24/2019 at 12:34 AM, gustavo rincones said:
  Reveal hidden contents

 

The third method can also be written this way:



float sqrt3(const float& n)
{
   int i = 0x2035AD0C + (*(int*)&n >> 1);

   return n / *(float*)&i + *(float*)&i * 0.25f;
}

the static member was removed from the function as a quote from: SeanMiddleditch.

No it can't. That code breaks strict aliasing rules and is 100% undefined behavior ;) 

Also, the following three all produce the same code on gcc, clang and msvc with optimizations enabled (but only one is actually legal)


#include <string.h>
float sqrt1(float n)
{
    int ni;
    memcpy(&ni, &n, sizeof(n));
    int ui = 0x2035AD0C + (ni >> 1);
    float u;
    memcpy(&u, &ui, sizeof(u));
    return n / u + u * 0.25f;
}

float sqrt2(float n)
{
   union {int i; float f;} u; // no need for the static!
   u.i = 0x2035AD0C + (*(int*)&n >> 1);
   return n / u.f + u.f * 0.25f;
}

float sqrt3(float n)
{
   int i = 0x2035AD0C + (*(int*)&n >> 1);

   return n / *(float*)&i + *(float*)&i * 0.25f;
}

 

1 hour ago, l0calh05t said:

Also, the following three all produce the same code on gcc, clang and msvc with optimizations enabled (but only one is actually legal)

Its almost like compiler-writers know about these kinds of optimizations :D

But seriously, on one hand its awesome that compilers from being able to optimize usage of functions like sqrt (pretty much every compiler has settings for fast-math).
On the other hand, its even funnier that sometimes, compilers will figure out what extensive crap you are doing, and simply replace it with their generic optimized routine, or sometimes even with the thing you were actually trying to avoid :D (see Matt Godbolts talk from CppCon 2017 for some examples of this.

Today, "let the compiler figure it out" really applies at least for such local scope. Save optimizations for alogrithm/data-structure related matters that cannot be resolved by a simple compiler flag.

On 10/22/2019 at 5:14 AM, gustavo rincones said:

You are right, the "static" member avoids continuous creation
of the union in exchange for speed but also completely limits the function.

The "static" member also does everything but give you speed. Under optimization, both with or without it you get the same code (https://godbolt.org/z/ZBz6MM), because the compiler can figure out that you not using the value of the unoin in subsquent function call. But if itdidn't, that static would make the code much slower. You save one increment of the stack-pointer, but you pay the cost of heap-access, as well as (at least) a check on each access for multithreaded initialization.

This topic is closed to new replies.

Advertisement