3 quick ways to calculate the square root in c++

Started by October 22, 2019 01:17 AM
37 comments, last by l0calh05t 4 years, 11 months ago

l0calh05t said:

Furthermore, there may be a good reason that the compiler's heuristics do not select vdpps: https://unix4lyfe.org/vdpps-is-slow/



I recall benchmarking the dot product intrinsics against a handwritten version. If I remember correctly, the intrinsics were faster. Maybe I'll redo that later and post the results ;)

However, at least if you are messing around with intrinsics, I think one should avoid calculating single dot products whenever possible. I always try to calculate multiple dot products at once to take maximal advantage of vectorization.
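
A minimal sketch of the "several dot products at once" idea, written as plain C++ over a structure-of-arrays layout rather than with intrinsics. The `Vec4Batch` type and the `dot_batch` function are hypothetical names chosen for illustration, not code that was posted in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays batch: element i of x, y, z and w
// together forms the i-th 4D vector. All four arrays must have equal length.
struct Vec4Batch {
    std::vector<float> x, y, z, w;
};

// Computes one dot product per batch element in a single pass. Every
// iteration does the same arithmetic on neighbouring floats, which is the
// access pattern packed multiply/add instructions are designed for.
inline std::vector<float> dot_batch(const Vec4Batch& a, const Vec4Batch& b) {
    const std::size_t n = a.x.size();
    assert(a.y.size() == n && a.z.size() == n && a.w.size() == n);
    assert(b.x.size() == n && b.y.size() == n && b.z.size() == n && b.w.size() == n);

    std::vector<float> result(n);
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i]
                  + a.z[i] * b.z[i] + a.w[i] * b.w[i];
    }
    return result;
}
```

Whether a given compiler actually vectorizes this loop depends on the flags and the target, so it is worth checking the generated assembly and benchmarking it against the intrinsic version.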

Greetings

"float x[4] when passed as a function parameter in C or C++ is always actually a pointer by definition/specification (independent of the ABI)!"
"On both Windows and Linux ABIs a struct containing four floats (whether as scalars or as an array) will NOT be passed in registers."

You are wrong about this. Please read https://www.uclibc.org/docs/psABI-x86_64.pdf Section 3.2.3 Parameter Passing:

SSE The class consists of types that fit into a vector register.
MEMORY This class consists of types that will be passed and returned in memory via the stack.

Arguments of types float, double, _Decimal32, _Decimal64 and __m64 are in class SSE.
The classification of aggregate (structures and arrays) and union types works as follows: 1. If the size of an object is larger than four eightbytes, or it contains unaligned fields, it has class MEMORY.
3. If the class is SSE, the next available vector register is used, the registers are taken in the order from %xmm0 to %xmm7.

A footnote on page 18 clarifies that even double[4] arrays can be passed in registers on modern processors: This in turn ensures that for processors that do support the __m256 type, if the size of an object is four eightbytes and the first eightbyte is SSE and all other eightbytes are SSEUP, it can be passed in a register.
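
To make the size rule quoted above concrete, here is a rough sketch that only counts eightbytes for a few aggregates. It assumes the usual 4-byte float and 8-byte double, the struct and function names are made up for illustration, and it deliberately ignores both the alignment condition and the per-eightbyte SSE/SSEUP/MEMORY classification that decides the final outcome.

```cpp
#include <cstddef>

// Hypothetical aggregates used only for illustration.
struct Vec4f { float x, y, z, w; };   // 16 bytes with 4-byte floats
struct Vec4d { double v[4]; };        // 32 bytes with 8-byte doubles
struct Vec5d { double v[5]; };        // 40 bytes with 8-byte doubles

// Number of 8-byte chunks ("eightbytes") an object of this size occupies.
constexpr std::size_t eightbytes(std::size_t size) {
    return (size + 7) / 8;
}

// Rule 1 as quoted: more than four eightbytes always means class MEMORY.
// Passing this check is necessary for register passing, not sufficient.
constexpr bool forced_to_memory_by_size(std::size_t size) {
    return eightbytes(size) > 4;
}

static_assert(eightbytes(sizeof(Vec4f)) == 2, "four floats span two eightbytes");
static_assert(eightbytes(sizeof(Vec4d)) == 4, "four doubles span four eightbytes");
static_assert(!forced_to_memory_by_size(sizeof(Vec4d)), "not excluded by size alone");
static_assert(forced_to_memory_by_size(sizeof(Vec5d)), "five doubles exceed the limit");
```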

Cheers,
bzt

The SysV ABI is entirely irrelevant to the first point, as that is part of the C (and C++) language definition! To quote section 8.3.5 of the C++ Standard (the C standard has a similar clause):

After determining the type of each parameter, any parameter of type “array of T” or “function returning T” is adjusted to be “pointer to T” or “pointer to function returning T,” respectively.

You can even see it in the compiler output of the float dot(float a[4], float b[4]) function on both GCC (Linux) and MSVC (Windows):

https://godbolt.org/z/wdvB87
https://godbolt.org/z/kU-UuY
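
The parameter adjustment can also be checked at compile time, independent of any ABI. This is a small sketch: `dot` follows the signature discussed here, while `zero_first` and `demo_array_parameter` are hypothetical helpers added for illustration.

```cpp
#include <cassert>
#include <type_traits>

// Declared with array syntax, as in the signature discussed in the thread.
float dot(float a[4], float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// The bound is dropped during parameter adjustment: the function's type is
// exactly the same as if it had been declared with float* parameters.
static_assert(std::is_same<decltype(dot), float(float*, float*)>::value,
              "array parameters are adjusted to pointers");

// Because only an address is passed, writes in the callee reach the caller.
void zero_first(float v[4]) {
    v[0] = 0.0f;
}

// Small demonstration that the caller's array is modified, not a copy.
inline void demo_array_parameter() {
    float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    zero_first(v);
    assert(v[0] == 0.0f);
}
```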

W.r.t. the second point I will concede that I was wrong about Linux (to which the SysV ABI applies), but not about Windows (to which the SysV ABI does not apply). See the output in the above links. Everything is passed via memory. Even with the "new" __vectorcall ABI, the first struct is passed in xmm0-xmm3 as individual floats, but not as a vector and the second one remains in memory. Note also that neither compiler aligns the struct to a 16-byte boundary (and doing so would break ABI compatibility).

However, at least if you are messing around with intrinsics, I think one should avoid calculating single dot products whenever possible. I always try to calculate multiple dot products at once to take maximal advantage of vectorization.

Reiterating from the comment several pages ago, that seems to be the common issue in this entire discussion.

Micro-optimizations do have their place. There was an era where square root times were a significant bottleneck in some code, particularly in graphics and physics code. However, these days the bottlenecks are generally elsewhere.

On modern hardware the biggest bottlenecks tend to be caches and keeping the CPU and GPU fed. I haven't noticed square roots as a blip in a profiler for nearly two decades. There is so much asynchronous processing and out-of-order processing internally to the chips that the individual operations aren't blocking.

Better data batching and broad algorithmic changes will give improvements several orders of magnitude larger than a micro-optimization that tunes a single operation for one specific chip, one compiler, and one ABI.
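
As a rough illustration of what batching means for square roots specifically, the sketch below computes the lengths of many 3D vectors in one contiguous loop instead of calling a tuned sqrt routine once per object. The `lengths` helper and its separate x/y/z array layout are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: lengths of n 3D vectors stored as separate x/y/z
// arrays of equal size. The inputs are read sequentially, which keeps the
// memory access pattern simple for the cache and the prefetcher.
inline std::vector<float> lengths(const std::vector<float>& x,
                                  const std::vector<float>& y,
                                  const std::vector<float>& z) {
    assert(x.size() == y.size() && y.size() == z.size());

    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::sqrt(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
    }
    return out;
}
```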

W.r.t. the second point I will concede that I was wrong about Linux (to which the SysV ABI applies), but not about Windows (to which the SysV ABI does not apply)

Correction, I was only partially wrong. The vec4 structs are passed in xmm* registers... as individual floats, not as a vector!

The SysV ABI is entirely irrelevant to the first point, as that is part of the C (and C++) language definition!

Another part of the C language definition is that the generated code has to behave "as if" it were following the original code. One of the things that means in practice is that the compiler does a pointer propagation analysis pass and if it detects that the pointer is not written to (and that includes an indexed offset to the pointer) then it's treated as a dereferenced const rvalue instead. In other words, the compiler can emit code to pass the float[4] in an f128 register instead of as a pointer on the stack, and still be conforming to the language standard. The ABI explicitly allows this as well, by specifying which registers can be clobbered and which must be saved on context switches.

Stephen M. Webb
Professional Free Software Developer

Bregma said:
The SysV ABI is entirely irrelevant to the first point, as that is part of the C (and C++) language definition!

Another part of the C language definition is that the generated code has to behave "as if" it were following the original code. One of the things that means in practice is that the compiler does a pointer propagation analysis pass and if it detects that the pointer is not written to (and that includes an indexed offset to the pointer) then it's treated as a dereferenced const rvalue instead. In other words, the compiler can emit code to pass the float[4] in an f128 register instead of as a pointer on the stack, and still be conforming to the language standard. The ABI explicitly allows this as well, by specifying which registers can be clobbered and which must be saved on context switches.



While the "as if" rule allows this within a compilation unit (or when using LTO/LTCG) - which I already said a few posts ago - this cannot be done across normal linking boundaries due to nasty little things like const_cast. Also pointer aliasing will make this impossible in many, many cases.
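
A small sketch of both obstacles, using hypothetical function names. If either function lived in a separately linked object file, the caller's compiler could not know whether the callee writes through the pointer or whether two pointers overlap, so it has to pass real addresses.

```cpp
#include <cassert>

// The signature promises const, but a definition in another translation unit
// may cast it away. That is well defined as long as the pointed-to object
// was not itself declared const, so the caller must pass a real address.
void scale_in_place(const float* v) {
    float* w = const_cast<float*>(v);
    for (int i = 0; i < 4; ++i) {
        w[i] *= 2.0f;
    }
}

// Aliasing: if out and in overlap, the second line re-reads a value the
// first line just changed. Passing in[] by value would change the result.
void sum_twice(float* out, const float* in) {
    out[0] = in[0] + in[1];
    out[1] = in[0] + in[1];
}

inline void demo_pointer_semantics() {
    float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    scale_in_place(v);
    assert(v[0] == 2.0f && v[3] == 8.0f);   // caller sees the writes

    float a[2] = {1.0f, 2.0f};
    sum_twice(a, a);                        // out aliases in
    assert(a[0] == 3.0f && a[1] == 5.0f);   // 5, not 3, because in[0] changed
}
```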

This topic is closed to new replies.
