Vlion - In radians:
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - ...
cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! - ...
tan(x) = sin(x) / cos(x)
Have fun...
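For anyone who wants to play with the series numerically, here is a minimal sketch (my own code, not from any poster in the thread) that evaluates the sin series with Horner's scheme, so each extra term costs only a couple of multiplies; x should be range-reduced to roughly [-pi, pi] first:

```cpp
#include <cassert>
#include <cmath>

// Taylor-series sine, evaluated with nested (Horner) form:
//   sin(x) = x * (1 - x^2/(2*3) * (1 - x^2/(4*5) * (1 - ...)))
// "terms" controls how many series terms are used.
double taylor_sin(double x, int terms = 5) {
    double x2 = x * x;
    double result = 0.0;
    // Build the polynomial from the highest-order term inward.
    for (int k = terms - 1; k >= 1; --k) {
        double n = 2.0 * k;
        result = x2 / (n * (n + 1.0)) * (1.0 - result);
    }
    return x * (1.0 - result);
}
```

With 5 terms (up to x^9/9!) this already tracks the library sin very closely for small arguments.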
Fast Trig function
quote:
Original post by Beer Hunter
Vlion - In radians:
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - ...
cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! - ...
tan(x) = sin(x) / cos(x)
Have fun...
Yea, good luck trying to make THAT run faster than 85 cycles...
SS
-
Vlion - In radians:
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - ...
cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! - ...
tan(x) = sin(x) / cos(x)
Have fun...
-
I assume:
the sequence is all odd powers (for sin);
it is an infinite sequence that converges;
it generalizes to
x^n/n! - x^(n+2)/(n+2)! + x^(n+4)/(n+4)! - ...
with n = 0 for cos(x), and similarly (n = 1) for sin.
Given time, perhaps someone can get an optimization implemented.
Does anyone know if the Intel x86 architectures up to the P4 have a sin and cos instruction?
Bugle4d
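The generalization above can be computed without ever forming a factorial explicitly: each term is the previous one times -x^2/((2k+1)(2k+2)). A hedged sketch (the function name and term count are my own choices, not anything from the thread):

```cpp
#include <cassert>
#include <cmath>

// Taylor cosine using the term recurrence
//   term_{k+1} = term_k * (-x^2) / ((2k+1)(2k+2))
// so no factorial or division by a growing constant is needed per term.
double taylor_cos(double x, int terms = 6) {
    double term = 1.0;   // first term: x^0 / 0!
    double sum = term;
    for (int k = 0; k < terms - 1; ++k) {
        term *= -x * x / ((2.0 * k + 1.0) * (2.0 * k + 2.0));
        sum += term;
    }
    return sum;
}
```

The same recurrence works for sin by starting from term = x instead of 1.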
I tried the math.h library and single-stepped through the sin function, and there is some extra crap... So try BeerHunter's with only sin(), but use a register calling convention if your compiler supports it. Also, if the sine function is being called from a floating-point-intensive section, I would suggest inspecting the disassembly of the section as a whole. If there is a bunch of stack manipulation that seems unnecessary, hand-code the whole section in assembly to remove the unnecessary pushes and pops.
-potential energy is easily made kinetic-
quote:
Original post by Vlion
Does anyone know if the Intel x86 architectures up to the P4 have a sin and cos instruction?
Yes, it does.
OK, I just tested the straight sin formula, compared it to the x86 fsin call, as well as BeerHunter’s asm version. I list the code below, as well as the results.
Note that, unfortunately, the range checking code slows it down, see the comment.
Here’s the code – feel free to optimize it…
Note: The compiler pre-calculates the inverse factorial constants, so there is no division in the final code.
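The original listing did not survive in this copy of the thread, so here is a hedged reconstruction of what a 4-term formula with precompiled inverse-factorial constants and a range-checking step probably looked like (the name SinFormula, the exact constants, and the reduction style are my guesses, not Axter's actual code):

```cpp
#include <cassert>
#include <cmath>

// Sketch of a 4-term polynomial sine: the 1/3!, 1/5!, 1/7! divisions are
// written as constants, so the compiled code contains only multiplies and
// adds. The range reduction to [-pi, pi] is the part reported to cost the
// extra cycles.
inline float SinFormula(float x) {
    x = std::fmod(x, 6.2831853f);
    if (x > 3.1415927f)  x -= 6.2831853f;
    if (x < -3.1415927f) x += 6.2831853f;

    const float inv3f = 1.0f / 6.0f;      // 1/3!
    const float inv5f = 1.0f / 120.0f;    // 1/5!
    const float inv7f = 1.0f / 5040.0f;   // 1/7!
    float x2 = x * x;
    // Horner form of x - x^3/3! + x^5/5! - x^7/7!
    return x * (1.0f - x2 * (inv3f - x2 * (inv5f - x2 * inv7f)));
}
```

Dropping the first three lines gives the "no range checking" variant from the timings below.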
Results:
x86 fsin: ~84 cycles
BeerHunter: ~148 cycles
Formula: ~29 cycles
Formula: ~21 cycles (no range checking)
All functions were inlined, and the disassembly was checked to make sure there was no funny stuff going on…
Again, make what you will of that. The formula with just 4 terms seems to be accurate to about 5 to 6 decimal places, quite amazing, if you ask me.
The only thing though: You will probably use other floating point math around the sin call, so the formula method could end up being slower than the fsin method because it probably ties up more fp registers.
BTW, this was run on AMD 1.33GHz, not Intel, so I don’t know how this would work on Intel.
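The cycle counts above came from an unspecified harness on the poster's machine. For anyone re-running the comparison today, here is a rough sketch using std::chrono instead of raw cycle counts (the harness and names are mine, not from the thread):

```cpp
#include <chrono>
#include <cmath>

// Time many calls of f and return the average nanoseconds per call.
// The volatile sink keeps the compiler from optimizing the loop away.
template <typename F>
double time_ns_per_call(F f, int iters = 1000000) {
    volatile double sink = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        sink = sink + f(0.001 * i);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

Usage would be something like comparing time_ns_per_call([](double x){ return std::sin(x); }) against the polynomial version; absolute numbers will differ wildly across CPUs, as the AMD-vs-Intel caveat above already suggests.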
SS
Edited by - Axter on February 11, 2002 12:52:19 AM
Edited by - Axter on February 11, 2002 12:56:41 AM
Axter, could you post the results if you pass a pointer to a float instead of the float itself, and use a register (fast) call convention (still inlined)? Also, if you could reduce the FPU's internal precision to single precision, it would be interesting to see if that cuts down the cycles.
// With fastcall, parameters are passed in ECX or EDX.
// If ECX doesn't work, try EDX.
inline void __fastcall Sine(float* fRad)
{
    _asm
    {
        FLD dword ptr [ECX]
        FSIN
        FSTP dword ptr [ECX]
        FWAIT
    }
}

// Same thing, with a separate pointer for the output.
inline void __fastcall Sine(float* fRad, float* output)
{
    _asm
    {
        FLD dword ptr [ECX]
        FSIN
        FSTP dword ptr [EDX]
        FWAIT
    }
}
Lastly, take a look at the Approximate Math Library on this page; I haven't looked at it myself, but Intel is claiming it's faster than the x86 FPU:
http://developer.intel.com/design/pentiumiii/devtools/
-potential energy is easily made kinetic-
Infinisearch...
I did a quick test on your code, but I could not get the second one to work. How do I pass back the result of the calculation? Sorry, I’m not an asm guru.
The first one works in debug mode, but causes the app to crash during execution in release mode. I did not have enough time to see what was causing this.
But I’m curious… how can the calling convention help if the function is inlined anyway? The compiler does a bunch of optimization that usually ends up producing the same code in both cases.
So passing a float pointer, as opposed to a float value, will probably also end up as the same code after optimization, if the functions are inlined.
Also, I looked at the link you gave for the Intel library. I might test it in a few days or so, but I was under the impression that switching between MMX and non-MMX mode is costly, something like 40 cycles, so I don’t know how it could speed things up at all.
Maybe it’s got to do with precision. The code I gave is probably only accurate to 5 or 6 decimal places, while the fsin and MMX versions are double precision.
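To put a number on the precision point, here is a quick sweep (my own check, not from the thread) that measures the worst-case error of the 4-term polynomial against the library sin:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Worst-case absolute error of the 4-term sine polynomial
// x - x^3/3! + x^5/5! - x^7/7! over [-pi/2, pi/2], vs std::sin.
double max_error_4term() {
    double worst = 0.0;
    for (double x = -1.5707963; x <= 1.5707963; x += 0.001) {
        double x2 = x * x;
        double approx =
            x * (1.0 - x2 * (1.0 / 6.0 - x2 * (1.0 / 120.0 - x2 / 5040.0)));
        worst = std::max(worst, std::fabs(approx - std::sin(x)));
    }
    return worst;
}
```

Near the ends of that range the truncation error is on the order of 1e-4, so the 5-to-6-digit figure presumably applies to smaller angles.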
SS
quote:
Original post by Axter
How can the calling convention help if the function is inlined anyway? The compiler usually does a bunch of optimization that would usually end up with the same code in both cases.
You can't move a value from an integer register like EAX to an FPU register directly (also, if you didn't already know, the x86 FPU is stack-based; all operations affect only the top of the stack). You have to move the value to memory first, then load it into the FPU. Nowadays you should be able to accomplish this with MMX instructions (MOVD), since the MMX registers are the FPU registers, but I don't know if the compiler does this.
When I viewed the disassembly of Beer Hunter's code, there was still a whole lot of stack manipulation right before and at the beginning of the call. At least, that's what I got (I have VC++ Introductory), so I figured using the fastcall convention would get rid of the unnecessary memory references. By passing a pointer in a register you should be able to get rid of a lot of stack manipulation; all the caller has to do is make sure the pointer is in the right register. By eliminating unnecessary references to memory, you should see a speedup.
But it really depends on the context in which you call the code. If the angle is stored away in a struct, then what I'm saying should speed things up; but if you calculate the angle in radians right before you call the sin function, the angle is already in the right position in the FPU, so you can just execute an FSIN. That is why I suggested hand-coding the whole section in assembly. The problem is that you don't know how the compiler handles these different situations unless you sit there and view the disassembly of the code in different situations, in both debug and release mode, which is what I'm about to do.
The VC++ docs say that with the fastcall convention, the first two parameters are passed in ECX and EDX. I disassembled the release version of a simple program running my code, and it's not putting the pointer value in ECX. I'm not a VC++ expert, and I only have the Introductory edition, but I'll try to fix it. I'm going by what the documentation says. I also read that if you have a function returning a float, you should leave it in ST(0) (the top of the FP register stack). So instead of my second example, try this:
inline float __fastcall Sine(float* fRad)
{
    _asm
    {
        FLD dword ptr [ECX]
        FSIN
        FWAIT
    }
}
Edit: the above code only works if you remove the inline keyword; if you leave the inline, the function won't return a value.
Edit #2: it worked once in a very simple program without the inline, then I made it a little more complex and now it doesn't work.
Anyway, the optimizations section is greyed out in my VC++, so maybe the Introductory edition doesn't optimize, which is why I was seeing the extra stack code. In which case you are right: the fastcall convention wouldn't provide any extra performance over just inlining on an optimizing compiler. The main thing I wanted you to try, however, was switching the internal precision; I think the FSIN will happen in 16 cycles (on a P6-based core).
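For reference, on a glibc/x86 system the internal precision switch being suggested here goes through the x87 control word. A hedged sketch (my own code; note that the precision-control field officially affects add/sub/mul/div/sqrt, so whether FSIN actually gets faster has to be measured):

```cpp
#include <cassert>
#include <fpu_control.h>   // glibc x87 control-word macros (x86 only)

// Drop the x87 internal precision from the default 64-bit extended
// down to 24-bit single precision.
void set_x87_single_precision() {
    fpu_control_t cw;
    _FPU_GETCW(cw);
    // _FPU_EXTENDED (0x300) doubles as the mask for the 2-bit
    // precision-control field; _FPU_SINGLE is 0.
    cw = (cw & ~_FPU_EXTENDED) | _FPU_SINGLE;
    _FPU_SETCW(cw);
}
```

On MSVC of this era the equivalent would be done through _controlfp.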
-potential energy is easily made kinetic-
Edited by - Infinisearch on February 12, 2002 7:38:42 PM
quote:
Original post by Infinisearch
Anyway, the optimizations section is greyed out in my VC++, so maybe the Introductory edition doesn't optimize, which is why I was seeing the extra stack code. In which case you are right: the fastcall convention wouldn't provide any extra performance over just inlining on an optimizing compiler. The main thing I wanted you to try, however, was switching the internal precision; I think the FSIN will happen in 16 cycles (on a P6-based core).
As far as I know, the editions of VC++ below Professional do not do optimization. I think this is to prevent companies from buying the cheaper versions and writing production code with them.
Also, I remember reading somewhere that compilers are pretty good these days at optimizing, and unless there is a really good reason, it’s not usually productive to hand-optimize. I tend to agree. I believe that writing asm makes the code difficult to understand, debug, and maintain (try re-writing your algorithm to do something a different way…), and just more error-prone.
I once followed an Intel tutorial from their web-site on optimizing floating point code, and tried their example code. I found that it ran almost exactly as fast as the c code equivalent.
Don’t get me wrong, there are cases where asm is better (like in DSP, image processing, etc), but for most people, spending that same time to fix other parts of the code can often result in better overall results.
I will try your new example tonight, and see what it does. You should experiment with the Professional version of VC++, as it produces pretty good results. I think in most cases you’d be hard-pressed to improve on it by even 5% to 10%.
What I can do if you want, is give you the disassembled code from a certain function (like the SinFormula I posted above), and you can see if you can improve on that. Should be interesting. Part of what makes the compiler version so fast is that it often interleaves different parts of your code to improve results, where a person would often concentrate on a specific, smaller part of the code.
SS
There is a shedload of MMX and 3DNow! optimised assembly code on AMD's web site here.
In the SDK download there's sin, cos, sincos, etc. However, I tried it a few weeks back and found almost identical timings against plain C++! (I should probably admit I'm no asm guru, though.)