Advertisement

Fast Trig function

Started by February 08, 2002 06:10 PM
47 comments, last by dragon376 23 years ago
Axter if I could afford VC++ professional I would. I'm just trying to combine my ASM/Hardware knowledge with my C++ programming efficiently. The key for me is learning what combinations of inline, fastcall, nakedcalls... are allowed.
I finally got the compiler to give me the result i wanted, well almost.
      inline float sine(float Angle){    __asm 	{        FLD Angle		FSIN        FWAIT	}}int main(int argc, char* argv[]){float d,e = 3.141592/2;//float* f=&ed=sine(e);printf("Sine 3.141592(radians) = %f \n",d);return 0;	}  


Compile this (use release, i did) and look at the at the assembly, no unnecessary stack movement. Now uncomment the pointer reference and pass the sine function *f. When i recompile and look at the disassembly this is what i get:
  ; Line 18	mov	DWORD PTR _e$[ebp], 1070141400; initilizes e; Line 19	lea	eax, DWORD PTR _e$[ebp]	mov	DWORD PTR _f$[ebp], eax ;initilizes f; Line 20	mov	ecx, DWORD PTR _f$[ebp]; THIS IS WHAT BOTHERS ME	mov	edx, DWORD PTR [ecx]	mov	DWORD PTR $T597[ebp], edx	fld	DWORD PTR $T597[ebp];    UPTO HERE; Line 21	fsin; Line 22	fwait; Line 20	fst	DWORD PTR _d$[ebp]      

Where i say this is what bothers me, why didn't the compiler just go: FLD dword ptr [EAX]. This is an extremely simple example but its inlined so why the extra memory references when the pointer is already in EAX? Do u get the same output with VC++ Professional edition? Now do u see why I even mentioned it?

Edit - A simple sin function could be called quite often, so saving some cycles in overhead is worth it at least IMO. The main problem i have is that there is a write to memory immediately followed by a read to the same location and this happens not once but twice in a row. My next experiments will be if the angle is stored in a struct and making another inline FPU supported function. Then I will see what happens if i call do some thing like sqrt(sine(x)); does the compiler unnecessarily move to memory and back? Axter if i write some source, could u post the disassembly and cycle counts or just tell me if you get the superfluous memory access's?

-potential energy is easily made kinetic-

Edited by - Infinisearch on February 13, 2002 3:21:33 AM

-potential energy is easily made kinetic-

Infinisearch,

I had to modify your asm function, as below. I provide my main function, as well as it’s disassembly. This is with all optimizations on. You should not look too close at debug code, as there was no attempt made to optimize it by the compiler.

    inline float sine(float Angle){      float fAns;   __asm 	  {            FLD   Angle		    FSIN    FSTP  [fAns]    FWAIT	  }   return fAns;}  int main(void){  float d, e = 3.141592f / 2.0f;     float* f = &e  d = sine(e);   printf("Sine 3.141592(radians) = %f \n", d);     return 0;}  499:  int main(void)500:  {00402AA0   push        ebp00402AA1   mov         ebp,esp00402AA3   sub         esp,8501:502:    float d, e = 3.141592f / 2.0f;00402AA6   mov         dword ptr [e],3FC90FD8h503:504:    float* f = &e505:    d = sine(e);00402AAD   fld         dword ptr [e]506:    printf("Sine 3.141592(radians) = %f \n", d);00402AB0   fsin507:00402AB2   fstp        dword ptr [ebp-8]508:    return 0;00402AB5   wait506:    printf("Sine 3.141592(radians) = %f \n", d);00402AB6   fld         dword ptr [ebp-8]00402AB9   sub         esp,800402ABC   fstp        qword ptr [esp]00402ABF   push        offset t+8Ch (0041124c)00402AC4   call        _printf (00403b39)00402AC9   add         esp,0Ch508:    return 0;00402ACC   xor         eax,eax509:  }00402ACE   mov         esp,ebp00402AD0   pop         ebp00402AD1   ret     


SS


Edited by - Axter on February 13, 2002 3:34:24 AM
SS
Advertisement
quote:
Original post by Axter

      return fRad - (fR3 * fF3i) + (fR5 * fF5i) - (fR7 * fF7i) + (fR9 * fF9i);    



It's an infinite series, and you've only calculated the first five terms. It'll be good enough for small angles, but at around sin(pi), it's already lost accuracy. And once you get to sin(pi * 3/2), you might as well forget it. And what if someone wants to calculate sin(34934)?

IMO, your best bet is to wrap the input to be between -pi and pi, and add one more term. With one more term, sin(pi) is -4.5 * 10-4 (by my TI-83), so it should be accurate enough.

EDIT: Ah, you're already wrapping the input. I recommend that you use division instead of repeated addition / subtraction, though.

Edited by - Beer Hunter on February 13, 2002 5:40:03 PM
quote:
Original post by Beer Hunter

EDIT: Ah, you''re already wrapping the input. I recommend that you use division instead of repeated addition / subtraction, though.




Well, I figured that in most cases the radians passed in would be close to (if not in) the range -pi to pi, so doing one or two while loops would be faster than division. But it depends on the application. If you expect large radian values often, then you should modify the range checking part to the best method.

I can add another term, I don''t think it''ll make it too much slower. Depends on what accuracy one needs.

SS

SS
I''ve worked on your function a bit. Here''s a new version:
  inline float SinFormula(float fRad) {    const float fPi  = 3.141592654f;    const float fF3i = 1.0f / 6.0f;    const float fF5i = 1.0f / 120.0f;    const float fF7i = 1.0f / 5040.0f;    const float fF9i = 1.0f / 362880.0f;    const float fF11i = 1.0f / 39916800.0f;#ifdef DEBUG    if(fRad > fPi || fRad < -fPi) {        fprintf(stderr, "Please clip the input to (-pi..pi), you lazy, deluded fool.\n");    }#endif    float fR2 = fRad * fRad;    float fR3 = fR2  * fRad;    float fR5 = fR2  * fR3;    float fR7 = fR2  * fR5;    float fR9 = fR2  * fR7;    float fR11 = fR2 * fR9;    return fRad - (fR3 * fF3i) + (fR5 * fF5i) - (fR7 * fF7i) + (fR9 * fF9i) - (fR11 * fF11i);}  
1)Surely any decent compiler has sin() and cos() calls as inline functions?
2)Is the thing about sin/cos being as quick as a table just a myth - it seems so few people can be bothered to test it so I''ll have a go and get back to you all...
3)Does 3dNow/SSE have a SIMD format for trig functions?

I don''t think we need to worry about compilers being able to deal with modern processors when talking about sin/cos - they''ve both been around since the 386!
Advertisement
OK, I did some interesting tests…

So you say you want fast?

For those that think VC++ can’t optimize code well, check this out.

Since the sine formula is an infinite formula with terms that expand in a known way, I re-wrote the formula so that each term is calculated by recursively calling a second auxiliary function. In this way, you can decide how many terms you want (and therefore accuracy) in any particular case. The SineRecurse function takes a second parameter after the radians value that specifies the number of terms required.

Here are the two function:

  #pragma inline_recursion(on)#pragma inline_depth(50) // auxiliary functioninline float SineRecurseB(float fAns, float fRad, const float fRad2, float fRadPow,                                  float fFact, const int nPrec, int nCur){	fRadPow *= -fRad2;	fFact   *= nCur++;	fFact   *= nCur++; 	fAns += fRadPow / fFact; 	if(nCur < nPrec)		return SineRecurseB(fAns, fRad, fRad2, fRadPow, fFact, nPrec, nCur);	else		return fAns;}  // Formula: angle = rad - rad^3 / 3! + rad^5 / 5! - rad^7 / 7! + rad^9 /9!...inline float SineRecurse(float fRad, int nPrec){	const float fRad2 = fRad * fRad;	float fAns = fRad; 	return SineRecurseB(fAns, fRad, fRad2, fRad, 1.0f, (nPrec << 1) + 1, 2);}  


Now here is the interesting part. In Release mode, if I pass in a constant value for both the radian value as well as the number of terms, VC++ COMPLETELY optimizes the function away! It is the same as simply assigning the actual result to a variable! So, the following two lines would result in EXACTLY the same compiled code:

// both these lines result in same output from compiler
float fResult1 = SineRecurse(1.23f, 10); // same as…
float fResult2 = 0.942489f;

In fact, I can specify a value for the term count of up to 16, and it will still come up with the actual final result. After I specify 17, the function actually gets inlined.

Now, it’s also important to note that you actually need to do SOMETHING with the result, because otherwise the compiler will simply remove everything, as it figures the result is not required at all. So what I do is multiply the result with 100 000 and return it as the result from the main function, like so:

  int main(void){  return int(100000.0f * SineRecurse(1.23f, 16));}   


This results in the following asm output:

  574:  int main(void)575:  {00402AA0   mov         eax,17028h576:    return int(100000.0f * SineRecurse(1.23f, 16));577:  }00402AA5   ret  


Note that 17028h is 94248 which is sin(1.23)x100 000. (the simple int conversion does not round up properly, but the floating point result is correct anyway, I have checked…)

The following function:

  int main(void){  return int(100000.0f * sin(1.23));}  


…produces the output:
  574:  int main(void)575:  {00402AA0   fld         qword ptr [__real@8@3fff9d70a3d70a3d7000 (004101a8)]00402AA6   fsin00402AA8   fmul        qword ptr [__real@8@400fc350000000000000 (004101a0)]00402AAE   jmp         __ftol (00403ed8)--- No source file  ---------------------------------------------------------------------------------------------------------------00402AB3   nop00402AB4   nop00402AB5   nop00402AB6   nop00402AB7   nop00402AB8   nop00402AB9   nop00402ABA   nop00402ABB   nop00402ABC   nop00402ABD   nop00402ABE   nop00402ABF   nop-------------------------------576:    return int(100000.0f * sin(1.23));577:  }  


Don’t ask me to explain that, I don’t know what’s going on there…

Now, if you do not pass in a constant for the radian parameter, the compiler of course cannot return a final result, but it’s possible to completely inline the recursive function up to 10 when specifying just “inline”, but when using the MS specific “__forceinline”, it will expand it up to whatever you like. I tested it up to a recurse depth of 50, and it still inlined. Of course, at 50 terms the variables have long since overflowed (what’s 101 factorial again…?).

Here is the code from an inline depth of 8: (note I used timeGetTime() to get a variable to pass in, else the compiler would optimize everything away…)
  576:  int main(void)577:  {00402AA0   sub         esp,0Ch578:    float fRad = float(timeGetTime() % 100) / 50.0f;00402AA3   call        dword ptr [__imp__timeGetTime@0 (004100f4)]00402AA9   xor         edx,edx00402AAB   mov         ecx,64h00402AB0   div         eax,ecx00402AB2   mov         dword ptr [esp+8],000402ABA   mov         dword ptr [esp+4],edx00402ABE   fild        qword ptr [esp+4]00402AC2   fmul        dword ptr [__real@4@3ff9a3d70a3d70a3d800 (004101c4)]579:580:    return int(100000.0f * SineRecurse(fRad, 8));00402AC8   fld         st(0)00402ACA   fmul        st,st(1)00402ACC   fstp        dword ptr [esp]00402AD0   fld         st(0)00402AD2   fmul        dword ptr [esp]00402AD6   fchs00402AD8   fst         dword ptr [esp+4]00402ADC   fmul        dword ptr [__real@4@3ffcaaaaaaaaaaaaa800 (004101c0)]00402AE2   fxch        st(1)00402AE4   faddp       st(1),st00402AE6   fld         dword ptr [esp+4]00402AEA   fmul        dword ptr [esp]00402AEE   fchs00402AF0   fst         dword ptr [esp+4]00402AF4   fmul        dword ptr [__real@4@3ff88888888888888800 (004101bc)]00402AFA   fxch        st(1)00402AFC   faddp       st(1),st00402AFE   fld         dword ptr [esp+4]00402B02   fmul        dword ptr [esp]00402B06   fchs00402B08   fst         dword ptr [esp+4]00402B0C   fmul        dword ptr [__real@4@3ff2d00d00d00d00d000 (004101b8)]00402B12   fxch        st(1)00402B14   faddp       st(1),st00402B16   fld         dword ptr [esp+4]00402B1A   fmul        dword ptr [esp]00402B1E   fchs00402B20   fst         dword ptr [esp+4]00402B24   fmul        dword ptr [__real@4@3fecb8ef1d2ab639a000 (004101b4)]00402B2A   fxch        st(1)00402B2C   faddp       st(1),st00402B2E   fld         dword ptr [esp+4]00402B32   fmul        dword ptr [esp]00402B36   fchs00402B38   fst         dword ptr [esp+4]00402B3C   fmul        dword ptr [__real@4@3fe5d7322b3faa272000 (004101b0)]00402B42   fxch        st(1)00402B44   faddp       st(1),st00402B46   fld         dword ptr [esp+4]00402B4A   fmul        dword ptr [esp]00402B4E   fchs00402B50   fst         dword ptr [esp+4]00402B54   fmul        dword ptr [__real@4@3fdeb092309d43684800 (004101ac)]00402B5A   fxch        st(1)00402B5C   faddp       st(1),st00402B5E   fld         dword ptr [esp+4]00402B62   fmul        dword ptr [esp]00402B66   fchs00402B68   fst         dword ptr [esp+4]00402B6C   fmul        dword ptr [__real@4@3fd6d73f9f399dc0f800 (004101a8)]00402B72   fxch        st(1)00402B74   faddp       st(1),st00402B76   fld         dword ptr [esp+4]00402B7A   fmul        dword ptr [esp]00402B7E   fchs00402B80   fmul        dword ptr [__real@4@3fceca963b81856a5000 (004101a4)]00402B86   fxch        st(1)00402B88   faddp       st(1),st00402B8A   fmul        dword ptr [__real@4@400fc350000000000000 (004101a0)]00402B90   call        __ftol (00403fd8)581:  }00402B95   add         esp,0Ch00402B98   ret  

Here are some clock cycles for various cases:

AMD 1.33GHz
sin(): ~86 cycles
SinRecurse() ~ 22 cycles (with 4 terms)
SinRecurse() ~ 34 cycles (with 6 terms)
SinRecurse() ~ 64 cycles (with 8 terms)
SinRecurse() ~ 85 cycles (with 10 terms)

PentiumIII 866MHz
Sin(): Not tested
SinRecurse() ~ 41 cycles (with 4 terms)
SinRecurse() ~ 60 cycles (with 6 terms)
SinRecurse() ~ 82 cycles (with 8 terms)
SinRecurse() ~ 101 cycles (with 10 terms)

Feel free to come up with a better recursive implementation; I did not spend a lot of time looking for ways to optimize it. It’s interesting to note that, although there is a division in the recurse function (for the factorial), there is no division in the final code, as the compiler pre-calculated the inverse of the factorials, even after I looked at a recursive depth of 50.

So there you have it. Again, make of that what you will…

SS


Edited by - Axter on February 16, 2002 3:44:00 PM
SS
So, MSVC can identify what values can be determined at compile time. I''ve got an ancient pascal compiler that optimises "x := 56 * 23 - (67 div 45);" into "x := 1287;". MSVC takes the idea further than that, but it''s still a pretty trivial optimisation. Once the function has been expanded inline, it''s not unlike this:
  int T = 5;if(T < 60) {    T *= 2;} else {    T *= 3;}  

I would expect any modern compiler to optimise that.

For speed measurements, try inputting a number from the console and putting that number through SinRecurse(), SinFormula(), and if you want, SinIterate().
Looking cool.
Seems to me that Axter has halved the library sine func.
That so ?
Thats really nice.

Bugle4d

Edited by - Vlion on February 16, 2002 9:55:26 PM
~V'lionBugle4d

This topic is closed to new replies.

Advertisement