OK, I did some interesting tests…
So you say you want fast?
For those that think VC++ can’t optimize code well, check this out.
Since the sine formula is an infinite formula with terms that expand in a known way, I re-wrote the formula so that each term is calculated by recursively calling a second auxiliary function. In this way, you can decide how many terms you want (and therefore accuracy) in any particular case. The SineRecurse function takes a second parameter after the radians value that specifies the number of terms required.
Here are the two function:
#pragma inline_recursion(on)#pragma inline_depth(50) // auxiliary functioninline float SineRecurseB(float fAns, float fRad, const float fRad2, float fRadPow, float fFact, const int nPrec, int nCur){ fRadPow *= -fRad2; fFact *= nCur++; fFact *= nCur++; fAns += fRadPow / fFact; if(nCur < nPrec) return SineRecurseB(fAns, fRad, fRad2, fRadPow, fFact, nPrec, nCur); else return fAns;} // Formula: angle = rad - rad^3 / 3! + rad^5 / 5! - rad^7 / 7! + rad^9 /9!...inline float SineRecurse(float fRad, int nPrec){ const float fRad2 = fRad * fRad; float fAns = fRad; return SineRecurseB(fAns, fRad, fRad2, fRad, 1.0f, (nPrec << 1) + 1, 2);}
|
Now here is the interesting part. In Release mode, if I pass in a constant value for both the radian value as well as the number of terms, VC++ COMPLETELY optimizes the function away! It is the same as simply assigning the actual result to a variable! So, the following two lines would result in EXACTLY the same compiled code:
// both these lines result in same output from compiler
float fResult1 = SineRecurse(1.23f, 10); // same as…
float fResult2 = 0.942489f;
In fact, I can specify a value for the term count of up to 16, and it will still come up with the actual final result. After I specify 17, the function actually gets inlined.
Now, it’s also important to note that you actually need to do SOMETHING with the result, because otherwise the compiler will simply remove everything, as it figures the result is not required at all. So what I do is multiply the result with 100 000 and return it as the result from the main function, like so:
int main(void){ return int(100000.0f * SineRecurse(1.23f, 16));}
|
This results in the following asm output:
574: int main(void)575: {00402AA0 mov eax,17028h576: return int(100000.0f * SineRecurse(1.23f, 16));577: }00402AA5 ret
|
Note that 17028h is 94248 which is sin(1.23)x100 000. (the simple int conversion does not round up properly, but the floating point result is correct anyway, I have checked…)
The following function:
int main(void){ return int(100000.0f * sin(1.23));}
|
…produces the output:
574: int main(void)575: {00402AA0 fld qword ptr [__real@8@3fff9d70a3d70a3d7000 (004101a8)]00402AA6 fsin00402AA8 fmul qword ptr [__real@8@400fc350000000000000 (004101a0)]00402AAE jmp __ftol (00403ed8)--- No source file ---------------------------------------------------------------------------------------------------------------00402AB3 nop00402AB4 nop00402AB5 nop00402AB6 nop00402AB7 nop00402AB8 nop00402AB9 nop00402ABA nop00402ABB nop00402ABC nop00402ABD nop00402ABE nop00402ABF nop-------------------------------576: return int(100000.0f * sin(1.23));577: }
|
Don’t ask me to explain that, I don’t know what’s going on there…
Now, if you do not pass in a constant for the radian parameter, the compiler of course cannot return a final result, but it’s possible to completely inline the recursive function up to 10 when specifying just “inline”, but when using the MS specific “__forceinline”, it will expand it up to whatever you like. I tested it up to a recurse depth of 50, and it still inlined. Of course, at 50 terms the variables have long since overflowed (what’s 101 factorial again…?).
Here is the code from an inline depth of 8: (note I used timeGetTime() to get a variable to pass in, else the compiler would optimize everything away…)
576: int main(void)577: {00402AA0 sub esp,0Ch578: float fRad = float(timeGetTime() % 100) / 50.0f;00402AA3 call dword ptr [__imp__timeGetTime@0 (004100f4)]00402AA9 xor edx,edx00402AAB mov ecx,64h00402AB0 div eax,ecx00402AB2 mov dword ptr [esp+8],000402ABA mov dword ptr [esp+4],edx00402ABE fild qword ptr [esp+4]00402AC2 fmul dword ptr [__real@4@3ff9a3d70a3d70a3d800 (004101c4)]579:580: return int(100000.0f * SineRecurse(fRad, 8));00402AC8 fld st(0)00402ACA fmul st,st(1)00402ACC fstp dword ptr [esp]00402AD0 fld st(0)00402AD2 fmul dword ptr [esp]00402AD6 fchs00402AD8 fst dword ptr [esp+4]00402ADC fmul dword ptr [__real@4@3ffcaaaaaaaaaaaaa800 (004101c0)]00402AE2 fxch st(1)00402AE4 faddp st(1),st00402AE6 fld dword ptr [esp+4]00402AEA fmul dword ptr [esp]00402AEE fchs00402AF0 fst dword ptr [esp+4]00402AF4 fmul dword ptr [__real@4@3ff88888888888888800 (004101bc)]00402AFA fxch st(1)00402AFC faddp st(1),st00402AFE fld dword ptr [esp+4]00402B02 fmul dword ptr [esp]00402B06 fchs00402B08 fst dword ptr [esp+4]00402B0C fmul dword ptr [__real@4@3ff2d00d00d00d00d000 (004101b8)]00402B12 fxch st(1)00402B14 faddp st(1),st00402B16 fld dword ptr [esp+4]00402B1A fmul dword ptr [esp]00402B1E fchs00402B20 fst dword ptr [esp+4]00402B24 fmul dword ptr [__real@4@3fecb8ef1d2ab639a000 (004101b4)]00402B2A fxch st(1)00402B2C faddp st(1),st00402B2E fld dword ptr [esp+4]00402B32 fmul dword ptr [esp]00402B36 fchs00402B38 fst dword ptr [esp+4]00402B3C fmul dword ptr [__real@4@3fe5d7322b3faa272000 (004101b0)]00402B42 fxch st(1)00402B44 faddp st(1),st00402B46 fld dword ptr [esp+4]00402B4A fmul dword ptr [esp]00402B4E fchs00402B50 fst dword ptr [esp+4]00402B54 fmul dword ptr [__real@4@3fdeb092309d43684800 (004101ac)]00402B5A fxch st(1)00402B5C faddp st(1),st00402B5E fld dword ptr [esp+4]00402B62 fmul dword ptr [esp]00402B66 fchs00402B68 fst dword ptr [esp+4]00402B6C fmul dword ptr [__real@4@3fd6d73f9f399dc0f800 (004101a8)]00402B72 fxch st(1)00402B74 faddp st(1),st00402B76 fld dword ptr [esp+4]00402B7A fmul dword ptr [esp]00402B7E fchs00402B80 fmul dword ptr [__real@4@3fceca963b81856a5000 (004101a4)]00402B86 fxch st(1)00402B88 faddp st(1),st00402B8A fmul dword ptr [__real@4@400fc350000000000000 (004101a0)]00402B90 call __ftol (00403fd8)581: }00402B95 add esp,0Ch00402B98 ret
|
Here are some clock cycles for various cases:
AMD 1.33GHz
sin(): ~86 cycles
SinRecurse() ~ 22 cycles (with 4 terms)
SinRecurse() ~ 34 cycles (with 6 terms)
SinRecurse() ~ 64 cycles (with 8 terms)
SinRecurse() ~ 85 cycles (with 10 terms)
PentiumIII 866MHz
Sin(): Not tested
SinRecurse() ~ 41 cycles (with 4 terms)
SinRecurse() ~ 60 cycles (with 6 terms)
SinRecurse() ~ 82 cycles (with 8 terms)
SinRecurse() ~ 101 cycles (with 10 terms)
Feel free to come up with a better recursive implementation; I did not spend a lot of time looking for ways to optimize it. It’s interesting to note that, although there is a division in the recurse function (for the factorial), there is no division in the final code, as the compiler pre-calculated the inverse of the factorials, even after I looked at a recursive depth of 50.
So there you have it. Again, make of that what you will…
SS
Edited by - Axter on February 16, 2002 3:44:00 PM