sin, cos, since when?
I would assume a lookup table is still faster; what changed, rather, is that the CPU now uses one internally, or at least a bigger one.
Keys to success: Ability, ambition and opportunity.
I'm pretty sure that even on a 50 MHz Amiga with an FPU, sin and cos were faster than a LUT. Not quite sure about that, though.
Perhaps when programming for mobile phones, LUTs would be good again for the moment.
First, use a small lookup table, say 16K or less. Then use it only in inner loops, where addressing the LUT is faster because it replaces several calculations and/or comparisons.
Or am I crazy?!?!?
Here's an idea I was giving someone for alpha blending: http://www.gamedev.net/community/forums/topic.asp?topic_id=120728
Was I wrong about a speed increase?! Even if the code Craazer posted had been better designed (removing the six function calls per pixel).
Yes, I'm old-school C mostly! But I'm working on it!
Check out Intel's site if you are interested in performance on Intel-based computers.
Some Links:
http://www.intel.com/design/PentiumII/manuals/24512701.pdf
See Page 2-21, Transcendental Functions
Mentions that software implementations will NEVER be as fast as the hardware implementations, unless accuracy is sacrificed.
http://www.intel.com/technology/itj/q41999/articles/art_5.htm
Talks about how the transcendental functions are computed and gives some latencies for double precision:
Function   Latency (cycles)   Max. Error (ulps)
cbrt             60                0.51
exp              60                0.51
ln               52                0.53
sin              70                0.51
cos              70                0.51
tan              72                0.51
atan             66                0.51
Also, here is an assembly snippet from an asm function I had lying around:

__asm {
    fld DWORD PTR [esp+4]
    fsin
    ret 4
}
Hopefully this helps.
Wow, how low-level can you go. Those are good sources for getting an idea of what it was like to optimize code for an Intel PII.
Well, cos/sin at 70 cycles... hmm? Versus addressing a double out of a 360-element lookup table... interesting.
Can anyone post some test results for this case?
Sorry, I can't. Don't know how to accurately determine cycles.
Crap. Way back in the day, it took a 486 257-354 cycles to perform FSIN or FCOS. Well, they have made quite an improvement over the years.
[edited by - CodeJunkie on October 28, 2002 8:16:59 PM]
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#ifndef PI
#define PI 3.141592654
#endif

#define ndegrees 100000

int main(){
    long int t0, t1, delLUT, delSIN;
    static double s1n[ndegrees];         /* static so the 800K table isn't on the stack */
    double d2r = PI/(ndegrees*0.5);
    double r2d = ndegrees*0.5/PI;

    for(int i = 0; i < ndegrees; i++){   /* fill the whole table, not just 360 entries */
        s1n[i] = sin(i*d2r);
    }

    volatile double val;                 /* volatile so the loops aren't optimized away */

    t0 = clock();
    for(int j = 0; j < 1000; j++){
        for(int i = 0; i < ndegrees; i++){
            val = s1n[i];
        }
    }
    t1 = clock();
    delLUT = t1 - t0;

    t0 = clock();
    for(int j = 0; j < 1000; j++){
        for(int i = 0; i < ndegrees; i++){
            val = sin(d2r*i);
        }
    }
    t1 = clock();
    delSIN = t1 - t0;

    printf("del time (millisecs) for LUT=%ld and for processor sin=%ld\n processor faster by %f\n",
           delLUT, delSIN, (float)delLUT/(float)delSIN);

    return 0;
}

Can somebody remind me how to post source? Surprisingly to me, LUTs are 6-8 times faster on my 1.2 GHz Duron with DDR memory.

[edited by - kindfluffysteve on October 28, 2002 8:52:17 PM]
Um, an integer-to-double conversion followed by a multiplication doesn't seem to me to accurately reflect what you would actually do in a program. It seems you should at least use a double for that loop.
Keys to success: Ability, ambition and opportunity.
First of all, in contrived examples like that the LUT will win out every time, as that is exactly what data caches are built to handle.
However, in real-world scenarios you do not want your 256K of precious L2 cache taken up by one lookup table. A cache miss is quickly becoming the most expensive operation on a CPU (if it isn't already), and when you have that massive sin table sitting around it will send your cache misses through the roof.
If you are building your app to target the new generation of processors (P3, Athlon, etc.), then sin/cos LUTs are generally more harm than good.
My advice: stay away from them, and if you are using VC++, enable intrinsic functions, which will change a sin(x) call from an actual function call into a straight fsin asm opcode.
I noticed I didn't really explain *why* things changed, however.
Basically, on the older generation of computers (386, 486, etc.), calculating a sin was extremely slow (on the order of hundreds of cycles), and a cache miss wasn't that painful, as execution speeds were still very close to the overall memory speed.
However, in more modern systems RAM is much slower than the processor, and having to go outside the processor's cache (L1 or L2) is a very expensive operation that is to be avoided at all costs.
So, in the old days a 64K sin table was great, as it gave you nearly free sins compared to the dog-slow FPU operations (if you even had an FPU). Slowly, however, processors became much faster than memory, and so caches started gaining in importance. Suddenly that 64K tradeoff didn't look so great when your fsin operation only took 35 cycles or so. And on newer processors it's gotten even better, with on-die lookup tables for the most frequently used math operations (which are nigh on impossible to beat).
As to when the change took place, it was a gradual shift (that is still happening) that started 4 or 5 years ago, IIRC.
But how many clocks does a stall cost? 60-70 clock cycles is a huge amount; a data lookup takes two clocks.
~CGameProgrammer( );