Advertisement

x*1/y vs. x/y

Started by March 01, 2001 10:18 AM
14 comments, last by Promiscuous Robot 23 years, 11 months ago
How will the CPU guess the answer?



There is no off position on
the genius switch
-=-=-=-A M I N O T M E R C I F U L ! ? ! ?
It caches certain values that it uses repeatidly, and tries to use those for future calculations, and will try to optimize repeated calculations by reusing them previous to all of the information being available. Go to intel.com and read about prediction and the cache.

"Finger to spiritual emptiness underlying everything." -- How a C manual referred to a "pointer to void." --Things People Said
Resist Windows XP''s Invasive Production Activation Technology!
http://www.gdarchive.net/druidgames/
Advertisement
You need to run the test without f1 and f2 actually doing anything and subtract out the times from your test. Since the loops are the same it won''t change which is faster, but it will affect how much faster the fastest one is.
Keys to success: Ability, ambition and opportunity.
re: caching
Hence the "rand" values. No cache prediction possible.

re: more accurate measures
You''re welcome to do so if you would like. I have convinced myself that one method is faster if there are 2+ divisions by the same value. I''d be curious to see contradicting evidence, but I don''t feel the urge to delve into this further.
the best solution for that program would be to put each test in a seperate program, and run each program several times, and average the results. This would avoid any problems with jump prediction and the cache

-arsenius
''after three days without programming, life becomes meaningless'' -The Tao of Programming
I was interested a little more detail so I did some testing. I ran six tests. These can be broken into two sets. The first three assign a random value to d in each iteration. The second set assigns a random value to d once before the loop starts. The first test in both sets calls a function that does nothing. The second one calls a function doing two divides and the third does two multiplies. The third one in the first set also does a divide to get the reciprical where that is done before the loop in the second set. Rather than using 100,000,000 iterations as Stoffel did I cut it down to 10,000,000. Although this loses a digit of significance on the times I felt that being correct to three digits was close enough. The timer I was using seems to be accurate to 1/800th so I rounded to the nearest 1/100th. I also ran the test four times to verify the results were reasonably repeatable. I used rand()+1.0 for d rather than rand() to avoid divide by zero errors. Finally I seeded the random number generator before each test with the same value so that each test would be dealing with the same series of numbers. The idea was that the FPU may check for certain common values or patterns before doing the full operation. The following is the results from the testing:

  Test1-3: Variable dTest1: 14.93 14.93 14.93 14.94Test2: 17.04 17.05 17.04 17.05Test3: 16.47 16.51 16.48 16.47Test4-6: constant dTest4: 10.02 10.02 10.02 10.01Test5: 12.03 12.04 12.03 12.04Test6: 10.58 10.59 10.59 10.57Mins: (T1) 14.93 (T2) 17.04 (T3) 16.47 (T4) 10.01 (T5) 12.03 (T6) 10.57T2-T1(D1):  2.11T3-T1(D2):  1.54T5-T4(D3):  2.02T6-T4(D4):  0.56T1-T4(D5):  4.92T2-T5(D6):  5.01T3-T6(D7):  5.90D7-D5(D8):  0.98D7-D6(D9):  0.89D2-D4(D10): 0.98T1-3*D5(D11): 0.22100*D5/T1(P1): 32.95%100*D1/T2(P2): 12.38%100*D2/T3(P3):  9.35%100*D3/T5(P4): 16.79%100*D4/T6(P5):  5.30%100*D1/D2-100(P6):  37.01%100-100*D2/D1(P7):  27.01%100*D3/D4-100(P8): 260.71%100-100*D4/D3(P9):  72.28%  


I used the mins because each pass did exactly the same work. The only variable was what other processes did. Those with lower times got interupted less than those with higher times. Another way of looking at it is that the work can be done in the minimum time it took to do the work since it was done in that length of time at least once. Rounding is another matter as well as the resolution of the timer, but the goal is really only to be within +/- 1% of the actual value. The values are easily within that range.

D1 thru D4 shows the amount of time directly attributable to the calculations. D5 thru D7 shows the effect of moving the third call to rand out of the loop. D5 shows the amount of time spent calling the third rand. I''m unsure why D6 is not equal to D5, but I assume it is because some divisions are faster than others. It would take a bit of work to actually prove that though and the differance is negligible. There is a somewhat more significant differance between D7 and D5 given by D8. This differance could be strictly the amount of time spent calculating the reciprical, but if the division went faster then possibly the multiplication by the reciprical went faster as well. So reasonably it could be somewhere between D8 and D9. The differance between D2 and D4, given by D10, provides a tie breaker though. It is roughly equal to D8. Since the values are rounded I can''t say they are precisely equal, but they are within a hundredth of a second of one another.

The reason I felt it was important to eliminate the overhead was the D5 value. The value for P1 shows that just one call to rand took 32.95% of the overhead. The three calls combined could reasonably be expected to be 98.95% of the overhead. Normally I wouldn''t worry about the looping overhead since reasonably you would have to be looping to make an optimization worthwhile. You most likely wouldn''t be calling rand. D11 shows that the overhead would drop to somewhere near 0.22 without the calls to rand. Since there were 10 million iteration that says you could loop and call a function that does nothing somewhere around 45 million times a second. That seems a little high, but a reasonable number.

As far as percentages P2 shows that only 12.38% of the time spent on Test2 was spent actually calculating the values if you consider the FLD and FSTP an integral part of the calculation. P3 shows that only 9.35% of the time spent on Test3 was spent actually calculating the values. P4 and P5 shows the corresponding values for Test5 and Test6. Within Test2, Test3, Test5 and Test6 the main thing you are timing is the overhead. If you compare the time of the tests it seems like a small differance, but that is because the differance is drown out by the overhead of the rand calls. P6 shows that the differance between dividing two values by a third values takes 37.01% longer than calculating the reciprical and multipling both values by the reciprical. That is more than an order of magnitude differant than the 3.46% longer for the overall tests. With the overhead averaging over 90% that is what you would expect. Without the rand overhead you would get a percentage of around 32.39% assuming the estimate for the rest of the overhead given by D11 is accurate. That is low, but a far more accurate indication of how much effect the change has in a real application. 3.46% could lead you to believe improvement would be trivial when it might in fact get you to your performance goal.

The second set of test show basically what you could expect from a loop where your divisor is a constant. Here P8 shows that it takes 2.6 times or 260% longer to do the two divisions than the two multiplies. That tells you that a division takes roughly 2.6 times longer than a division. The differance is actually greater than that because of the overhead of the FLD and FSTP. So if you say FMUL is a 1 and FDIV is a 3.6 then the reciprical costs 3.6 and the two multiplies cost 1 giving a total of 5.6 where the two divides cost 7.2. That would say it takes about 27.58% longer which matchs reasonably close to 37.01% from P6. If you solve 2x/(x+2Y)=1.3701 for x you get 4.35 for a cost of the divide relative to the multiply. Roughly averaging the two you could say it takes about 4 times longer to do a divide as a multiply.

Not that anyone but me cares, but writing this helps me distill the test results into something I can use as a rule of thumb. Since I wrote it I figured I might as well post it.
Keys to success: Ability, ambition and opportunity.

This topic is closed to new replies.

Advertisement