quaternion to matrix, < 13 mults?
oops: 1 << (5) == 32
With integers, add/sub is way faster than multiply, but with floats I believe the relative difference is smaller.
Is multiplication way slower than add/sub? To test that assumption I ran three tests. Each allocated a 1,000,000 by 3 array. The first filled the first two columns with random numbers, read the time, set the third column to the product of the other two one hundred times, and read the time again. That is, it swept the table from start to end 100 times, as opposed to doing the first row 100 times, then the second row, and so on. The second test did exactly the same thing with an add instead, and the third did nothing. I got 1.92, 1.76 and 0.55 seconds. After subtracting the 0.55 s of loop overhead, that says multiply took about 13.2% longer. That isn't "way slower" to me; I would call it marginal. Admittedly the test is memory bound, and it has plenty of other limitations, but those are the results I got. Multiplication lends itself well to being done in parallel. When transistors were scarce and expensive there was a huge difference, but it seems pretty marginal now.
As for testing whether using an array matters: use the CPU view in the debugger to see the generated assembly. Assuming you have debug information, there will be comments showing you the source line followed by the generated assembly. If you have optimization turned on, there isn't necessarily a nice, neat correspondence between the source and the generated code.
I don't think it is going to make an improvement. I recently changed versions of my development tool, so I don't know how much is the version change and how much is befuddled memory, but I tested a matrix multiply using two 4 x 4 arrays versus a structure. The 4 x 4 arrays actually produced the extra instructions. I would swear I had the opposite result before. Oh well, that's why assumptions are only good for designing a test and not for predicting its result.
Keys to success: Ability, ambition and opportunity.
On many modern processors FP add and multiply are as quick as each other, so minimising the number of multiplications can make things worse if it increases the number of additions and other FP operations.
Another factor is that many processors also have a multiply-add instruction that is as fast as a single multiply. This greatly accelerates much vector, matrix and quaternion code, as much of it relies on sums of products done as quickly as possible. But it does require some thought (or hand-coded assembly) to take full advantage of it.
But the biggest speedup can be got from taking advantage of the parallel/vector/SSE units in the processors in all PCs (and all 'next gen' game consoles). These can perform such operations an order of magnitude faster than code on an FPU, and so are the only way to go for performance-critical code. Unfortunately this is far harder to do, as it means hand-coding assembler for each processor architecture, but the benefits are usually well worth it.
John Blackburne, Programmer, The Pitbull Syndicate
quote:
Original post by johnb
Another factor is that many processors also have a multiply-add instruction that is as fast as a single multiply.
What is it called? I can't find any documentation about this instruction.
"take a look around" - limp bizkit
www.google.com
If that's not the help you're after then you're going to have to explain the problem better than what you have. - joanusdmentia
My Page davepermen.net | My Music on Bandcamp and on Soundcloud
Try the PMADDWD instruction.
Hm, not that useful for floats, is it? MMX is integer math, if I'm not mistaken.
"take a look around" - limp bizkit
www.google.com

"take a look around" - limp bizkit
www.google.com
If that's not the help you're after then you're going to have to explain the problem better than what you have. - joanusdmentia
My Page davepermen.net | My Music on Bandcamp and on Soundcloud
This topic is closed to new replies.