skeletal animation optimization
Hey guys.
I am currently working on a skeletal animation class using OpenGL ES 1.1. I have experimented with many different techniques to achieve optimal rendering efficiency, and I have stumbled upon a small problem during my latest implementation.
As of right now, the class renders the entire buffer using a single glDrawElements call, but the vertex transforms are calculated individually through a routine in the class. I do not use the ModelView matrix to get OpenGL to perform the transforms for me. This is expensive, I know.
So I now want to get OpenGL to do the transforms for me. That immediately means I can no longer render the entire buffer with a single call; I must render the vertices bone by bone. That's fine, but it also means I can have no triangles that span two or more bones; every triangle must be exclusive to a single bone. This kind of geometry is not 3D-artist friendly. Apparently 3D artists don't like modelling human forms bone by bone. Plus, there is the issue of the form breaking at the joints. I was wondering if there is any way to have the best of both worlds: smooth forms that don't break at the joints, and the obvious efficiency of the bone-by-bone rendering method.
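For reference, this is roughly what the class does today (a simplified sketch with made-up variable names, not the real code):

// Simplified sketch (not the actual class): every vertex is transformed on
// the CPU by its bone's 4x4 matrix, then the whole buffer is drawn once.
for (int i = 0; i < numVerts; ++i)
{
    const float *m  = boneMatrix[boneIndexOfVert[i]];  // column-major 4x4
    const float *in = &bindPose[i * 3];
    float *out      = &skinned[i * 3];
    out[0] = m[0]*in[0] + m[4]*in[1] + m[8]*in[2]  + m[12];
    out[1] = m[1]*in[0] + m[5]*in[1] + m[9]*in[2]  + m[13];
    out[2] = m[2]*in[0] + m[6]*in[1] + m[10]*in[2] + m[14];
}

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, skinned);
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);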
That's a tough problem. Personally, I would continue to do the matrix multiplications in the class, but I would write an SSE assembly-language function to do the calculations. That would involve the least amount of code rewriting and would provide a significant speed boost. Given that most people don't know ASM, that may not be suitable for you, though.
The other option that I see is to keep two separate data structures. In the first one, keep all of the 'bones' and the polygons which lie entirely within one bone, then render those in a loop with one iteration for each bone. Keep another data structure full of 'ligament' polygons: all of the polygons which cross over from one 'bone' to another. Do the transformations on those vertices in your C++ code and then render them in OpenGL with a modelview matrix that applies no transformation. You could render all of those triangles at once because you wouldn't need to change the matrix.
So that way would be a good compromise - you could keep your 3D artists happy, still accelerate most of the transformation calculations, and have no gaps in your models.
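In rough OpenGL ES 1.1 terms it would look something like this (just a sketch with placeholder container and field names, not working code):

// 1) Per-bone geometry: let GL apply the bone transform via the modelview matrix.
for (int b = 0; b < numBones; ++b)
{
    glPushMatrix();
    glMultMatrixf(bone[b].matrix);                       // column-major 4x4 bone transform
    glVertexPointer(3, GL_FLOAT, 0, bone[b].bindPoseVerts);
    glDrawElements(GL_TRIANGLES, bone[b].numIndices,
                   GL_UNSIGNED_SHORT, bone[b].indices);
    glPopMatrix();
}

// 2) Ligament polygons: transform these few vertices on the CPU, then draw
//    them all at once with the modelview matrix left unchanged.
for (int i = 0; i < numLigamentVerts; ++i)
    ligamentOut[i] = skinLigamentVertex(ligamentIn[i]);  // your CPU routine

glVertexPointer(3, GL_FLOAT, 0, ligamentOut);
glDrawElements(GL_TRIANGLES, numLigamentIndices,
               GL_UNSIGNED_SHORT, ligamentIndices);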
Richard
Sir Richard,
I have just finished implementing the dual data structure approach you mentioned. There is definitely a performance boost, but not as much as I expected. That is mostly due to the model, which I shall have redone to use the fewest possible ligament triangles. The SSE assembly routine is a good idea, though. I was a little reluctant to implement it, but now that you've mentioned it, I guess I might as well. I'm a newbie, and the experience in assembly will definitely do me good.
Thanks for the tip. I appreciate your prompt response.
Until the next time then ... sire,
(bows out).
Well, if you haven't written assembly code before, you might have a tough time because the learning curve is fairly steep. If you really want to do it, you'll need a copy of the Intel IA-32 Instruction Set Reference (two volumes), available for free as PDFs from intel.com. To get you started, I'll show you some code from my recently finished game Invasion3D (link).
This code doesn't actually do a full matrix multiply; what it does is a rotation about the Z axis, around a center point. My polygons are quads, and there are four vertices for each quad. Another array has one center point for each quad. The code is fairly well commented. The math is like this:

X' = x*cos(theta) - y*sin(theta) - Cx*cos(theta) + Cy*sin(theta) + Cx
Y' = x*sin(theta) + y*cos(theta) - Cx*sin(theta) - Cy*cos(theta) + Cy

where (Cx, Cy) is the center point for that quad. For this code the vertices are in (X, Y, Z, 0) format, single-precision floats. Each quad has a byte value between 0 and 31 which controls how fast and in what direction (CW or CCW) it is spinning, and the sin/cos values are taken from lookup tables. There is also a velocity for each quad which is applied to the vertices; that's how I do the gravity. If you run the game and shoot somebody with a machine gun or cannon, watch how the polygons fly all over the place. This is the code that does the math for that effect.
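In plain C++ the same per-vertex math is just this (a scalar reference version of the formula above, written out for clarity; it is not code from the game):

// Rotate one vertex (x, y) of a quad around the quad's center (cx, cy).
// sinT and cosT are the looked-up sin(theta)/cos(theta) values for the quad.
void rotateAboutCenter(float x, float y, float cx, float cy,
                       float sinT, float cosT,
                       float *outX, float *outY)
{
    *outX = x * cosT - y * sinT - cx * cosT + cy * sinT + cx;
    *outY = x * sinT + y * cosT - cx * sinT - cy * cosT + cy;
}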
Regards,
Richard
;------------------------------------------------------------------------------
; void asmParticleQuadSSE(int iQuads, float (*pfVertex)[3], float *pfCenter,
;                         float *pfVelocity, unsigned char *pucRot,
;                         float (*pfSinCos)[2], float (*pfCosSin)[2],
;                         unsigned int uiFrameTime);
;
; Applies velocity and rotation to particle engine quads using SSE instructions
;
; iQuads      - number of quads to process
; pfVertex    - float (*)[4][4] array of vertex coordinates
; pfCenter    - float (*)[4] array of quad centers
; pfVelocity  - float (*)[4] array of quad velocities
; pucRot      - array of [0,31] indices into sin/cos arrays for angular amount
; pfSinCos    - float[32][2] array of -sin, +cos values
; pfCosSin    - float[32][2] array of +cos, +sin values
; uiFrameTime - number of milliseconds in this frame
;
align 16
asmParticleQuadSSE:
    push      ebp
    mov       ebp, esp
    pushad
    ; load up the registers
    movd      mm0, [ebp+36]
    mov       ecx, [ebp+8]          ; iQuads
    mov       edi, [ebp+12]         ; pfVertex
    punpckldq mm0, mm0
    mov       esi, [ebp+16]         ; pfCenter
    mov       eax, [ebp+20]         ; pfVelocity
    cvtpi2ps  xmm0, mm0
    mov       ebx, [ebp+24]         ; pucRot
    mov       edx, [ebp+28]         ; pfSinCos
    mov       ebp, [ebp+32]         ; pfCosSin
    movlhps   xmm0, xmm0            ; 4 x uiFrameTime
    xorps     xmm3, xmm3            ; clear the high 2 dwords in these registers
    xorps     xmm4, xmm4
    xorps     xmm5, xmm5
    xorps     xmm7, xmm7

pqLoop1:
    ; load up all the registers for the rotation
    push      ebx                   ; pucRot pointer
    movzx     ebx, byte [ebx]       ; ebx == angular index for this quad
    movlps    xmm6, [esi]           ; xmm6 = CenterY, CenterX
    movlps    xmm2, [edi]           ; xmm2 = VertexY, VertexX
    movaps    xmm1, xmm6
    unpcklps  xmm2, xmm2            ; xmm2 = vY, vY, vX, vX
    unpcklps  xmm6, xmm6            ; xmm6 = cY, cY, cX, cX
    movhlps   xmm3, xmm2            ; xmm3 = vY, vY
    movhlps   xmm7, xmm6            ; xmm7 = cY, cY
    movlps    xmm4, [edx+ebx*8]     ; xmm4 = +cos_t, -sin_t
    movlps    xmm5, [ebp+ebx*8]     ; xmm5 = +sin_t, +cos_t
    ; calculate the rotation around the center point for vertex 0 and store
    mulps     xmm3, xmm4
    mulps     xmm2, xmm5
    mulps     xmm7, xmm4
    mulps     xmm6, xmm5
    addps     xmm3, xmm2
    addps     xmm7, xmm6
    addps     xmm3, xmm1
    subps     xmm3, xmm7
    movlps    [edi], xmm3
    ; now do vertex 1
    movlps    xmm2, [edi+16]        ; xmm2 = VertexY, VertexX
    movaps    xmm6, xmm1            ; xmm6 = CenterY, CenterX
    unpcklps  xmm2, xmm2            ; xmm2 = vY, vY, vX, vX
    unpcklps  xmm6, xmm6            ; xmm6 = cY, cY, cX, cX
    movhlps   xmm3, xmm2            ; xmm3 = vY, vY
    movhlps   xmm7, xmm6            ; xmm7 = cY, cY
    ; calculate the rotation around the center point for vertex 1 and store
    mulps     xmm3, xmm4
    mulps     xmm2, xmm5
    mulps     xmm7, xmm4
    mulps     xmm6, xmm5
    addps     xmm3, xmm2
    addps     xmm7, xmm6
    addps     xmm3, xmm1
    subps     xmm3, xmm7
    movlps    [edi+16], xmm3
    ; now do vertex 2
    movlps    xmm2, [edi+32]        ; xmm2 = VertexY, VertexX
    movaps    xmm6, xmm1            ; xmm6 = CenterY, CenterX
    unpcklps  xmm2, xmm2            ; xmm2 = vY, vY, vX, vX
    unpcklps  xmm6, xmm6            ; xmm6 = cY, cY, cX, cX
    movhlps   xmm3, xmm2            ; xmm3 = vY, vY
    movhlps   xmm7, xmm6            ; xmm7 = cY, cY
    ; calculate the rotation around the center point for vertex 2 and store
    mulps     xmm3, xmm4
    mulps     xmm2, xmm5
    mulps     xmm7, xmm4
    mulps     xmm6, xmm5
    addps     xmm3, xmm2
    addps     xmm7, xmm6
    addps     xmm3, xmm1
    subps     xmm3, xmm7
    movlps    [edi+32], xmm3
    ; now do vertex 3
    movlps    xmm2, [edi+48]        ; xmm2 = VertexY, VertexX
    movaps    xmm6, xmm1            ; xmm6 = CenterY, CenterX
    unpcklps  xmm2, xmm2            ; xmm2 = vY, vY, vX, vX
    unpcklps  xmm6, xmm6            ; xmm6 = cY, cY, cX, cX
    movhlps   xmm3, xmm2            ; xmm3 = vY, vY
    movhlps   xmm7, xmm6            ; xmm7 = cY, cY
    ; calculate the rotation around the center point for vertex 3 and store
    mulps     xmm3, xmm4
    mulps     xmm2, xmm5
    mulps     xmm7, xmm4
    mulps     xmm6, xmm5
    addps     xmm3, xmm2
    addps     xmm7, xmm6
    addps     xmm3, xmm1
    subps     xmm3, xmm7
    movlps    [edi+48], xmm3
    ; calculate velocity to apply
    movaps    xmm1, [eax]
    mulps     xmm1, xmm0
    ; load the vertex and center coordinates, add the velocity, and store
    movaps    xmm2, [edi]
    movaps    xmm3, [edi+16]
    movaps    xmm4, [edi+32]
    movaps    xmm5, [edi+48]
    movaps    xmm6, [esi]
    addps     xmm2, xmm1
    addps     xmm3, xmm1
    addps     xmm4, xmm1
    addps     xmm5, xmm1
    addps     xmm6, xmm1
    movaps    [edi], xmm2
    movaps    [edi+16], xmm3
    movaps    [edi+32], xmm4
    movaps    [edi+48], xmm5
    movaps    [esi], xmm6
    ; advance pointers to next quad and loop
    pop       ebx
    add       edi, byte 64
    add       esi, byte 16
    add       eax, byte 16
    add       ebx, byte 1
    dec       ecx
    jne       near pqLoop1

    popad
    pop       ebp
    emms
    ret
Hi Richard,
What did you use to assemble your asm code? I've been doing inline __asm stuff via MSVC 6 and .NET 2003. I just noticed the ';' comments I used to see back in the day with MASM and other assemblers. Do you just build .obj files with a standalone assembler and link them in? Just wondering. Thanks; the code looks great too, by the way.
Quote: Original post by Anonymous Poster
Hi Richard,
What did you use to assemble your asm code? I've been doing inline __asm stuff via MSVC 6 and .NET 2003. I just noticed the ';' comments I used to see back in the day with MASM and other assemblers. Do you just build .obj files with a standalone assembler and link them in? Just wondering. Thanks; the code looks great too, by the way.
I do a lot of assembly coding and optimization work. Personally, I think that inline assembly (__asm for Intel and VC compilers) is rarely the right tool to use, especially if you are supporting multiple platforms (Linux and Win32), because the GCC compiler uses AT&T syntax, which is totally different from the Intel-style syntax used by VC and ICC. Sometimes it's useful, but if you can break the optimized code off into its own function, that's better. For the code shown above I use NASM as the assembler. You can integrate it into VC by creating a new folder in your project. Call it NASM and put your .asm source files into it. Then, for each asm source file, edit the properties on the 'Custom Build Step' page. Under 'Command Line' put this:
nasmw -f win32 -D CFG_WIN32 -o $(IntDir)\$(InputName).obj "$(InputPath)"
You can pass preprocessor definitions to NASM on the command line with -D <name>, as shown above. The CFG_WIN32 definition is used by my code, but if you don't use it you can leave it out. Under 'Outputs' put this:
$(IntDir)\$(InputName).obj
Make sure that you have NASM installed (it's free on the web) and it should build and link just fine.
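On the C/C++ side you then just declare the routine and call it. Something like this should do it (the prototype mirrors the comment header in the asm above; whether you need a leading underscore on the exported symbol depends on your name-decoration settings, so treat this as a sketch):

// Prototype for the NASM-assembled routine, matching its comment header.
extern "C" void asmParticleQuadSSE(int iQuads, float (*pfVertex)[3],
                                   float *pfCenter, float *pfVelocity,
                                   unsigned char *pucRot,
                                   float (*pfSinCos)[2],
                                   float (*pfCosSin)[2],
                                   unsigned int uiFrameTime);

// Typical once-per-frame call (argument names here are placeholders):
// asmParticleQuadSSE(numQuads, quadVerts, quadCenters, quadVelocities,
//                    quadRotIndices, sinCosTable, cosSinTable, frameTimeMs);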
Regards,
Richard
Converting the code to asm is (in my opinion) a large waste of time. Run the code through a profiler to find out what the actual bottlenecks are, rather than relying on guesstimation. (Divisions by zero usually cause the biggest slowdowns; check for those first!)
Converting matrix routines to SSE will give you far faster matrix routines; however, I would be very surprised if that was your bottleneck (i.e. it'll probably make about a 1% difference at best, and that's assuming you can out-optimise a compiler, which is very, very unlikely!).
A far better optimisation is to simply do less work!
For example, if you have a fixed 60fps rate, update half the surfaces one frame and the other half the next frame (i.e. they update at 30fps). The result is a 50% drop in actual work done per frame, which usually results in a 4x increase in frame rate (i.e. balancing the workload across frames makes much better use of CPU and AGP bandwidth, to the extent that normally the surfaces end up updating with twice the frequency). You can also think about using an LOD-based update mechanism (i.e. if it's far away, I'm happy for the animation to update at 7.5 or 15fps). The only issue here is one of scheduling to balance the load as much as possible.
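The staggering itself can be trivial; something along these lines (purely illustrative, the Character type and update call are placeholders for whatever your engine uses):

#include <cstddef>
#include <vector>

struct Character                       // placeholder for your animated mesh
{
    void updateSkinning() { /* recompute bone transforms / skinned verts */ }
};

// Re-skin half the characters on even frames and the other half on odd
// frames, halving the per-frame animation cost.
void updateCharacters(std::vector<Character> &chars, unsigned frameCounter)
{
    for (std::size_t i = 0; i < chars.size(); ++i)
        if ((i & 1u) == (frameCounter & 1u))
            chars[i].updateSkinning();
}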
Simply converting code to asm is a naive and counterproductive optimisation. Setting your compiler to full optimisation with intrinsics enabled will, 99.99% of the time, give you faster output than hand-coded asm (and optimising that last 0.01% will probably only give you a 0.01% increase anyway...).
For real gains in speed,
1. Do less work on the CPU
2. Use a profiler!
Quote: Original post by RobTheBloke
Converting the code to asm is (in my opinion) a large waste of time. Run the code through a profiler to find out what the actual bottlenecks are, rather than relying on guesstimation. (Divisions by zero usually cause the biggest slowdowns; check for those first!)
Converting matrix routines to SSE will give you far faster matrix routines; however, I would be very surprised if that was your bottleneck (i.e. it'll probably make about a 1% difference at best, and that's assuming you can out-optimise a compiler, which is very, very unlikely!).
A far better optimisation is to simply do less work!
I agree with parts of what you've written, but not all. It is important to look for algorithmic improvements, and sometimes big gains can be found in this area. And for scalar (non-MMX or SSE) code, it's true that modern compilers can meet or beat hand-written asm on x86. But that's not true at all for vector code. The only compiler that can vectorize worth a darn is ICC, and I can still beat it by a big margin with hand-written MMX and SSE code. That's actually how I got the job that I've had for the past year: I started with an already optimized H.264 encoder and decoder (two different code bases) and sped both of them up by a factor of 2 by writing MMX and SSE2 code. It can make a big difference.
It's possible that for this individual's project the transformations aren't taking a large portion of the time. As you said, using a profiler will show where the hotspots are and is a necessary step for any optimization effort. But for algorithms that are well suited to vector implementations, such as these transformations, it is very advantageous to write MMX/SSE code if they are taking a large portion of the total execution time. At least, that's what I've found, and I've been working on vector optimizations with VC, ICC, and GCC every workday for the past year. :)
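If raw asm is too steep a jump, compiler intrinsics get you most of the benefit with far less pain. A trivial illustration of the kind of thing I mean (a toy example written for this post, not code from the encoder):

#include <xmmintrin.h>   // SSE intrinsics; supported by VC, ICC, and GCC

// Add a constant (vx, vy, vz, 0) velocity to an array of (x, y, z, w)
// vertices, one whole vertex (four floats) per SSE operation.
// Assumes the arrays are 16-byte aligned.
void addVelocity(float *verts, const float *velocity, int numVerts)
{
    __m128 v = _mm_load_ps(velocity);
    for (int i = 0; i < numVerts; ++i)
    {
        __m128 p = _mm_load_ps(verts + i * 4);  // load one vertex
        p = _mm_add_ps(p, v);                   // add the velocity
        _mm_store_ps(verts + i * 4, p);         // write it back
    }
}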
Richard