Assembly : When is it worth your time?

Started by May 25, 2004 02:57 PM
101 comments, last by OpenGL_Guru 20 years, 5 months ago
I guess the important message, as always, is to optimize only where needed. The point of using a high-level language in the first place is that you can get much more done in less time, and more reliably as well. Use assembler only where it is needed. That niche is becoming smaller and smaller as compilers get better, but it still exists.

Basically, make sure to profile BEFORE optimizing. Don't spend time and convolute your code by optimizing the wrong parts. Generally a normal program will only have 1 or 2 bottlenecks in very small sections of code (we're talking on the order of 10 lines here). These are easy to recode in assembler to see if they are faster.
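As a rough illustration of measuring before rewriting, here is a minimal timing sketch in C. The names `sample_workload` and `average_seconds` are hypothetical and do not come from this thread, and `clock()` only reports coarse CPU time, so a real profiler will usually tell you more.

```c
#include <time.h>

/* Keeps the compiler from discarding the workload's result. */
static volatile unsigned long sink;

/* Hypothetical workload standing in for a suspected hot spot. */
void sample_workload(void)
{
    unsigned long total = 0;
    for (unsigned long i = 0; i < 100000ul; ++i)
        total += i * i;   /* unsigned arithmetic, so wrap-around is well defined */
    sink = total;
}

/* Returns the average CPU seconds per call of fn over `repetitions` calls. */
double average_seconds(void (*fn)(void), int repetitions)
{
    clock_t start = clock();
    for (int i = 0; i < repetitions; ++i)
        fn();
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / repetitions;
}
```

Timing the C version and the candidate assembler version with the same harness gives a like-for-like comparison before committing to the rewrite.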

In my experience, I will RARELY get a performance boost by changing general code to ASM. I'm not claiming to be the best ASM programmer who ever lived, but all humility aside, I can write some fairly good code. The only place where I get any speed increase is when converting vectorized code to SIMD, and even then I often find that the code is simply cache/memory bandwidth limited, and the SIMD makes no difference at all.

So while I do agree that assembly programming is interesting and sometimes useful, it certainly isn't what it was years ago, when you could get on the order of a 10x speed increase by recoding in a lower-level language.
Why doesn't someone post a small, generic asm program, and then challenge everyone to make a C/C++ version that runs as fast (with a decent compiler)?

Also, is there a simple equivalent to bsf in C?
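For reference, `bsf` returns the index of the lowest set bit. Standard C has no operator for that, but a portable loop does the same job. The sketch below is only an illustration, and `bit_scan_forward` is a made-up name. Compilers also offer intrinsics that typically map to the instruction itself, such as `_BitScanForward` in MSVC or `__builtin_ctz` in GCC.

```c
#include <stdint.h>

/* Returns the index (0..31) of the lowest set bit in value,
   or -1 if value is zero (bsf leaves its destination undefined in that case). */
int bit_scan_forward(uint32_t value)
{
    if (value == 0)
        return -1;

    int index = 0;
    while ((value & 1u) == 0) {   /* shift right until the low bit is set */
        value >>= 1;
        ++index;
    }
    return index;
}
```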
quote: Original post by Tree Penguin
quote: Original post by RipTorn
int values[5000][5000];

void doSomething()
{
    for (int y=0; y<5000; y++)
        for (int x=0; x<5000; x++)
            values[x][y]++;
}

would actually be _extremely_ inefficient code... and the compiler would likely have to be very smart to optimize that... Yet I'd bet that occurs in a lot of people's code. This code is not very efficient with the stack, but far more importantly it's hugely inefficient with the cache. The thing is it would be hard to write it with the same inefficiency in ASM...

My problem is that some of my algorithms are ugly but also more complex than this one, so it would be hard to optimise those in ASM. However, I do think it might make a difference (and probably not just a small one) because of the use of nested loops and such.



[edited by - Tree Penguin on June 3, 2004 5:32:21 AM]

int values[5000][5000];

void doSomething()
{
    for (int y=0; y<5000; ++y)
        for (int x=0; x<5000; ++x)
            ++values[x][y];
}

Is that any more efficient?
How would you optimize that code to make it as efficient as possible?
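One common answer is to swap the two loops so that the inner loop walks memory contiguously. C stores `values[x][y]` and `values[x][y+1]` next to each other, so `y` should be the innermost index. The sketch below is illustrative only: `GRID_SIZE` and both function names are hypothetical, and the array is shrunk from 5000 to keep the example small.

```c
#define GRID_SIZE 512   /* hypothetical; the posted code uses 5000 */

static int values[GRID_SIZE][GRID_SIZE];

/* As posted: the inner loop changes the first index, so consecutive
   accesses are GRID_SIZE * sizeof(int) bytes apart. */
void increment_column_order(void)
{
    for (int y = 0; y < GRID_SIZE; ++y)
        for (int x = 0; x < GRID_SIZE; ++x)
            ++values[x][y];
}

/* Loops interchanged: the inner loop changes the last index, so
   consecutive accesses touch adjacent ints in memory. */
void increment_row_order(void)
{
    for (int x = 0; x < GRID_SIZE; ++x)
        for (int y = 0; y < GRID_SIZE; ++y)
            ++values[x][y];
}
```

Both functions produce the same result. Only the order in which memory is touched differs, and that order is what the cache cares about.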


quote: Original post by DBX
Why doesn't someone post a small, generic asm program, and then challenge everyone to make a C/C++ version that runs as fast (with a decent compiler)?

Also, is there a simple equivalent to bsf in C?


    __asm
    {
        femms
        mov         eax,[a]
        mov         ecx,[b]
        movq        mm0,[eax]       ; mm0 = ax|aw
        movq        mm1,[ecx]       ; mm1 = bx|bw

        pfmul       mm0,mm1         ; mm0 = ax*bx|aw*bw
        movq        mm2,[eax+8]     ; mm2 = az|ay
        movq        mm3,[ecx+8]     ; mm3 = bz|by

        mov         edx,[r]
        movq        mm5,mm1         ; mm5 = bx|bw

        movq        mm4,mm0         ; mm4 = ax*bx|aw*bw
        punpckhdq   mm0,mm0         ; mm0 = ax*bx|ax*bx

        pfmul       mm2,mm3         ; mm2 = az*bz|ay*by
        pfsub       mm4,mm0         ; mm4 = ???|aw*bw-ax*bx

        punpckhdq   mm5,mm5         ; mm5 = bx|bx
        movq        mm6,mm3         ; mm6 = bz|by

        pfacc       mm2,mm2         ; mm2 = ???|az*bz+ay*by
        punpckldq   mm5,mm1         ; mm5 = bw|bx

        pfmul       mm5,[eax]       ; mm5 = ax*bw|aw*bx
        punpckhdq   mm6,mm6         ; mm6 = bz|bz

        pfsub       mm4,mm2         ; mm4 = ???|rw
        movq        mm7,[eax+8]     ; mm7 = az|ay

        pfacc       mm5,mm5         ; mm5 = ???|ax*bw+aw*bx
        punpckldq   mm6,mm3         ; mm6 = by|bz

        pfmul       mm6,mm7         ; mm6 = az*by|ay*bz
        movd        [edx],mm4       ; mm4 = ???|rw

        movd        mm2,[eax]       ; mm2 = ???|aw
        movq        mm0,mm1         ; mm0 = bx|bw
        pfadd       mm5,mm6         ; mm5 = ???|ax*bw+aw*bx+ay*bz
        punpckhdq   mm6,mm6         ; mm6 = az*by|az*by

        punpckldq   mm0,mm0         ; mm0 = bw|bw
        punpckldq   mm2,mm2         ; mm2 = aw|aw

        pfsub       mm5,mm6         ; mm5 = ???|rx
        pfmul       mm0,mm7         ; mm0 = az*bw|ay*bw

        punpckhdq   mm7,[eax+0]     ; mm7 = ax|az
        pfmul       mm2,mm3         ; mm2 = bz*aw|by*aw

        punpckhdq   mm1,mm1         ; mm1 = bx|bx
        movd        [edx+4],mm5     ; mm5 = ???|rx
        pfadd       mm2,mm0         ; mm2 = az*bw+bz*aw|by*aw+ay*bw
        punpckldq   mm1,mm3         ; mm1 = by|bx
        pfmul       mm7,mm1         ; mm7 = ax*by|az*bx
        movd        mm4,[eax+4]     ; mm4 = ???|ax
        punpckhdq   mm3,[ecx+8]     ; mm3 = bx|bz
        pfadd       mm2,mm7         ; mm2 = az*bw+bz*aw+ax*by|by*aw+ay*bw+az*bx
        punpckldq   mm4,[eax+4]     ; mm4 = ay|ax
        pfmul       mm3,mm4         ; mm3 = bx*ay|bz*ax
        pfsub       mm2,mm3         ; mm2 = rz|ry
        movq        [edx+8],mm2     ; mm2 = rz|ry
        femms
    }


This is quaternion multiplication using 3DNow!.
Give it your best.

The code is not so generic (it is AMD only), but you see the point of assembler.

The code is inline assembly inside this function:
void _mult_quat (D3DRMQUATERNION *r, const D3DRMQUATERNION *a, const D3DRMQUATERNION *b)
{
    __asm
    {
        .... code
    }
}
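For anyone taking up the challenge, a plain C baseline might look like the sketch below. It computes the Hamilton product that the register comments in the listing describe (rw, rx, ry, rz). The `Quat` struct and `quat_multiply` are hypothetical names; the w, x, y, z field order is inferred from those comments and is not the real `D3DRMQUATERNION` definition.

```c
/* Hypothetical layout: scalar part first, then the vector part,
   matching the aw/ax/ay/az order implied by the asm comments. */
typedef struct Quat {
    float w, x, y, z;
} Quat;

/* r = a * b (Hamilton product). */
void quat_multiply(Quat *r, const Quat *a, const Quat *b)
{
    Quat result;
    result.w = a->w * b->w - a->x * b->x - a->y * b->y - a->z * b->z;
    result.x = a->w * b->x + a->x * b->w + a->y * b->z - a->z * b->y;
    result.y = a->w * b->y - a->x * b->z + a->y * b->w + a->z * b->x;
    result.z = a->w * b->z + a->x * b->y - a->y * b->x + a->z * b->w;
    *r = result;   /* stored last, so r may alias a or b */
}
```

Whether a compiler turns this into something competitive with the hand-scheduled version is exactly what the challenge asks, so it is worth timing both rather than guessing.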


P.S.
I am sorry if this code is a disgrace to ASM code, but I can't do any better.
Red Drake
quote: Original post by RipTorn
I'd just like to add something extra:

quote:

No game could possibly be written in something like C# and look nice + go fast + be complex (on a PC at 1800/2000 MHz, which is normal today). If you don't trust me, try running the DX9 SDK samples for C++ & C# and then see which goes faster.




Sorry, but I believe that to be wrong.

I forget the name of the company, but a while ago they compiled the Quake 2 source code using managed C++ (C++ for .NET), i.e., the same byte code that C# will produce. They stated the performance was approximately 85% of the original C/asm version. In my opinion that is more than acceptable for a just-in-time compiled language. And as time goes on and the .NET framework is optimized, and new features are supported, this margin will shrink further (if not reverse).
That said, Quake 2 is a very good testing example since its BSP algorithms are extremely CPU dependent... look into the code and every single triangle is drawn with a series of glBegin(GL_POLYGON)...glEnd() calls and goes through all the appropriate BSP culling... So with that, considering these GL calls were likely running through a third-party .NET GL library, you may just have your 15% difference there alone. (Plus you get all the significant advantages of the .NET runtime, which is easily worth the 15%.)


The code (original Quake 2) was compiled without any SIMD.
Why would anybody use assembler without SIMD? Today's C++ compilers are good enough for anything else.

What are the "advantages of the .NET runtime"???

P.S.
Don't think I wrote the code above. It was extracted from the AMD SDK (I could write it, but there is no point in reinventing it).

[edited by - Red Drake on June 3, 2004 2:50:17 PM]
Red Drake
Maybe someone should do another experiment:
Take the Q1 source, set the rendering engine to use C, not ASM, and then convert it to C#.
See how fast it is compared to the ASM version.
It would be pretty difficult to write a faster version of this strcpy function in C or any other language. It assumes pure 7-bit ASCII. Using it as inline asm rather than as a separately assembled asm routine adds a little overhead (a push/pop of ebp and a pair of mov esp, ebp ops).

-steven

char *Strcpy(char *to, const char* from)
{
    __asm
    {
        mov edx, to
        mov ebx, from
    TOP:
        mov eax, [ebx]
        mov ecx, eax
        sub ecx, 0x01010101
        and ecx, 0x80808080
        jnz GOT_NULL
        mov [edx], eax
        /* Unroll the loop a little */
        mov eax, [ebx+4]
        mov ecx, eax
        sub ecx, 0x01010101
        and ecx, 0x80808080
        jnz GOT_NULL_4
        mov [edx+4], eax
        /* a little more */
        mov eax, [ebx+8]
        mov ecx, eax
        sub ecx, 0x01010101
        and ecx, 0x80808080
        jnz GOT_NULL_8
        mov [edx+8], eax
        /* one last time */
        mov eax, [ebx+12]
        mov ecx, eax
        sub ecx, 0x01010101
        and ecx, 0x80808080
        jnz GOT_NULL_12
        mov [edx+12], eax
        /* done unrolling */
        add ebx, 16
        add edx, 16
        jmp TOP
        /* terminator found in an unrolled word: advance edx to that word */
    GOT_NULL_12:
        add edx, 4
    GOT_NULL_8:
        add edx, 4
    GOT_NULL_4:
        add edx, 4
    GOT_NULL:
        /* eax holds the word with the terminator; copy it byte by byte */
        mov [edx], al
        test al, al
        je END

        inc edx
        shr eax, 8
        mov [edx], al
        test al, al
        je END

        inc edx
        shr eax, 8
        mov [edx], al
        test al, al
        je END

        /* the first three bytes were non-zero, so the fourth is the terminator */
        inc edx
        mov byte ptr [edx], 0
    END:
    }
    return to;
}
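The core trick in the routine above is the `sub 0x01010101` / `and 0x80808080` pair, which checks four bytes for a zero at once. Here is a small C sketch of that test. The function names are hypothetical, and the second variant is the widely known general form with an extra `& ~word` term, not something taken from the post.

```c
#include <stdint.h>

/* The test as used in the asm: non-zero result if any byte of word is 0x00.
   Valid only when every byte is 7-bit ASCII (below 0x80); a byte of 0x81
   or above also sets the high bit after the subtraction. */
int has_zero_byte_ascii(uint32_t word)
{
    return ((word - 0x01010101u) & 0x80808080u) != 0;
}

/* General variant: masking with ~word discards bytes whose high bit was
   already set, so it works for arbitrary byte values. */
int has_zero_byte(uint32_t word)
{
    return ((word - 0x01010101u) & ~word & 0x80808080u) != 0;
}
```

A C string copy built on this would still need to deal with alignment and with reading past the terminator, which the asm version simply does.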
How do I post my code in a nicely formatted white box, as others have done, instead of as unformatted text like my asm source above?

-steven
quote: Original post by unicityd
How do I post my code in a nicely formatted white box, as others have done, instead of as unformatted text like my asm source above?

-steven


Use:
[ source ]
at the beginning

[ /source ]
at the end
(don't put spaces between [ and the words)

See the FAQ about this (in the beginners forum, I think).

And do you think that any compiler could write a quaternion multiplication for AMD 3DNow! better than that (in my previous post)??

[edited by - Red Drake on June 3, 2004 5:38:03 PM]
Red Drake
Red Drake,

I'm not familiar with 3DNow!, but assuming that AMD's own programmers know assembly fairly well and how to use 3DNow! correctly, no.

When properly used, assembly is the fastest language. It's not always the best or the safest, but it is the fastest. A good compiler may be able to produce better code than average asm programmers, but a good assembly programmer will always be able to make better decisions.

If an assembly programmer can take the compiler-generated asm and save even one clock cycle, he wins. It may not be worth the effort to save so little but, as I said, assembly isn't always the best language to use. At best, a *perfect* compiler can match the best known assembly language implementation techniques.

This topic is closed to new replies.
