I guess the important message, as always, is to optimize only where needed. The point of a high-level language to start with is that you can get much more done in less time, and more reliably as well. Use assembler only where it is needed. That niche is becoming smaller and smaller as compilers get better, but it still exists.
Basically, make sure to profile BEFORE optimizing. Don't spend time and convolute your code by optimizing the wrong parts. Generally a normal program will have only 1 or 2 bottlenecks in very small sections of code (we're talking on the order of 10 lines here). These are easy to recode in assembler to see if they are faster.
In my experience, I will RARELY get a performance boost by changing general code to ASM. I'm not claiming to be the best ASM programmer who ever lived, but all humility aside, I can write some fairly good code. The only place where I get any speed increase is when converting vectorized code to SIMD, and even then I often find that the code is simply cache/memory-bandwidth limited, and the SIMD makes no difference at all.
So while I do agree that assembly programming is interesting and sometimes useful, it certainly isn't what it was years ago, when you could get on the order of a 10x speed increase by recoding in a lower-level language.
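To make the "profile first" advice concrete, here is a minimal timing sketch in C. It is not a replacement for a real profiler; it only shows how one might time a suspected hot spot before and after a rewrite. The names sample_workload and time_calls are made up for this illustration, and clock() measures CPU time at fairly coarse resolution.

```c
#include <time.h>

/* Hypothetical stand-in for the code under suspicion. */
static volatile unsigned long sink;

static void sample_workload(void)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000UL; ++i)
        sum += i * i;
    sink = sum;   /* volatile store keeps the result from being discarded */
}

/* CPU seconds spent in `repetitions` calls of fn, measured with clock(). */
static double time_calls(void (*fn)(void), int repetitions)
{
    clock_t start = clock();
    for (int i = 0; i < repetitions; ++i)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(sample_workload, 100) before and after a change gives a rough comparison; repeat the measurement a few times, since single runs are noisy.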
Assembly : When is it worth your time?
Why doesn't someone post a small, generic asm program, and then challenge everyone to make a C/C++ version that runs as fast (with a decent compiler)?
Also, is there a simple equivalent to bsf in C?
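On the bsf question: standard C has no operator for it, but compilers expose it. A hedged sketch follows, assuming GCC or Clang for the builtin, with a plain loop as fallback elsewhere. The function name bit_scan_forward is made up here. On MSVC the _BitScanForward intrinsic from <intrin.h> does the same job.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit, like x86 BSF. The instruction leaves its
   result undefined for 0, so this sketch rejects 0 with an assert. */
static unsigned bit_scan_forward(uint32_t x)
{
    assert(x != 0);
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_ctz(x);   /* count trailing zero bits */
#else
    unsigned index = 0;
    while ((x & 1u) == 0) {              /* portable fallback: shift until bit 0 is set */
        x >>= 1;
        ++index;
    }
    return index;
#endif
}
```

With optimization on, the builtin path normally turns into a single bit-scan instruction, so there is little reason to drop to asm for this one.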
quote: Original post by Tree Penguin
quote: Original post by RipTorn
int values[5000][5000];
void doSomething()
{
    for (int y=0; y<5000; y++)
        for (int x=0; x<5000; x++)
            values[x][y]++;
}
would actually be _extremely_ inefficient code... and the compiler would likely have to be very smart to optimize that... Yet I'd bet that occurs in a lot of people's code. This code is not very efficient with the stack, but far more importantly it's hugely inefficient with the cache. The thing is, it would be hard to write it with the same inefficiency in ASM...
My problem is that some of my algorithms are ugly but also more complex than this one, so it would be hard to optimise those in ASM. However, I do think it might make a difference (and probably not just a small one) because of the use of nested loops and such.
[edited by - Tree Penguin on June 3, 2004 5:32:21 AM]
int values[5000][5000];
void doSomething()
{
    for (int y=0; y<5000; ++y)
        for (int x=0; x<5000; ++x)
            ++values[x][y];
}
Is that any more efficient?
How would you optimize that code to make it as efficient as possible?
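For plain ints, ++x versus x++ should make no difference to the generated code. The usual fix for the cache problem is to swap the loops so that the inner loop walks memory contiguously. A minimal sketch follows, with a smaller array than in the post (N = 512 is just an assumption to keep it light) and made-up function names:

```c
enum { N = 512 };          /* the post uses 5000; smaller here for illustration */
static int grid[N][N];

/* Original order: the inner loop changes the FIRST index, so each step
   jumps N * sizeof(int) bytes and touches a different cache line. */
static void increment_strided(void)
{
    for (int y = 0; y < N; ++y)
        for (int x = 0; x < N; ++x)
            ++grid[x][y];
}

/* Swapped order: the inner loop changes the LAST index, so consecutive
   iterations touch adjacent ints (C arrays are stored row by row). */
static void increment_contiguous(void)
{
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < N; ++y)
            ++grid[x][y];
}
```

Both functions do the same work; only the memory access pattern differs. How big the gap is depends on the array size and the cache, so it is worth timing both on the target machine.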
quote: Original post by DBX
Why doesn't someone post a small, generic asm program, and then challenge everyone to make a C/C++ version that runs as fast (with a decent compiler)?
Also, is there a simple equivalent to bsf in C?
__asm {
    femms
    mov         eax,[a]
    mov         ecx,[b]
    movq        mm0,[eax]       ; mm0 = ax|aw
    movq        mm1,[ecx]       ; mm1 = bx|bw
    pfmul       mm0,mm1         ; mm0 = ax*bx|aw*bw
    movq        mm2,[eax+8]     ; mm2 = az|ay
    movq        mm3,[ecx+8]     ; mm3 = bz|by
    mov         edx,[r]
    movq        mm5,mm1         ; mm5 = bx|bw
    movq        mm4,mm0         ; mm4 = ax*bx|aw*bw
    punpckhdq   mm0,mm0         ; mm0 = ax*bx|ax*bx
    pfmul       mm2,mm3         ; mm2 = az*bz|ay*by
    pfsub       mm4,mm0         ; mm4 = ???|aw*bw-ax*bx
    punpckhdq   mm5,mm5         ; mm5 = bx|bx
    movq        mm6,mm3         ; mm6 = bz|by
    pfacc       mm2,mm2         ; mm2 = ???|az*bz+ay*by
    punpckldq   mm5,mm1         ; mm5 = bw|bx
    pfmul       mm5,[eax]       ; mm5 = ax*bw|aw*bx
    punpckhdq   mm6,mm6         ; mm6 = bz|bz
    pfsub       mm4,mm2         ; mm4 = ???|rw
    movq        mm7,[eax+8]     ; mm7 = az|ay
    pfacc       mm5,mm5         ; mm5 = ???|ax*bw+aw*bx
    punpckldq   mm6,mm3         ; mm6 = by|bz
    pfmul       mm6,mm7         ; mm6 = az*by|ay*bz
    movd        [edx],mm4       ; mm4 = ???|rw
    movd        mm2,[eax]       ; mm2 = ???|aw
    movq        mm0,mm1         ; mm0 = bx|bw
    pfadd       mm5,mm6         ; mm5 = ???|ax*bw+aw*bx+ay*bz
    punpckhdq   mm6,mm6         ; mm6 = ay*bz|ay*bz
    punpckldq   mm0,mm0         ; mm0 = bw|bw
    punpckldq   mm2,mm2         ; mm2 = aw|aw
    pfsub       mm5,mm6         ; mm5 = ???|rx
    pfmul       mm0,mm7         ; mm0 = az*bw|ay*bw
    punpckhdq   mm7,[eax+0]     ; mm7 = ax|az
    pfmul       mm2,mm3         ; mm2 = bz*aw|by*aw
    punpckhdq   mm1,mm1         ; mm1 = bx|bx
    movd        [edx+4],mm5     ; mm5 = ???|rx
    pfadd       mm2,mm0         ; mm2 = az*bw+bz*aw|by*aw+ay*bw
    punpckldq   mm1,mm3         ; mm1 = by|bx
    pfmul       mm7,mm1         ; mm7 = ax*by|az*bx
    movd        mm4,[eax+4]     ; mm4 = ???|ax
    punpckhdq   mm3,[ecx+8]     ; mm3 = bx|bz
    pfadd       mm2,mm7         ; mm2 = az*bw+bz*aw+ax*by|by*aw+ay*bw+az*bx
    punpckldq   mm4,[eax+4]     ; mm4 = ay|ax
    pfmul       mm3,mm4         ; mm3 = bx*ay|bz*ax
    pfsub       mm2,mm3         ; mm2 = rz|ry
    movq        [edx+8],mm2     ; mm2 = rz|ry
    femms
}
Quaternion multiplication using 3DNow!.
Give it your best.
The code is not so generic (it is AMD only), but you see the point of assembler.
The code is inline assembly inside this function:
void _mult_quat (D3DRMQUATERNION *r, const D3DRMQUATERNION *a, const D3DRMQUATERNION *b)
{
    __asm
    {
        .... code
    }
}
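For anyone taking up the challenge, a plain C version of the same product might look like the sketch below. The formulas are read off the register comments in the 3DNow! listing (the standard Hamilton product). The Quat struct and the name mult_quat_c are stand-ins invented for this example, not the real D3DRMQUATERNION type.

```c
/* Hypothetical stand-in for D3DRMQUATERNION: scalar part w, vector part x, y, z. */
typedef struct { float w, x, y, z; } Quat;

/* r = a * b (Hamilton product). The result is built in a temporary first,
   so r may point at the same object as a or b. */
static void mult_quat_c(Quat *r, const Quat *a, const Quat *b)
{
    Quat t;
    t.w = a->w * b->w - a->x * b->x - a->y * b->y - a->z * b->z;
    t.x = a->w * b->x + a->x * b->w + a->y * b->z - a->z * b->y;
    t.y = a->w * b->y + a->y * b->w + a->z * b->x - a->x * b->z;
    t.z = a->w * b->z + a->z * b->w + a->x * b->y - a->y * b->x;
    *r = t;
}
```

That is 16 multiplies and 12 adds/subtracts in scalar form. Whether a given compiler vectorizes it is something to check in the generated assembly rather than assume.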
P.S.
I am sorry if this code is a disgrace to ASM code, but I can't do any better.
Red Drake
quote: Original post by RipTorn
I'd just like to add something extra:
quote:
No game could possibly be written in something like C# and look nice + go fast + be complex (on a PC at 1800/2000 MHz, which is normal today). If you don't trust me, try running the DX9 SDK samples for C++ & C# and then see which goes faster.
Sorry, but I believe that to be wrong.
I forget the name of the company, but a while ago they compiled the Quake2 source code using managed C++ (C++ for .NET), i.e. the same byte code that C# will produce. They stated the performance was approximately 85% of the original C/asm version. In my opinion that is more than acceptable for a just-in-time compiled language. And as time goes on, the .NET framework is optimized, and new features are supported, this margin will shrink further (if not reverse).
That said, Quake2 is a very good test case since its BSP algorithms are extremely CPU dependent... look into the code and every single triangle is drawn with a series of glBegin(GL_POLYGON)...glEnd() calls and goes through all the appropriate BSP culling... So, considering these GL calls were likely running through a third-party .NET GL library, you may just have your 15% difference there alone. (Plus you get all the significant advantages of the .NET runtime, which are easily worth the 15%.)
The code (the original Quake2) was compiled without any SIMD.
Why would anybody use assembler without SIMD? Today's C++ compilers are good enough for anything else.
What are the "advantages of the .net runtime"???
P.S.
Don't think I wrote the code above. It was extracted from the AMD SDK (I could write it, but there is no point in reinventing it).
[edited by - Red Drake on June 3, 2004 2:50:17 PM]
Red Drake
Maybe someone should do another experiment:
Take the Q1 source, set the rendering engine to use C, not ASM, and then convert it to C#.
See how fast it is compared to the ASM version.
It would be pretty difficult to write a faster version of this strcpy function in C or any other language. It assumes pure 7-bit ASCII. Using it as inline asm instead of a separately assembled asm routine adds a little overhead (a push/pop of ebp and a pair of mov esp, ebp ops).
-steven
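char *Strcpy(char *to, const char* from)
{
    __asm
    {
        mov edx, to
        mov ebx, from
    TOP:
        mov eax, [ebx]
        mov ecx, eax
        sub ecx, 0x01010101
        and ecx, 0x80808080     /* nonzero if some byte of eax is 0 (7-bit ASCII only) */
        jnz GOT_NULL
        mov [edx], eax
        /* Unroll the loop a little */
        mov eax, [ebx+4]
        mov ecx, eax
        sub ecx, 0x01010101
        and ecx, 0x80808080
        jnz GOT_NULL_4          /* each unrolled step needs its own null check */
        mov [edx+4], eax
        /* a little more */
        mov eax, [ebx+8]
        mov ecx, eax
        sub ecx, 0x01010101
        and ecx, 0x80808080
        jnz GOT_NULL_8
        mov [edx+8], eax
        /* one last time */
        mov eax, [ebx+12]
        mov ecx, eax
        sub ecx, 0x01010101
        and ecx, 0x80808080
        jnz GOT_NULL_12
        mov [edx+12], eax
        /* done unrolling */
        add ebx, 16
        add edx, 16
        jmp TOP
    /* null found in a later dword: advance edx to it, 4 bytes per label */
    GOT_NULL_12:
        add edx, 4
    GOT_NULL_8:
        add edx, 4
    GOT_NULL_4:
        add edx, 4
    GOT_NULL:
        /* copy the last dword byte by byte, stopping after the null */
        mov [edx], al
        test al, al
        je END
        inc edx
        shr eax, 8
        mov [edx], al
        test al, al
        je END
        inc edx
        shr eax, 8
        mov [edx], al
        test al, al
        je END
        inc edx
        mov byte ptr [edx], 0   /* operand size must be explicit here */
    END:
    }
    return to;
}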
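The core trick in the routine is the four-bytes-at-once null test. A small C sketch of just that test is below, with made-up function names. The first function is the same expression the asm uses. The second adds an "& ~x" term, a well-known variant that stays correct for bytes above 0x7F. This only illustrates the test itself, not a full strcpy.

```c
#include <stdint.h>

/* Same test as the asm: nonzero when some byte of x is 0. Only reliable
   when every byte is 7-bit; bytes 0x81..0xFF produce false positives. */
static uint32_t has_zero_byte_7bit(uint32_t x)
{
    return (x - 0x01010101u) & 0x80808080u;
}

/* Variant with an extra "& ~x": a byte's high bit survives only if that
   byte was below 0x80 to begin with, which removes the false positives. */
static uint32_t has_zero_byte(uint32_t x)
{
    return (x - 0x01010101u) & ~x & 0x80808080u;
}
```

A C strcpy built on this also has to deal with alignment and with reading up to three bytes past the terminator, which is part of why the asm version is attractive here.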
How do I post my code in a nicely formatted white box as others have done, instead of as unformatted text like my asm source above?
-steven
quote: Original post by unicityd
How do I post my code in a nicely formatted white box as others have done, instead of as unformatted text like my asm source above?
-steven
Use:
[ source ]
at the beginning and
[ /source ]
at the end.
(Don't put spaces between [ and the words.)
See the FAQ about this (in the beginners forum, I think).
And do you think that any compiler could write a quaternion multiplication for AMD 3DNow! better than that (in my previous post)?
[edited by - Red Drake on June 3, 2004 5:38:03 PM]
Red Drake
Red Drake,
I'm not familiar with 3DNow!, but, assuming that AMD's own programmers know assembly fairly well and how to use 3DNow! correctly, no.
When properly used, assembly is the fastest language. It's not always the best or the safest, but it is the fastest. A good compiler may be able to produce better code than average asm programmers, but a good assembly programmer will always be able to make better decisions.
If an assembly programmer can take the compiler-generated asm and save even one clock cycle, he wins. It may not be worth the effort to save so little but, as I said, assembly isn't always the best language to use. At best, a *perfect* compiler can match the best known assembly language implementation techniques.
This topic is closed to new replies.