How do I write inline functions in inline asm?
First, take a look at my previous post, "Very stupid VC++ inline asm".
Once you have read that post, you should realize that there is no way to write efficient inline asm in VC++.
Not only must the variables used in the asm code be passed as stack variables (ebp-?); the return value must also go through a stack variable. That makes 4 completely unnecessary mov instructions, because the variables will nearly always be in registers before and after the call.
In fact, I think asm in a fastcall function would be faster than asm in an inline function, at least when using the mm registers, because then there is no need to preserve registers. The only overhead is the function call and the ret instruction.
I need inline asm to build a 3DNow-optimized vector, matrix and quaternion library in C++. I already have the optimized assembler functions from AMD's SDK. I am also planning to add SSE-optimized code to this library, using asm output from the Intel compiler.
Modifying the code generated by the compiler by removing the unnecessary mov instructions is not a portable solution. It would also be a lot of work to search for all the inlined functions and change them all, or at least the ones in tight inner loops.
I am also very surprised that there is no C++ compiler that can generate 3DNow-optimized code. VectorC only supports C, and I could never think of using C for my matrix functions; that would make the code very unstructured and hard to read. I just love OOP.
Fred Sundvik
October 11, 2000 07:49 PM
I think AMD has a VC6 add-in for creating AMD-optimized programs.
// This doesn't do the type of thing you want?
inline int foo(int a, int b)
{
    __asm
    {
        mov eax, a;
        mov ebx, b;
        add eax, ebx;
    } // eax is returned by default
}
No, not exactly. It works fine when you write the parameters directly in the function call. But when I write a program like this:
inline int foo(int a, int b)
{
    __asm
    {
        mov eax, a;
        mov ebx, b;
        add eax, ebx;
    } // eax is returned by default
}

int main(int argc, char* argv[])
{
    int b = 0;
    for (int a = 5; a < 10; a++)  // this loop is only here to make sure
    {                             // VC++ doesn't apply any smart
        cout << a;                // optimizations
        cout << b;
        b++;
    }
    int c = foo(a, b);            // a is still in scope here under VC6's for-loop scoping
    cout << c;
    return 0;
}
It produces a disassembly like this:
00401000 sub_401000 proc near ; CODE XREF: start+AFp
00401000
00401000 var_8 = dword ptr -8
00401000 var_4 = dword ptr -4
00401000
00401000 push ebp
00401001 mov ebp, esp
00401003 sub esp, 8
00401006 push ebx
00401007 push esi
00401008 push edi
00401009 xor edi, edi
0040100B mov esi, 5
00401010
00401010 loc_401010: ; CODE XREF: sub_401000+2Bj
00401010 push esi
00401011 mov ecx, offset dword_40B9B8
00401016 call ??6ostream@@QAEAAV0@H@Z ; ostream::operator<<(int)
0040101B push edi
0040101C mov ecx, offset dword_40B9B8
00401021 call ??6ostream@@QAEAAV0@H@Z ; ostream::operator<<(int)
00401026 inc edi
00401027 inc esi
00401028 cmp esi, 0Ah
0040102B jl short loc_401010
0040102D mov [ebp+var_8], esi ;here it
00401030 mov [ebp+var_4], edi ;copies
00401033 mov eax, [ebp+var_8] ;a and b
00401036 mov ebx, [ebp+var_4] ;to a temp
00401039 add eax, ebx
0040103B push eax
0040103C mov ecx, offset dword_40B9B8
00401041 call ??6ostream@@QAEAAV0@H@Z ; ostream::operator<<(int)
00401046 pop edi
00401047 pop esi
00401048 xor eax, eax
0040104A pop ebx
0040104B mov esp, ebp
0040104D pop ebp
0040104E retn
0040104E sub_401000 endp
This is all according to the spec, because C++ makes a copy when a variable is passed by value. So I changed the code to take reference parameters. But this made the optimizer go totally crazy:
00401000 var_10 = dword ptr -10h
00401000 var_C = dword ptr -0Ch
00401000 var_8 = dword ptr -8
00401000 var_4 = dword ptr -4
00401000
00401000 push ebp
00401001 mov ebp, esp
00401003 sub esp, 10h
00401006 mov eax, 5
0040100B push ebx
0040100C mov [ebp+var_4], 0
00401013 mov [ebp+var_8], eax
00401016
00401016 loc_401016: ; CODE XREF: sub_401000+40j
00401016 push eax
00401017 mov ecx, offset dword_40B9B8
0040101C call ??6ostream@@QAEAAV0@H@Z ; ostream::operator<<(int)
00401021 mov eax, [ebp+var_4]
00401024 mov ecx, offset dword_40B9B8
00401029 push eax
0040102A call ??6ostream@@QAEAAV0@H@Z ; ostream::operator<<(int)
0040102F mov ebx, [ebp+var_4]
00401032 mov eax, [ebp+var_8]
00401035 inc ebx
00401036 inc eax
00401037 cmp eax, 0Ah
0040103A mov [ebp+var_4], ebx
0040103D mov [ebp+var_8], eax
00401040 jl short loc_401016
00401042 lea ecx, [ebp+var_4]
00401045 lea edx, [ebp+var_8]
00401048 mov [ebp+var_C], ecx
0040104B mov [ebp+var_10], edx
0040104E mov eax, [ebp+var_10]
00401051 mov ebx, [ebp+var_C]
00401054 add eax, ebx
00401056 push eax
00401057 mov ecx, offset dword_40B9B8
0040105C call ??6ostream@@QAEAAV0@H@Z ; ostream::operator<<(int)
00401061 xor eax, eax
00401063 pop ebx
00401064 mov esp, ebp
00401066 pop ebp
00401067 retn
00401067 sub_401000 endp
The problem here seems to be that it is impossible to pass registers as parameters. Inline __fastcall won't help either.
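As a hedged aside on why the reference version looks so different: a by-value parameter is a separate copy, while a reference parameter is in practice the address of the caller's object. In the second listing the two lea instructions appear to take the addresses of the locals, and the inline asm then loads from those slots, which is consistent with that reading. A minimal portable C++ sketch of the difference follows; the function names are made up for illustration and are not from the original code.

```cpp
// Hypothetical helpers, not part of the original post.
// Each one reports whether its parameter occupies the same storage
// as the caller's object.

// By value: x is a fresh copy, so its address differs from the caller's.
bool sharesStorageByValue(int x, const int* callerObject)
{
    return &x == callerObject;
}

// By reference: x is an alias for the caller's object, so the addresses match.
bool sharesStorageByRef(const int& x, const int* callerObject)
{
    return &x == callerObject;
}
```

Calling both with the same variable and its address, for example sharesStorageByValue(v, &v) and sharesStorageByRef(v, &v), shows that only the reference version sees the caller's storage.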
I have to check the eax return in my code.
I did not find any add-in on the AMD site, only the 3DNow SDK.
Researching further, I found out that you must mean the Microsoft Processor Pack. I had a look at it yesterday, but I thought it only enabled inline asm with 3DNow instructions. When I looked at it again, I saw that you can program with some sort of intrinsics. I don't know what they are yet, but I will have a look. I think they are what I need to let the compiler optimize the 3DNow code.
But the inline asm problem still exists. I wonder why I have not heard of this before. I have searched the whole gamedev forum and found nothing. I have also done many different searches in Dejanews, still nothing.
Has anyone actually produced faster code with inline asm than the compiler does by itself? When I have disassembled the code, I have seen how smart the compiler is. But when you put in your own asm, the optimization doesn't work well anymore.
Well, I think that inline asm in inline functions is not supported at all. I tried compiling a very simple inline function with inline asm in C++Builder, but it gave me an error saying that inline assembler is not allowed in inline and template functions.
I have also played around with inline asm in VC++ a bit more to find a solution to my problem. The only thing I found out was that the standard calling conventions are not followed in inline functions. For example, I tried to make a function both fastcall and inline, with the input parameters as references or pointers. In that case it gave me one input parameter in esi instead of ecx or edx. The function worked perfectly without the inline keyword.
I have also tried the intrinsics in the Processor Pack. But they don't generate good code at all when, for example, you multiply a register with itself and store the result in the same register. The compiler adds a totally unnecessary temporary variable, even though the intrinsic can be implemented with a single assembly instruction.
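For readers who have not seen intrinsics, here is a rough sketch of the "multiply a register with itself" case. It is an assumption-laden stand-in: it uses the SSE intrinsics from <xmmintrin.h> rather than the 3DNow intrinsics of the Processor Pack discussed above, the helper name squareInPlace is made up, and it falls back to scalar code when SSE is not available at compile time. It says nothing about how good the generated code is on any particular compiler.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Hypothetical helper: squares four floats in place.
inline void squareInPlace(float* v)
{
#if SKETCH_HAVE_SSE
    __m128 r = _mm_loadu_ps(v);   // load four unaligned floats
    r = _mm_mul_ps(r, r);         // multiply the register with itself
    _mm_storeu_ps(v, r);          // store the result back to memory
#else
    for (std::size_t i = 0; i < 4; ++i)
        v[i] *= v[i];             // scalar fallback when SSE is unavailable
#endif
}
```

Whether the compiler keeps r in a single register or introduces a temporary is exactly the kind of thing that has to be checked in the disassembly.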
I coded a simple graphics routine in inline asm and compiled it with the Standard edition of MSVC, and then assembled the same routine in TASM and linked it into MSVC. The linked function ran 10 times faster. The Professional edition of MSVC makes more optimizations, but it's out of my price range.
If you don't like all of the extra overhead associated with the function and you are using MSVC, just declare the function as naked. Example:
#include <windows.h>

int __inline __declspec(naked) DoIt( int x, int y )
{
    __asm
    {
        ; // Prolog code here...
        push ebp
        mov  ebp, esp
        sub  esp, __LOCAL_SIZE

        ; // Actual code of the function...
        ; // (note: ebx is callee-saved, so a naked function should
        ; //  preserve it or use ecx/edx instead)
        mov eax, dword ptr [x];
        mov ebx, dword ptr [y];
        xor eax, ebx;

        ; // Epilog code here...
        mov esp, ebp
        pop ebp
        ret
    }
}

int WINAPI WinMain( HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd )
{
    return DoIt(10,20);
}
Now all of the immediate b.s. before the function call is omitted. YOU just have to take care of it yourself, hence the prolog and epilog stuff in the example. I find it hard to believe that those 5 extra lines of assembly code are going to kill your function's performance.
But then, it's possible that I totally misunderstand the whole topic of discussion. Just an idea though.
Regards,
Jumpster
Semper Fi
Jumpster:
Have you tried compiling this code? VC++ won't let a naked function go inline, at least not VC++ 6 Enterprise with Service Pack 4 and the Processor Pack. And even if it did compile, this code creates temporary variables, as I explained earlier.
The fastest way to use inline asm that I can think of is naked fastcall functions. You have to use naked here, because the compiler generates a prolog and an epilog if you use, for example, the ecx register. Consider the following code:
float __fastcall frVector::Length3DNow()
{
    __asm
    {
        femms
        movq mm0, [ecx]this.data     // it will not find data directly,
        movd mm1, [ecx]this.data+8   // even though it is a member
        ...
    }
}
Because it is a fastcall, the this pointer arrives in ecx. But the spec says that ecx should be preserved, at least in this case, because this is a const pointer. Even if I am not changing the value of ecx, the compiler generates a prolog and an epilog, which basically push and pop ecx. Declaring such a function as naked eliminates that.
Well, this is naturally not the most efficient code. But I think I have to stick with the code the compiler generates when using inline functions. After all, the compiler generates very good code with all optimizations enabled. For slightly more complex code you have to use the fastcall version.
But it could be very efficient to compile a separate exe for the 3DNow-optimized code. Alternatively, compile the code to a DLL, load the right DLL at runtime when the program starts, and run all code from it.
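The "pick the right DLL at startup" idea can be approximated inside one executable with a table of function pointers that is chosen once. The sketch below is only an illustration under stated assumptions: Vec3, VectorOps, selectVectorOps and the stand-in functions are hypothetical names, and the "3DNow" versions simply forward to the generic code so that the example runs on any machine.

```cpp
#include <cmath>

// Hypothetical vector type and function table; none of these names
// come from the original post.
struct Vec3 { float x, y, z; };

struct VectorOps
{
    float (*length)(const Vec3&);
    float (*dot)(const Vec3&, const Vec3&);
};

// Generic implementations in plain C++.
static float lengthGeneric(const Vec3& v)
{
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

static float dotGeneric(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Stand-ins for the optimized versions. In a real build these would hold
// the hand-written asm; here they forward to the generic code.
static float length3DNowStandIn(const Vec3& v) { return lengthGeneric(v); }
static float dot3DNowStandIn(const Vec3& a, const Vec3& b) { return dotGeneric(a, b); }

// Picks one table, much like choosing which DLL to load at startup.
inline const VectorOps& selectVectorOps(bool has3DNow)
{
    static const VectorOps generic   = { lengthGeneric, dotGeneric };
    static const VectorOps optimized = { length3DNowStandIn, dot3DNowStandIn };
    return has3DNow ? optimized : generic;
}
```

The caller would fetch the table once and then call through it, for example ops.length(v). Every call is then an indirect call, so nothing behind the table can be inlined.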
But since no one has come up with a working suggestion, I think there is no point in compiling different versions. I am going to use runtime checking for processor compatibility. Putting the flag in a static member variable will eliminate most of the overhead of runtime checking. All modern processors have branch prediction, and the branch will be predicted correctly nearly all the time. But I guess I have to do a bit of testing to know which functions should be implemented with 3DNow and which without. I guess this would leave nearly all vector functions inline without 3DNow support, and the rest, the matrix and quaternion functions, with only the support check inline and the actual functions as fastcalls.
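A minimal sketch of that plan, assuming a flag that is set once at startup: CpuFeatures, Quat, normSquared and the worker functions are hypothetical names, the actual CPU detection is left out (the caller passes the result in), and the "3DNow" worker is a plain C++ stand-in so that the example runs anywhere.

```cpp
// Hypothetical feature flag kept in a static member variable.
class CpuFeatures
{
public:
    // Call once at startup with the result of whatever detection is used.
    static void init(bool detected3DNow) { has3DNow_ = detected3DNow; }
    static bool has3DNow() { return has3DNow_; }

private:
    static bool has3DNow_;
};

bool CpuFeatures::has3DNow_ = false;

struct Quat { float x, y, z, w; };

// Out-of-line workers. The optimized one would be the fastcall asm routine
// in a real build; here it is a stand-in that forwards to the generic code.
static float normSquaredGeneric(const Quat& q)
{
    return q.x * q.x + q.y * q.y + q.z * q.z + q.w * q.w;
}

static float normSquared3DNowStandIn(const Quat& q)
{
    return normSquaredGeneric(q);
}

// Only this small check is meant to be inlined at the call site.
inline float normSquared(const Quat& q)
{
    return CpuFeatures::has3DNow() ? normSquared3DNowStandIn(q)
                                   : normSquaredGeneric(q);
}
```

Because the flag never changes after startup, the branch in normSquared always goes the same way, which is the situation in which branch prediction works well.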
Actually, as a matter of fact, I did compile the code. I didn't encounter a problem at all. I'm using MSVC 6 Enterprise with Service Pack 4.
However, I do understand what you're saying about the temporary variables. I'll get back to ya.
Jumpster
Regards,
Jumpster
Semper Fi
Have you tried disassembling the file? The code is not inlined.
The compiler won't create temporary variables when you pass constants, so you have to make sure that the compiler doesn't know the values you are passing to the function when calling it. Be aware that the compiler is quite smart; a simple loop will not stop it from knowing the values. But if you print them inside the loop, like I did earlier, or read them from the user, there should be no problems.
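Besides printing the values or reading them from the user, another common way to hide a value from the optimizer is to route it through a volatile object. This is a different technique from the one used earlier in the thread, and the names opaque and foo below are hypothetical; foo is a plain C++ stand-in for the asm version.

```cpp
// Hypothetical helper: returns v, but reads it back from a volatile object,
// so the optimizer has to treat the result as unknown at compile time.
inline int opaque(int v)
{
    volatile int slot = v;
    return slot;
}

// Plain C++ stand-in for the foo() discussed in the thread.
inline int foo(int a, int b)
{
    return a + b;
}
```

Calling foo(opaque(9), opaque(5)) still computes the sum, but the compiler can no longer fold the call into a constant, so the disassembly shows how the arguments are really passed.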