Optimisations we've been discussing recently
Here, at opengl.org, and at flipcode there has been a lot of noise about a 'faster X': the memcpy thread, flipcode's faster casting, the fastest normalisation at opengl.org, etc.
Has anyone ever collected all of these in one place? (Besides TCS.) Nvidia has a fastmath.h, TCS has one, and most people have a few bits kicking around, but without being an assembly programmer it's hard to tell the difference between one fastest asm square root and another.
What other resources are out there for this kind of thing? For example, my math library is currently written for readability. Soon, however, I'll be writing a structure of classes that allows me to define one routine and have the fastest platform-dependent code fragment run.
I currently have a large, detailed CPU class that detects dozens of processor types, times the CPU, counts instructions, and so on.
It's a bit of a mess now, but once it's clean, what other resources can I put in it?
Here are the optimised bits of code I'm aware of:
My asm timer (which needs a little work but handles threading)
Nvidia fast math
TCS's fast math
3Dnow! SDK
SSE SDK? (Never did find that one)
Fastest memset/memcpy (here)
Fastest float-to-int (flipcode)
SSE Matrix library
Fastest normalisation (opengl.org forums)
Fastest Power of Two
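For reference, the power-of-two entry usually boils down to bit tricks like these (a plain sketch; the function names are my own):

```cpp
#include <cstdint>

// Round a 32-bit value up to the next power of two (0 maps to 0).
// Classic bit-smearing trick: copy the highest set bit into every
// lower position, then add one.
static std::uint32_t NextPow2(std::uint32_t v)
{
    --v;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}

// Test whether a value is a power of two (zero is excluded).
static bool IsPow2(std::uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}
```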
What else would fit well into this kind of scheme? Are you aware of any links? Depending on the interest, I'm happy to release the results of this quest to the public.
Something I know I need:
A good way of telling a function declaration to use either, say, the AMD FastMemCpy32 or the SSE FastMemCpy32. Function pointers? Static libs and two exes? Ideas?
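A minimal sketch of the function-pointer approach (FastMemCpy32, InitFastMemCpy, and the hasSSE flag are illustrative names of my own, not from any real SDK; the bodies just forward to memcpy as stand-ins for the optimised assembly):

```cpp
#include <cstring>
#include <cstddef>

// Two hypothetical implementations; in practice these would be the
// AMD- and SSE-optimised assembly versions.
static void MemCpyGeneric(void *dst, const void *src, std::size_t n)
{
    std::memcpy(dst, src, n);
}
static void MemCpySSE(void *dst, const void *src, std::size_t n)
{
    std::memcpy(dst, src, n);  // stand-in for the SSE version
}

// The library exposes one pointer; callers never know which body runs.
typedef void (*MemCpyFn)(void *, const void *, std::size_t);
MemCpyFn FastMemCpy32 = MemCpyGeneric;

// Called once at startup, after CPU detection has run.
void InitFastMemCpy(bool hasSSE)
{
    FastMemCpy32 = hasSSE ? MemCpySSE : MemCpyGeneric;
}
```

Every call site then just writes `FastMemCpy32(dst, src, n);` and pays one indirect-call's worth of overhead.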
I can do the grunt work of gathering the data, organising it, and making it look pretty in code, but I don't know more than assembler basics, so I can't tell more than which code fragment is fastest.
Many thanks, and I hope this turns out to be something useful.
Chris
Chris Brodie
I think you'd want to use function pointers; LaMothe says they're better than case statements.
I think you'd want to make a DLL (or two or three), not a static lib. That way you can add support for more CPUs (when the P4 is widely available, or the K8...) a little more easily. You can then determine the processor and load the appropriate DLL at run time, rather than have separate builds. If you use function pointers and one build, it's no faster than a DLL.
...which might kill performance, though. I usually inline most of my math ops; you don't want to add function-call overhead to a vector normalization.
You want to be able to change inlined code at run time. That's not impossible, but it's not easy either. I don't think you're allowed to write to the code area from ring 3 (a user app), but you can (usually not on purpose) from a driver in ring 0. So, you could call a proxy function and insert a bunch of nops wherever you want to use specialized code. Then you send the list of proxy locations and a function tag to the 'code stomper' driver; it has a look at the CPU type and so on, loads the DLL, copies the specialized code to each spot in the list, and adds a jump to the end of the nops.
Maybe you could just make specialized builds and have the installer determine which one to run.
...
Cubic spline interpolation?
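If cubic spline interpolation makes the list, a scalar Catmull-Rom evaluation is the usual baseline to beat with any SIMD version (a plain sketch, not optimised):

```cpp
// Catmull-Rom interpolation between p1 and p2, with p0 and p3 as the
// neighbouring control points; t runs from 0 (returns p1) to 1 (returns p2).
static float CatmullRom(float p0, float p1, float p2, float p3, float t)
{
    float t2 = t * t;
    float t3 = t2 * t;
    return 0.5f * ((2.0f * p1) +
                   (-p0 + p2) * t +
                   (2.0f * p0 - 5.0f * p1 + 4.0f * p2 - p3) * t2 +
                   (-p0 + 3.0f * p1 - 3.0f * p2 + p3) * t3);
}
```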
Magmai Kai Holmlor
- The disgruntled & disillusioned
- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara
Try looking up the kernel function VirtualAlloc and taking note of the flProtect parameter. Would it be possible to load a DLL's functions directly into the EXE's memory, replacing the generic functions at run time? This would take less space for the binaries than compiling multiple EXEs for the purpose, and as far as I know it would be faster than placing the functions in DLLs.
Hmm... I had read about compressed EXE files and the like, so I think it is possible for an EXE to modify its own code. It does sound fairly complicated, though. Anybody know more about this?
quote:
Would it be possible to load a DLL's functions directly into the EXE's memory, replacing the generic functions at run time?
It'd be a lot easier just to have a small "stubs" DLL that did some CPU detection, then hooked all of its APIs to the appropriate DLL.
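The stub idea can even be done without a separate DLL, using a self-rebinding function pointer: it starts out pointing at a resolver that does the CPU check on the first call, swaps in the right implementation, and forwards the call. A portable sketch (FastSqrt and the Newton-iteration fallback are illustrative names of my own; real code would run CPUID here and pick an SSE or 3DNow! version):

```cpp
// One exported pointer; it begins aimed at a resolver that rebinds it
// on first use, so every later call jumps straight to the chosen version.
typedef float (*SqrtFn)(float);

static float SqrtGeneric(float x);
static float SqrtResolver(float x);

SqrtFn FastSqrt = SqrtResolver;

// Portable fallback: Newton's method for the square root. An SSE or
// 3DNow! build would supply its own versions for the resolver to pick.
static float SqrtGeneric(float x)
{
    float guess = x > 1.0f ? x : 1.0f;
    for (int i = 0; i < 25; ++i)
        guess = 0.5f * (guess + x / guess);
    return guess;
}

// Runs exactly once; real code would do the CPUID check here.
static float SqrtResolver(float x)
{
    FastSqrt = SqrtGeneric;   // e.g. pick SqrtSSE when detected
    return FastSqrt(x);
}
```

After the first call the dispatch cost is exactly one indirect call, the same as any other function pointer.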
Do any of the asm people here know what the instruction overhead is of calling a function straight from a DLL?
This seems like the cleanest solution so far. (I don't want the library to be so complicated that people will be afraid of using it.) Of course, the overhead of the switching mechanism will decide how efficient it is to use a CPU-specialised function in the library. If, for example, it takes 10 clocks to call a sqrt that is 10 clocks faster, then nobody would bother.
So, is anyone willing to look up the instruction count of calling a function in a DLL dynamically bound at run time (not including the initial DLL binding)?
In honesty, I've never needed DLLs before this (I have a background in COM, though). Here is how I think I would need to load the DLL functions:
void *DllLoad(const std::string &a_Name)
{
    HMODULE handle = LoadLibrary(a_Name.c_str());
    return (void *)handle;
}
then
void *DllGetFunction(void *handle, const std::string &a_Name)
{
    return (void *)GetProcAddress((HMODULE)handle, a_Name.c_str());
}
After that, however, I think the void* pointer it hands back is actually just a code pointer (the function's entry point) and can be used in just the same way as normal code (i.e. I think it's just a bunch of instructions at a memory address, so it should be the same, shouldn't it?). If that's correct, we're only talking about a function pointer.
Isn't that just as fast at run time? I mean, a normal (non-inlined) function call is just a jump to a code block at a different address.
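That is essentially right on Windows: the address handed back is the function's entry point, and once cast to the proper type a call through it costs the same as any other indirect call. A portable illustration of the cast (AddOne stands in for a real DLL export; strictly, converting between object and function pointers is only conditionally supported in C++, though Win32 guarantees it works):

```cpp
// Stand-in for a routine that would really be exported from the DLL.
static int AddOne(int x) { return x + 1; }

typedef int (*AddOneFn)(int);

int CallViaVoidPtr()
{
    // This is the shape of what DllGetFunction hands back:
    void *raw = (void *)&AddOne;

    // Cast back to the real signature before calling; calling through
    // the wrong signature is undefined behaviour.
    AddOneFn fn = (AddOneFn)raw;
    return fn(41);   // a plain indirect call: push args, call through fn
}
```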
Anyone want to set me straight on what I've said? Anyone want to read an asm listing of a DLL function call to confirm?
Thanks all...
Chris
Chris Brodie
There is no SSE SDK. SSE is a set of instructions that Pentium 3s have; you write assembly code that takes advantage of those instructions. The same goes for MMX and the Pentium 4's SSE2. MMX is integer-only and somewhat limited, whereas SSE deals with floating-point numbers (and SSE2 adds double-precision floats plus 128-bit integer operations). SSE[2] is not a replacement for MMX; in fact, the Pentium 3 also added new, very useful MMX instructions.
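For what it's worth, you don't have to write raw assembly to use these instruction sets: compilers expose them as C intrinsics through xmmintrin.h. A minimal SSE sketch (x86-only; assumes an SSE-capable compiler and CPU):

```cpp
#include <xmmintrin.h>  // SSE intrinsics (MSVC, GCC, Intel compiler)

// Add two arrays of four floats using single SSE instructions for the
// loads, the add, and the store.
void Add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);               // unaligned 4-float load
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));    // addps, then store
}
```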
~CGameProgrammer( );
Developer Image Exchange -- New Features: Upload screenshots of your games (size is unlimited) and upload the game itself (up to 10MB). Free. No registration needed.
The overhead of calling a DLL function is the same as using a function pointer.
You'd want to load the DLL dynamically, not statically, by the way. But that's not a good solution for things like vector normalization; you'd have to load whole sections of the game from the DLLs to minimize the function-call overhead...
Magmai Kai Holmlor
- The disgruntled & disillusioned