No, I haven't looked at the assembly listings, but if they are the same, why did my version consistently perform slightly better? Interesting... Now I'll have to look...
Regards,
Jumpster
a memset that doesn't suck
Yesterday I profiled Jumpster's code against the standard memset function, which lists exactly the way NuffSaid said, and I found out that it was a cache problem. If I put Jumpster's code right before the normal memset, the normal memset was slightly faster; if I did it the opposite way, with the normal memset first and Jumpster's code right after, Jumpster's code was slightly faster.
/maq
Sounds interesting. Could someone explain to me what maq meant by the cache being an influence?
==========================================In a team, you either lead, follow or GET OUT OF THE WAY.
If you memcpy'ed a small amount of data (smaller than the L3/L2/L1 cache size), it'd be waiting there in the cache for the next memcpy, so whichever routine is called first pays a cache-miss penalty that the second call does not.
Copy ten megs, and if the difference goes away, it may very well be a cache-hit bonus on the second memcpy.
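One way to take call order out of the measurement is to touch the buffer once before timing and then keep the best of several runs for each routine. The following is only a rough sketch: `fill_loop` is a hypothetical stand-in for the hand-written routine (it is not Jumpster's code), and `time_once`/`time_best` are names made up for this example. It uses C11 `timespec_get`, which is a wall clock, so very small buffers will mostly measure timer noise.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature so both routines go through the same timing code. */
typedef void (*fill_fn)(void *dst, int value, size_t n);

/* The library routine under test. */
void fill_libc(void *dst, int value, size_t n)
{
    memset(dst, value, n);
}

/* Hypothetical stand-in for a hand-written fill routine. */
void fill_loop(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    for (size_t i = 0; i < n; i++)
        p[i] = (unsigned char)value;
}

/* Time a single call, in seconds (C11 timespec_get, wall-clock time). */
double time_once(fill_fn fn, void *buf, size_t n)
{
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    fn(buf, 0xAB, n);
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Do one untimed warm-up call so the routine under test does not pay the
   first-touch cost (page faults, cold cache), then keep the best of
   `reps` timed runs. */
double time_best(fill_fn fn, void *buf, size_t n, int reps)
{
    fn(buf, 0, n); /* warm-up, not timed */
    double best = time_once(fn, buf, n);
    for (int i = 1; i < reps; i++) {
        double t = time_once(fn, buf, n);
        if (t < best)
            best = t;
    }
    return best;
}
```

Calling `time_best(fill_libc, buf, n, 10)` and `time_best(fill_loop, buf, n, 10)` in either order should then give comparable numbers. Repeating with a buffer far larger than the cache (the ten megs suggested above) shows whether any remaining gap is really a cache effect.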
quote:
Notice the pushad/popad instructions are missing?
MSVC may add them for safety in debug mode; are they still inserted in a 'retail' build?
Magmai Kai Holmlor
- The disgruntled & disillusioned
- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara
February 15, 2001 08:09 PM
quote:
Original post by Magmai Kai Holmlor
Notice the pushad/popad instructions are missing?
MSVC may add them for safety in debug mode; are they still inserted in a 'retail' build?
I believe MSVC always automatically preserves certain registers for you (though not all) in the function prologue and epilogue when you use __asm blocks. So, in all probability, the pushad/popad combo will be redundant even when compiling in release mode.
– Bevan
Cache miss penalty? But my numbers were not created by calling the two functions in sequence. I actually created two separate programs, identical in every way except for the call that copies the memory: MemCopy32Bit() and memcpy(). MemCopy32Bit() was slightly faster in every execution of the program. If the code is the same, then why would that be?
Regards,
Jumpster
Semper Fi
Hi,
rep stosd is slow.
During a memory set or copy operation you must take care of
DWORD alignment as well as cache optimizations.
I think a loop which copies 16 bytes in each iteration
would be faster than copying a single DWORD at a time. The 16 comes from
the cache line size.
--MFC (The Matrix Foundation Crew)
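To make the suggestion above concrete, here is a rough sketch of a fill loop that handles 16 bytes per iteration. It does this with four 32-bit stores per pass, not with a single 128-bit store. The name `fill16` is made up for this example, and whether it actually beats rep stosd or the library memset on a given CPU is something you would have to measure.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill n bytes at dst with the low byte of value.
   Assumption: the buffer may legally be accessed as 32-bit words (true
   for malloc'd memory); a strictly portable version would copy each
   word in with memcpy instead of storing through a uint32_t pointer. */
void *fill16(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char b = (unsigned char)value;

    /* Head: single byte stores until p is DWORD (4-byte) aligned. */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = b;
        n--;
    }

    /* Body: replicate the byte into a 32-bit word and write four words,
       i.e. 16 bytes, per iteration. */
    uint32_t w = (uint32_t)b * 0x01010101u;
    uint32_t *q = (uint32_t *)p;
    while (n >= 16) {
        q[0] = w;
        q[1] = w;
        q[2] = w;
        q[3] = w;
        q += 4;
        n -= 16;
    }

    /* Tail: whatever is left over (fewer than 16 bytes). */
    p = (unsigned char *)q;
    while (n > 0) {
        *p++ = b;
        n--;
    }
    return dst;
}
```

The head loop only aligns to 4 bytes, as the post asks for. Aligning to the full 16 bytes so that each iteration stays within one block would be a small change to the mask.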
I'm no computer architecture guru, but I'd just like to know how you're going to copy 16 bytes at each iteration? AFAIK, that's only possible if you've got a 128 bit bus, which many platforms don't have, right?
Edited by - NuffSaid on February 17, 2001 5:50:18 AM
==========================================In a team, you either lead, follow or GET OUT OF THE WAY.
Good point, is there a write cache, or can writes be pipelined?
Magmai Kai Holmlor
- The disgruntled & disillusioned
- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara
If you want to get rid of the calling overhead you can use '__declspec(naked)' for the function; then MSVC does not add a prolog or epilog to it. Or you could just write the function as an asm file and set custom build rules to assemble it with MASM (or the like).
And now a quick question: does anyone know how the AMD 'prefetch' instruction works? I just get errors if I try to assemble 'prefetch eax' (or whatever). And would this possibly help in this situation (i.e. with writes)?
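As far as I know, prefetch takes a memory operand, not a bare register, so the assembler form would be along the lines of 'prefetch [eax]'. That may be why 'prefetch eax' is rejected. AMD also defines a 'prefetchw' variant for data that is about to be modified. From C, the same hint can be sketched with the GCC/Clang builtin `__builtin_prefetch`. The sketch below rests on assumptions: `fill_prefetch`, the `PREFETCH_FOR_WRITE` macro and the 64-byte look-ahead are made up for this example, and on other compilers the macro compiles to nothing.

```c
#include <stddef.h>

/* GCC and Clang provide __builtin_prefetch(addr, rw); rw = 1 hints that
   the data will be written. On other compilers this sketch simply
   skips the hint. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_FOR_WRITE(addr) __builtin_prefetch((addr), 1)
#else
#define PREFETCH_FOR_WRITE(addr) ((void)0)
#endif

/* Byte fill that hints the next block before writing the current one.
   The 64-byte step and look-ahead distance are assumptions, not tuned
   values. */
void fill_prefetch(unsigned char *dst, unsigned char value, size_t n)
{
    const size_t ahead = 64;

    for (size_t i = 0; i < n; i++) {
        /* Once per 64 bytes, hint the block we will reach next. */
        if ((i & 63u) == 0 && i + ahead < n)
            PREFETCH_FOR_WRITE(dst + i + ahead);
        dst[i] = value;
    }
}
```

A prefetch is only a hint and never changes what the program computes, so the result is identical with or without it. Whether it speeds up a pure write loop depends on the CPU and has to be measured.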
This topic is closed to new replies.