
a memset that doesn't suck

Started by February 10, 2001 03:58 PM
39 comments, last by Shannon Barber 23 years, 11 months ago
The easiest way is to find out the word size of the machine you're compiling for.

A Windows machine would be 32 bits, so get a void pointer to the array you want to clear, cast it to a pointer to unsigned int (32 bits on Win32), and write a for loop that fills in each index with an unsigned int.

The compiler takes care of everything. (And if it's worth its salt, it will keep the values in CPU registers automatically.)

Of course there are ASM instructions that fill entire segments of memory with a word, but using ASM is unportable and may not work on some CPUs, so it's best to do this in a C++-friendly way.
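
For illustration, the loop described above might look roughly like this. It is only a sketch: the name fill_words32 is made up, uint32_t stands in for "unsigned int on Win32", and it assumes the buffer is 4-byte aligned and that the count is given in 32-bit words, not bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): fill `count` 32-bit words
 * starting at `dest` with `value`. Assumes `dest` is 4-byte aligned. */
static void fill_words32(void *dest, uint32_t value, size_t count)
{
    uint32_t *p = (uint32_t *)dest;   /* treat the buffer as an array of words */
    for (size_t i = 0; i < count; ++i)
        p[i] = value;                 /* one 32-bit store per iteration */
}
```

Calling fill_words32(surface, 0, width * height) would clear a 32bpp buffer of width * height pixels, where surface, width and height are placeholders for your own data.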

===============================================
Hurry up madness, hurry up disease,
hurry up insanity, hurry up please.
Hooray! I say, for the end of the world.
This is my signature. There are many like it, but this one is mine. My signature is my best friend. It is my life. I must master it as I must master my life. My signature, without me, is useless. Without my signature, I am useless.
I don't know assembly, but here's something I just found from a DJGPP page.


// Using STOSD / STOSB
memset( szBuffer, 0, sizeof(szBuffer) );
401101: MOV ECX,00000020
401106: XOR EAX,EAX
401108: LEA EDI,[EBP-80]
40110B: REP STOSD
I'm trying to fill 32bpp dd7 surfaces with a background color, and I shamelessly ripped the function Jumpster posted.
I think I'll mod it slightly to copy my audio buffers around as well. Both the dd7 surfaces & the audio buffers are DWORD aligned. The surfaces are automatically so, & a simple
dwBytes -= dwBytes%4; trims the sound buffer size down to a whole number of DWORDs

Thanks all!


Magmai Kai Holmlor
- The disgruntled & disillusioned
- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara
I really hope you're not using modulo - that's sooo slow. AND it with 3 instead (dwBytes & 3 gives the same remainder).
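
A small sketch of the equivalence, for unsigned byte counts only. The helper names are invented for the example. Worth noting that most optimizing compilers make this substitution themselves when the operand is unsigned and the divisor is a constant power of two, so the hand-written AND mainly matters for signed values or unoptimized builds.

```c
#include <stdint.h>

/* Hypothetical helpers (names are illustrative only). */

/* Remainder of a byte count modulo 4; same result as bytes % 4. */
static uint32_t remainder4(uint32_t bytes)
{
    return bytes & 3u;
}

/* Round a byte count down to a multiple of 4; same as bytes - bytes % 4. */
static uint32_t round_down4(uint32_t bytes)
{
    return bytes & ~3u;
}
```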



"NPCs will be inherited from the basic Entity class. They will be fully independent, and carry out their own lives oblivious to the world around them ... that is, until you set them on fire ..." -- Merrick

"It is far easier for a camel to pass through the eye of a needle if it first passes through a blender" -- Damocles
If you're just going to clear a dd7 surface with a background color, use the blitter ( dd7surf->Blt( bla, bla, DDBLT_COLORFILL, &ddfx_color ) ); it's highly optimized already. But if, like me, you have cast your surface to a dword array, I'd recommend writing your own MMX memset function. It will look like Jumpster's code above but will take advantage of the bigger and faster MMX registers mm0, mm1... Look at the OpenPTC or TinyPTC source and you will get a lot of hints (www.gaffer.org/ptc/).
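
To show the idea of writing two pixels per store without tying it to a particular compiler's MMX syntax, here is a portable stand-in. It is not MMX code: it uses an ordinary 64-bit integer where an MMX version would use mm0 and MOVQ, and the function name is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an MMX fill: writes two 32-bit pixels per
 * 64-bit store, then handles an odd trailing pixel separately. */
static void fill_pixels_wide(uint32_t *dest, uint32_t color, size_t pixels)
{
    /* Both halves hold the same color, so byte order does not matter. */
    uint64_t pair = ((uint64_t)color << 32) | color;
    size_t i = 0;

    for (; i + 2 <= pixels; i += 2)
        memcpy(&dest[i], &pair, sizeof pair);  /* usually compiled to one 8-byte store */

    if (i < pixels)
        dest[i] = color;                       /* odd pixel left over */
}
```

Whether the wider store is actually faster depends on the memory bus more than on the register width, so it is worth timing against plain memset before keeping it.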

/maq
quote:
Original post by Magmai Kai Holmlor

I'm trying to fill 32bpp dd7 surfaces with a background color


Tried using a straight colorfill? I think it's supposed to be faster.
Here you go Magmai; a fast MemCopy function. This function will be moderately slower than the MemSet32Bit() function because, unlike MemSet32Bit(), it handles the leftover bytes when the size is not a multiple of four. Therefore, the "int size" this time is the number of bytes to copy and not the number of dwords... Plus, since we're copying from memory to memory, we have a bus limitation to be concerned with, but fortunately most chipsets can overcome the bulk of this problem with "bursts"...

You can add the "if (size > 0)" check yourself.


__inline void MemCopy32Bit( void* dest, void* source, int size )
{
  __asm
  {
    // Preliminary setup stuffs...
    pusha;             // Push all registers onto the stack...
    mov  esi, source;
    mov  edi, dest;
    mov  ebx, size;

    // How many Alignment bytes do we need...?
    mov  ecx, ebx;
    and  ecx, 3;       // Same as ecx % 4...
    test ecx, ecx;     // Is an alignment required...?
    jz   move_dwords;  // No. An alignment is not required...

    // Let's begin moving the alignment bytes...
    sub  ebx, ecx;     // Subtract the alignment bytes...
    rep  movsb;        // Move 1-byte at a time ...

  move_dwords:
    mov  ecx, ebx;     // Reset our DWORD counter...
    shr  ecx, 2;       // Convert from number of bytes to number of DWORDS...
    rep  movsd;        // Move 4-bytes at a time to destination...

    popa;              // Restore all registers from the stack...
  }
}



Regards,
Jumpster

Edit: Often times I forget the difference in jz and jnz instructions. I don't know why but I can never remember when to use what... - Anyway, if you find that ECX == 0 after the AND ECX, 3 - and the JZ drops you to the SUB EBX, ECX command, just change the JZ to JNZ to correct the problem. I just wrote this on the fly for you so I haven't tested it any. If you have any other problems, let me know and I will correct them.




Edited by - Jumpster on February 13, 2001 11:55:11 AM
Regards,
Jumpster
Semper Fi
I tried to make a faster version of memset a few weeks ago, but I didn't manage to get it any faster than the one VC++ supplies.
I even tried moving 8 bytes at a time (instead of four) with MOVQ, without any improvement.

Has anyone managed to make a memset that uses MMX efficiently?


------------------------------
- Derwiath -
Sorry folks. The compiler didn't like the code above because of the int size parameter name. It caused some weird compiler errors on the mov ecx,ebx instructions. Here's an update of the code, slightly modified for improvements...

__inline void MemCopy32Bit( void* dest, void* source, int count )
{
  __asm
  {
    // Preliminary setup stuffs...
    mov  esi, source;
    mov  edi, dest;
    mov  ecx, count;

    // Let's begin moving the DWORD bytes...
    shr  ecx, 2;       // Convert from number of bytes to number of DWORDS...
    rep  movsd;        // Move 4-bytes at a time to destination...

    // How many Alignment bytes do we need...?
    mov  ecx, count;
    and  ecx, 3;       // Same as ecx % 4...
    test ecx, ecx;     // Is an alignment required...?
    jz   mov_done;     // No. An alignment is not required...
    rep  movsb;        // Move 1-byte at a time to destination...
  mov_done:
  }
}
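
For anyone who cannot use inline assembly, the same two-phase idea (whole DWORDs first, then the zero to three leftover bytes) can be sketched in plain C. This is an illustration with an invented name, not a drop-in replacement, and there is no claim that it matches the speed of the REP MOVSD version.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable counterpart to the routine above:
 * copy count / 4 DWORDs, then the remaining count % 4 bytes. */
static void copy_dwords_then_bytes(void *dest, const void *source, size_t count)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)source;
    size_t dwords = count >> 2;   /* number of whole 4-byte words */
    size_t tail = count & 3;      /* leftover bytes, same as count % 4 */

    while (dwords--) {
        uint32_t word;
        memcpy(&word, s, sizeof word);   /* 4-byte load without alignment assumptions */
        memcpy(d, &word, sizeof word);   /* 4-byte store */
        s += 4;
        d += 4;
    }
    while (tail--)
        *d++ = *s++;                     /* one byte at a time */
}
```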



Notice the pushad/popad instructions are missing? As it turned out, that code was being included twice, so I just removed mine... Don't know why, but oh well...

Anyway, here are the benchmarks compared to memcpy().

MemCopy32Bit() -> Release = 64.3MB p/s -> Debug = 63.7MB p/s
memcpy() -> Release = 61.2MB p/s -> Debug = 59.0MB p/s

That's what, between 5% and 7% difference? It may not be worth the trouble once you consider portability issues... but here it is for your review...
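
If you want to repeat the comparison on your own machine, a timing loop along these lines would do. This is a rough sketch and not the harness used for the numbers above: the function name, the use of clock(), and the definition of a megabyte as 1024 * 1024 bytes are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any routine with memcpy's signature can be timed. */
typedef void *(*copy_fn)(void *dest, const void *src, size_t count);

/* Hypothetical helper: copy `bytes` bytes `passes` times and return the
 * throughput in MB per second, or 0.0 if the run was too short to measure. */
static double copy_mb_per_sec(copy_fn fn, void *dest, const void *src,
                              size_t bytes, int passes)
{
    clock_t start = clock();
    for (int i = 0; i < passes; ++i)
        fn(dest, src, bytes);
    clock_t ticks = clock() - start;

    if (ticks <= 0)
        return 0.0;  /* below the timer's resolution */

    double seconds = (double)ticks / CLOCKS_PER_SEC;
    return ((double)bytes * passes) / (1024.0 * 1024.0) / seconds;
}
```

Use a buffer much larger than the CPU cache and enough passes to run for a second or more, otherwise the result mostly reflects cache speed and timer granularity.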

Regards,
Jumpster
I don't know about you guys, but have you taken a look at the assembly listings?

For the memsets that I've used, MSVC seems to have changed them to what Jumpster wrote. This is in the Standard ed, so I'm not sure if the Pro version will be better.
==========================================
In a team, you either lead, follow or GET OUT OF THE WAY.

This topic is closed to new replies.
