
a memset that doesn't suck

Started by February 10, 2001 03:58 PM
39 comments, last by Shannon Barber 23 years, 11 months ago
The bus is only 64 bits wide, but that isn't the only consideration. Kingston has an explanation of how memory works. If you are interested in understanding why reading and then writing 128 bits might be faster than using 64 bits, in spite of a 64-bit bus, I would suggest reading it. Intel also has a write-up on how cache works.
Keys to success: Ability, ambition and opportunity.
quote:
Original post by LilBudyWizer

The bus is only 64 bits wide, but that isn't the only consideration. Kingston has an explanation of how memory works. If you are interested in understanding why reading and then writing 128 bits might be faster than using 64 bits, in spite of a 64-bit bus, I would suggest reading it. Intel also has a write-up on how cache works.



I've tried www.intel.com and I've downloaded the 3 volumes of the Intel Software Developer's Manual. You mean I've got to go through >2000 pages of info just to know how cache works on the P6 line of chips? *throws head back in horror*.

Anyway, I've tried looking up www.kingston.com but the site seems to be down or non-existent. Guess I'll have to drop by the library and look through the books on computer architecture.
==========================================In a team, you either lead, follow or GET OUT OF THE WAY.
Sorry, I should have posted the links. Kingston - Ultimate Memory Guide and Intel - An Overview of Cache. The cache paper is a bit light on details, but it is a good starting point.
Keys to success: Ability, ambition and opportunity.

Hi
Bus width is another story; for cache optimizations it is better
to work on 16-byte chunks of data.
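
As a rough illustration of the 16-byte-chunk idea, here is a minimal C sketch of a fill routine that first aligns the destination and then writes one 16-byte chunk per iteration. The name fill16 and its signature are hypothetical and do not come from this thread, and whether the fixed-size memcpy really becomes a single wide store depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the thread): fill n bytes at dst with value,
   writing 16 bytes per iteration once the pointer is 16-byte aligned. */
void fill16(void *dst, unsigned char value, size_t n)
{
    unsigned char *p = dst;
    unsigned char pattern[16];

    assert(dst != NULL || n == 0);
    memset(pattern, value, sizeof pattern);   /* one 16-byte chunk of the fill value */

    /* Head: single bytes until p sits on a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = value;
        n--;
    }

    /* Body: one aligned 16-byte chunk per iteration. A fixed-size memcpy
       is usually lowered to wide stores, but that is up to the compiler. */
    while (n >= 16) {
        memcpy(p, pattern, 16);
        p += 16;
        n -= 16;
    }

    /* Tail: whatever is left over (fewer than 16 bytes). */
    while (n > 0) {
        *p++ = value;
        n--;
    }
}
```

Calling fill16(buffer, 0, size) would behave like memset(buffer, 0, size); the only point of the sketch is the head/body/tail split around the 16-byte boundary.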

--MFC (The Matrix Foundation Crew)
Hehe,
thanks for the info, guys. I've downloaded the PDFs and I'm going to read them soon, when I've got the time.

With regard to the Intel Software Developer's Manual that I downloaded before, I've gone through the first 60 or so pages already.
==========================================In a team, you either lead, follow or GET OUT OF THE WAY.
Some people have suggested using MMX to move 8 bytes at a time. Well, you cannot do that. MMX is not for dealing with 64-bit data, it is for dealing with multiple 8-, 16-, or 32-bit data. Pentiums are not 64-bit processors, and MMX does not emulate 64-bit functionality.

MMX is basically only useful in loops that do many operations per iteration. For example, when drawing pixels, you could put the data for the first pixel (such as the U, V, X, and Y coordinates, and the gouraud-shading color values) in the lower DWORDs of the MMX registers, and the second pixel's data in the upper DWORDs. Then you do whatever you need to do with those registers, and thus process two pixels simultaneously.

[edit] The slowdown is in the loading and unloading of data to or from the MMX registers. You can only load or unload data in the U-pipe, and to put data in the upper DWORDs of MMX registers, you need to first move it to the lower part, then shift it. Also, non-MMX instructions cannot pair with MMX instructions that handle memory or non-MMX registers. In addition, you can't say movq blah, [mm0]; memory indices must be in 32-bit form, not 64-bit. But although MMX isn't good for many things, it is great for a few. Just not moving data =)

To explain it better, here's MMX code to move a QWORD at a time, which is 5 clocks per loop:

mov eax, Dest
mov ebx, Src
mov ecx, Size
do_loop:
CLOCK1.U: movq mm3, QWORD PTR [ebx]
CLOCK2.U: movd DWORD PTR [eax], mm3 ; movd stores the low DWORD, so it goes first
CLOCK2.V: psrlq mm3, 32
CLOCK3.U: movd DWORD PTR [eax+4], mm3 ; the high DWORD, now shifted down
CLOCK4.U: add eax, 8
CLOCK4.V: add ebx, 8
CLOCK5.U: dec ecx
CLOCK5.V: jnz do_loop

And the faster (4 clocks per loop) non-MMX version:

mov eax, Dest
mov ebx, Src
mov ecx, Size
sub ebx, 4
do_loop:
CLOCK1.U: mov edx, DWORD PTR [ebx+4]
CLOCK1.V: add ebx, 8
CLOCK2.U: mov DWORD PTR [eax], edx
CLOCK2.V: mov edx, DWORD PTR [ebx]
CLOCK3.U: mov DWORD PTR [eax+4], edx
CLOCK3.V: add eax, 8 ; Clock3.U reads eax, it doesn't write it
CLOCK4.U: dec ecx
CLOCK4.V: jnz do_loop

They both do the same number of loop iterations, but the non-MMX version is faster. It's only faster by one clock per loop, but it is faster.
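
For readers who don't follow the assembly, here is a minimal C sketch of what the non-MMX loop does: two 32-bit loads and two 32-bit stores per iteration. The function name copy_dword_pairs is hypothetical, and it assumes that Size in the listings counts 8-byte blocks, which the post does not state explicitly. A compiler is free to schedule or vectorize this differently, so it says nothing about the clock counts above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C equivalent of the non-MMX loop: copy `qwords` 8-byte
   blocks as two 32-bit moves each. Assumes Size in the assembly is a
   count of 8-byte blocks, and that src and dst do not overlap. */
void copy_dword_pairs(uint32_t *dst, const uint32_t *src, size_t qwords)
{
    assert((dst != NULL && src != NULL) || qwords == 0);

    while (qwords-- > 0) {
        uint32_t lo = src[0];   /* mov edx, DWORD PTR [ebx+4] (ebx starts at Src-4) */
        uint32_t hi = src[1];   /* mov edx, DWORD PTR [ebx] after add ebx, 8 */
        dst[0] = lo;            /* mov DWORD PTR [eax], edx */
        dst[1] = hi;            /* mov DWORD PTR [eax+4], edx */
        src += 2;
        dst += 2;               /* add eax, 8 */
    }
}
```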

Also, Visual C++ DOES use rep movsd. When you have two objects, and you set one to be equal to another, the compiler copies their data using it. It's only the standard library functions that rely on the byte version rep movsb.

~CGameProgrammer( );



Edited by - CGameProgrammer on February 20, 2001 11:51:10 AM

I was wondering what kind of speed anyone has gotten on a copy. I played around a little, and with rep movsd about 127 MB/s was the best I got. Using movaps to load all eight xmm registers and then movntps to store them got me up to around 160 MB/s. That is on a 450 MHz Pentium III with PC100 memory.
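
The post doesn't say how these figures were measured, so the following is only a sketch of one way to get a comparable MB/s number in C. The function copy_throughput_mbps is hypothetical; it times the standard memcpy with clock(), which reports CPU time at a fairly coarse resolution, so the buffer and repeat count need to be large enough for the loop to run for a noticeable fraction of a second.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copy `bytes` from src to dst `repeats` times and
   return the throughput in MB/s (1 MB = 1,000,000 bytes here). Returns
   0.0 if the elapsed time was too short for clock() to register. */
double copy_throughput_mbps(void *dst, const void *src, size_t bytes, int repeats)
{
    clock_t start, end;
    double seconds;
    int i;

    assert(dst != NULL && src != NULL && bytes > 0 && repeats > 0);

    start = clock();
    for (i = 0; i < repeats; i++) {
        /* An aggressive optimizer may merge identical copies; check the
           generated code before trusting the number. */
        memcpy(dst, src, bytes);
    }
    end = clock();

    seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0) {
        return 0.0;
    }
    return ((double)bytes * (double)repeats) / seconds / 1e6;
}
```

To measure main-memory speed rather than cache speed, the buffers should be much larger than the processor's caches; swapping memcpy for a hand-written routine lets the same harness compare the two.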

Just as a separate note, someone said above that the cache line is 16 bytes. According to the system programming guide for the Pentium, the lines are 32 bytes.
Keys to success: Ability, ambition and opportunity.
How do you determine which pipeline a command is issued on, and how do you determine exactly what instruction starts at what clock tick?
- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara
Rules. Agner Fog published a large document on writing assembly for Pentiums I, II, and III, and the Intel documents are also good (especially for MMX). To write optimized code, you should know about pairing and pipelining to get an idea of how your code is being executed (assembly code, not C++).

~CGameProgrammer( );

Regrettably, Intel doesn't put all the information from previous versions of their manuals in current versions. Chapter 3.6 of this version of the Intel Architecture Reference Manual explains pairing rules for integer instructions. To find it from the Intel main page you have to select Developer/Intel Developer Services, then Products/Intel Architecture/IA-32 Processors, then Pentium Processor from near the bottom of the processor list. Finally, select manuals in the bottom left. As this demonstrates, it is sometimes useful to look at older versions of the manuals.
Keys to success: Ability, ambition and opportunity.

