Advertisement

Optimization...

Started by December 01, 2000 01:17 PM
5 comments, last by biskit 24 years, 1 month ago
I know this has nothing to do with the forum''s topic, but I was hoping to get some help in here seeing as I got little to none in Graphics Programming and Theory. Here is my problem. I am implementing my own blitter, and need to optimize it. No alpha blending in this code, or support for color keys. I have already implemented versions of my code with both of these features, but I want to optimize this thing in pieces. If anyone could give me some pointers on how to speed it up(I''ll rewrite it in asm later), I would be greatly appreciative. Here is the code:
  
int EXPORT CSurface::HandDrawSurface(CSurface *destSurface, int x, int y)
{
	DWORD	*pSrc,
			*pDest;
	long	srcPitch,
			destPitch;
	RECT	src;
	DWORD	dwTemp;

	if (!destSurface)
		return NULL;

	x = x>>1;	// divide by 2

	y = y>>1;	// divide by 2


	src.left	= src.top = 0;
	src.right	= GetSurfaceWidth()>>1;
	src.bottom	= GetSurfaceHeight();

	LockSurface();
	destSurface->LockSurface();
	
	pSrc		= (DWORD*)GetSurfacePtr();
	pDest		= (DWORD*)destSurface->GetSurfacePtr();
	srcPitch	= (DWORD)GetSurfacePitch()>>2;
	destPitch	= (DWORD)destSurface->GetSurfacePitch()>>2;

	
	pDest += (y*destPitch) + x;


	for (int y1=0;y1<src.bottom;y1++)
	{
		for (int x1=0;x1<src.right;x1++)
		{
			*pDest = *pSrc;

			pSrc++;
			pDest++;
		}
		pDest	+= (destPitch-src.right);
		pSrc	+= (srcPitch-src.right);
	}


	destSurface->UnlockSurface();
	UnlockSurface();

	return TRUE;
}
  
I looked at the blitter in the CDX library and noticed that they used do-while loops to copy the pixels. Do the do-while loops require less overhead thus making them faster?
----------------------------------------Implementation is everything. Period.
Use memcpy for every row
Advertisement
quote: Original post by biskit

Here is my problem. I am implementing my own blitter, and need to optimize it.


It looks pretty good to me already. Just a few ideas, most of which will probably make next to no difference whatsoever.

x = x>>1;	// divide by 2y = y>>1;	// divide by 2  


Try 'x >>= 1' and 'y >>= 1' instead. There's a chance that the compiler doesn't optimise out the temporary variable and the accompanying assignment. The >>= operation should equate directly to a single shift operation in asm.

LockSurface();destSurface->LockSurface();	pSrc		= (DWORD*)GetSurfacePtr();pDest		= (DWORD*)destSurface->GetSurfacePtr();srcPitch	= (DWORD)GetSurfacePitch()>>2;destPitch	= (DWORD)destSurface->GetSurfacePitch()>>2;  


Ok, this will probably make zero difference, but swap the order of those instructions to look like this:
LockSurface();pSrc		= (DWORD*)GetSurfacePtr();srcPitch	= (DWORD)GetSurfacePitch()>>2;destSurface->LockSurface();pDest		= (DWORD*)destSurface->GetSurfacePtr();destPitch	= (DWORD)destSurface->GetSurfacePitch()>>2;  


The only difference here (unless I misunderstand your code) is that, instead of alternating between modifying this surface and the other surface, we work with this surface first for 3 operations, then work on the other surface. These 2 surfaces might be quite some distance apart in memory and therefore working on just 1 at a time -might- mean it makes better use of any cache memory on the CPU and/or video card. Total speculation though.

pDest	+= (destPitch-src.right);pSrc	+= (srcPitch-src.right);  

You don't need to recalculate the difference between the pitch and the right-hand side of the src every time through the y-loop, do you? How about putting this above both loops:

destDiff = destPitch-src.right;
srcDiff = srcPitch-src.right;

...and change the quoted lines above to...
pDest += destDiff;
pSrc += srcDiff;

And, as with any performance optimisation, be sure to test these to make sure it actually make a difference

quote:
I looked at the blitter in the CDX library and noticed that they used do-while loops to copy the pixels. Do the do-while loops require less overhead thus making them faster?


If I remember correctly, Do...while loops are only more efficient than For or While loops in that For and While loops have to jump back up to the top to evaluate their condition. So, when a For loop has just performed its last operation, it jumps to the top, evaluates the condition, then has to jump to the end again. A Do...While loop has the condition at the bottom, and so will not have to do those extra 2 jumps. ASM experts, feel free to correct me here. Also note that a compiler might optimise things in ways you didn't expect, making any such assumptions of code layout potentially void.

I hope some of that is of use to you.



Edited by - Kylotan on December 4, 2000 12:53:51 PM
If you decide to rewrite is asm make sure to use rep movst or somethig like that then you only set es and ds once and subsequently only set si and di once for every scanline part.

This will be the fastest I think you can get it....

EOT MR Master
EOT MR Master
Thanks for the help guys! Kylotan, i don''t believe that using the compound shifting operator is going to help(??), seeing as the shift instruction puts the value in eax, then it has to be moved into x, either way. However, I see your point about computing the differences one time, that should have slapped me in the face. As for the order of locking surfaces and such, the way my classes work, when a surface is locked, the pointer to it''s surface is stored internally, so the GetSurfacePointer() is an inline method that just returns that void* to the surface. I don''t know, maybe I''m missing your point. And I see your point about using do..while loops, thanks. And thanks for the advice MR_MASTER. I''m not really sure whether I''m going to rewrite it in asm or not. I got my alpha blender working really well, so I''m not sure. Again, let me thank you guys for the help. This is a lot more help than I got in the Graphics Programming and Theory forum. Again, thank you!



----------------------------------------
Implementation is everything. Period.
Particle Toast
----------------------------------------Implementation is everything. Period.
quote: Original post by biskit
As for the order of locking surfaces and such, the way my classes work, when a surface is locked, the pointer to it''s surface is stored internally, so the GetSurfacePointer() is an inline method that just returns that void* to the surface. I don''t know, maybe I''m missing your point.


My point is that you are using 2 instances of your CSurface class, right? That means you are accessing 2 different areas in memory when you need to alter/read any of the data members in those CSurface instances. Those areas of memory could be quite far apart, which can cause the cache to have to keep reloading areas of memory for each instruction. So, assuming CSurface 1 is at 0x100000000 and CSurface2 is at 0x80000000, instead of doing this:

read address 0x10000000
read address 0x80000000 // possibly requiring a cache refresh
read address 0x10000000 // possibly requiring a cache refresh
read address 0x80000000 // possibly requiring a cache refresh
read address 0x10000000 // possibly requiring a cache refresh
read address 0x80000000 // possibly requiring a cache refresh

... where every jump of 0x70000000 may require the cache to be refreshed (sorry guys, I don''t know the -real- term for that...a page fault, maybe? *shrugs*), you can reorder this to:

read address 0x10000000
read address 0x10000000
read address 0x10000000
read address 0x80000000 // possibly requiring a cache refresh
read address 0x80000000
read address 0x80000000

The difference is that you have cut out the "jumps" from one area of memory to another, from 5 down to 1.

I don''t truly believe it''ll make any noticeable difference, but theoretically, it could. I think the technical term for this is ''locality of reference''.

quote: Again, let me thank you guys for the help. This is a lot more help than I got in the Graphics Programming and Theory forum. Again, thank you!


No problem. That''s what we''re here for

Advertisement
This is more a less a straight conversion of your inner loop code. Assuming everthing
is four bytes and you are writing for a 32 bit processor that is.

If not then movsb for bytes,movsw for words.

// make sure direction flag is forward

cld
mov esi,pSrc
mov edi,pDest

mov eax,num_rows
mov ebx,dest_diff
mov ecx,num_cols
mov edx,src_diff

row_lop:
// preserve col wid
push ecx

col_lop:
// *pdest=*psrc

// movsd increment edi=destination,esi=source automatically
movsd
loopnz col_lop
pop ecx

// add differences for next row
add edi,ebx
add esi,edx
// decrement row counter
dec eax
jnz row_lop


for (int y1=0;y1 {
for (int x1=0;x1 {
*pDest = *pSrc;

pSrc++;
pDest++;
}
pDest += (destPitch-src.right);
pSrc += (srcPitch-src.right);
}


However it can be made faster if you want to blit the entire column in one go.

cld
mov esi,pSrc
mov edi,pDest

mov eax,num_rows
mov ebx,dest_diff
mov ecx,num_cols
mov edx,src_diff

row_lop:
push ecx
rep movsd
pop ecx
// add differences for next row
add edi,ebx
add esi,edx
// decrement row counter
dec eax
jnz row_lop

I am not even saying this is the fastest way. However it does depend on other factors
whether you are clipping and so on. None of this code currently does that. I suggest you
detect when an object needs clipping and send those objects through a slower clipper
version.

Remember to try and concentrate on those inner loops. They are the ones that kill ya




This topic is closed to new replies.

Advertisement