Actually, after a second read your response made perfect sense!

You definitely have to keep in mind that hardware operations are going to be MUCH faster in most cases than CPU operations because of proximity to the actual video memory. However, there are a few considerations you must keep in mind. Roughly, video->video blits are the fastest, then come system->system blits, and last come system->video blits (leaving out AGP stuff). So if your game wants to display 1,000 different sprites, you have two options:
1) Leave the sprites in system memory, use a backbuffer in system memory, create your frame, then with one mighty blit copy it to the video card's backbuffer and Flip(). This is ideal when you can't/won't place your sprites in video memory.
2) Put your sprites in video memory, use the hardware blitter to blit to the video backbuffer, then Flip().
All of this makes little difference, however, if you are only blitting maybe 30-50 times a frame, or if you have a P2 350+!
Also, large blit counts may make you want to simply Lock() the backbuffer once and memcpy() or use a custom software blitter to copy the data in. This is because every blit in DirectX requires a Lock()/Unlock() sequence (hidden from the programmer) that locks the memory completely using a Win16 mutex (I believe).
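To give a rough idea of what the memcpy() route looks like, here is a minimal sketch of a row-by-row copy. It is plain C with no DirectDraw calls in it: copy_rect is a made-up helper name, and I am assuming you already have a destination pointer and pitch (in DirectDraw those would come from the lpSurface and lPitch fields of the surface description that Lock() fills in). The main point is that the pitch can be larger than the visible row width, so you cannot copy the whole rectangle with a single memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of DirectX): copies a rectangle of pixels
   one row at a time. Pitches are in bytes and may be larger than the
   number of bytes actually copied per row. */
void copy_rect(unsigned char *dst, size_t dst_pitch,
               const unsigned char *src, size_t src_pitch,
               size_t width_bytes, size_t rows)
{
    size_t y;

    assert(dst != NULL && src != NULL);
    assert(width_bytes <= dst_pitch && width_bytes <= src_pitch);

    for (y = 0; y < rows; ++y) {
        /* Step by pitch, not by width, so any row padding is skipped. */
        memcpy(dst + y * dst_pitch, src + y * src_pitch, width_bytes);
    }
}
```

With a locked backbuffer you would call this once per sprite, passing the locked pointer offset to the sprite's position, and then Unlock() a single time after all the sprites are drawn. No clipping or color keying is done here, so treat it as a starting point only.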
Anyway, have fun with that. I hope I was helpful. This week has kinda been my introduction to programming forums, so my teaching skills may not be up to par yet.
- Splat