ASM Optimisation

Started by March 13, 2000 07:05 AM
3 comments, last by GEo 24 years, 6 months ago
Hi,

Last Thursday I was at a friend's house, and we were discussing how to improve a program he wrote a while back. The program played a CD and did a load of pretty stuff on the screen (being vague to keep this post short). We decided that before anything more could be added, the existing code would have to be optimised. One of the most demanding routines was the blur effect, which (as I'm sure you already know) makes each pixel equal to the average of its surrounding pixels.

We wrote a short program to profile the code, executing the routine 1000 times and timing how long it took in timer ticks (1/18.2 of a second each). The original code took 147 ticks; after a few more optimisations (including replacing SUBs with ADDs) we got it down to 131, and were quite impressed. Then I tried replacing lines such as:

ADD di, xxx
Mov es:[di], yyy

with:

Mov es:[di+xxx], yyy

I wasn't expecting this to make much (if any) difference, but the profile now returned 60!!! That's roughly a 60% reduction in run time compared with the original code!! We ran the profile a couple of times, and checked it for bugs etc. Why the hell is this so much faster?

PS: If this thread is still active tomorrow, I will post the actual code (it is short, I just don't have it with me now). I'm going home now, so I won't read any replies until tomorrow.

George.

"Who says computer games affect kids, imagine if PacMan affected us as kids, we'd all sit around in a darkened room munching pills and listening to repetitive music....uh oh!"
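For anyone who wants to see what the blur does without reading assembly, here is a minimal Python sketch. It assumes a 320x200 buffer with one byte per pixel and a four-neighbour average, which is what the code posted further down this thread suggests. The names `WIDTH`, `HEIGHT` and `blur_in_place` are made up for illustration, and unlike the posted loop this sketch skips the last row so it never reads past the end of the buffer.

```python
WIDTH, HEIGHT = 320, 200  # assumed screen size, one byte per pixel


def blur_in_place(screen: bytearray) -> None:
    """Replace each pixel in rows 1..HEIGHT-2 with the average of its
    four neighbours, reading and writing the same buffer."""
    for i in range(WIDTH, WIDTH * (HEIGHT - 1)):
        total = (screen[i - WIDTH]      # pixel above
                 + screen[i - 1]        # pixel to the left
                 + screen[i + 1]        # pixel to the right
                 + screen[i + WIDTH])   # pixel below
        screen[i] = total >> 2          # divide by 4, like shr ax, 2


screen = bytearray(WIDTH * HEIGHT)
centre = 100 * WIDTH + 160
screen[centre] = 200                    # a single bright pixel
blur_in_place(screen)
print(screen[centre - WIDTH], screen[centre])
```

Because the buffer is updated in place, pixels that were already processed feed into later averages, so the bright pixel smears down and to the right a little more than a two-buffer blur would.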
Hi there,

I just asked a colleague of mine (Jacco Bikker, a.k.a. the Phantom) and he said the following.

Using:

ADD di, xxx
Mov es:[di], yyy

as opposed to:

Mov es:[di+xxx], yyy

has two drawbacks.

1. In the single-instruction version you get the addition di+xxx for free. Address calculations like these (standard base and index calculations) cost nothing extra, because they are done in parallel in the pipeline.
2. The second drawback is that in the following code:

ADD di, xxx
Mov es:[di], yyy

the processor stalls after the ADD instruction, because the second instruction needs the result of the first one to form its address. This causes an AGI stall (address generation interlock), because the two instructions cannot be executed in the usual pipelined way (where the next instruction starts executing before the previous one has finished).

Thanks for posting this question, because I learned something from it too!!

Jaap Suter
____________________________
Mmmm, I'll have to think of one.
s98.. is right, but if this code is running on a Pentium, then the real reason for the speedup could be much more complicated, because of pairing, shadowing and caching.
Optimizing for the Pentium is a whole science; natural-seeming things like lookup tables and unrolled loops that worked great on a 386 can have severe negative effects on a Pentium.
One way to go about optimizing for it is to test, profile, make a change, then test and profile again ...
For profiling you can use the RDTSC instruction, which returns the processor's time-stamp counter (a running count of clock cycles).
Another way is to dig into the manuals and spend ten minutes on every instruction, calculating cycles and working out the best pairing combinations ...
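
The test, profile, change, profile again loop can be sketched in a few lines. This is only an illustration in Python, assuming `time.perf_counter` as a stand-in for RDTSC (which is not available there); `profile`, `version_a` and `version_b` are hypothetical names, and the two versions are placeholders for a routine before and after a change.

```python
import time


def profile(routine, runs=1000, repeats=3):
    """Time `runs` calls of `routine` and return the best total in seconds.

    Taking the best of several repeats reduces noise from whatever else
    the machine is doing.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(runs):
            routine()
        best = min(best, time.perf_counter() - start)
    return best


def version_a():
    return sum(range(100))        # placeholder: routine before a change


def version_b():
    return 99 * 100 // 2          # placeholder: routine after a change


for name, fn in (("before", version_a), ("after", version_b)):
    print(f"{name}: {profile(fn):.6f} s for 1000 runs")
```

Whatever timer is used, the point is to change one thing at a time and to check that both versions still produce the same result before comparing their timings.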

Go to http://www.nightflight.com/~pcg/docs.html for some starters.

-kertropp

Edited by - kertropp on 3/14/00 3:18:30 AM
-kertropp C:Projectsrg_clueph_opt.c(185) : error C3142: 'PushAll' :bad ideaC:Projectsrg_clueph_opt.c(207) : error C324: 'TryCnt': missing point
Thanks tonnes for your feedback people,

In case you're interested, the next message will contain the original and optimised code (wait a few minutes for me to sort that out).

In reply to Kertropp: I have very little real experience with ASM, and I know absolutely nothing about Pentium ASM (although I probably have the info lying around at home somewhere). I only really know the more frequently used 8086 instructions, which I use for the occasional optimisation of demanding routines.
However, I looked briefly at the website you recommended and it's definitely getting bookmarked!

cheers everyone.

George.
<<<-The Original Code->>>

mov es, ax
mov di, 320

mov cx, 63680
@1:

xor ax, ax
xor bx, bx

sub di, 320
mov bl, [es:di]
add ax, bx

add di, 319
mov bl, [es:di]
add ax, bx

add di, 2
mov bl, [es:di]
add ax, bx

add di, 319
mov bl, [es:di]
add ax, bx

shr ax, 2

sub di, 320
mov [es:di], al

inc di

loop @1

<<<-END->>>

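I optimised this by replacing the SUBs with ADDs, and then removed the ADDs altogether by folding each offset into the MOV's address as a displacement. BH must be zero for the add ax, bx sums to come out right, so bx is cleared once before the loop. The code looked like this:

<<<-OPTIMISED CODE->>>

mov es, ax
mov di, 320

xor bx, bx
mov cx, 63680
@1: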

xor ax, ax

mov bl, [es:di+65216]
add ax, bx

mov bl, [es:di+65535]
add ax, bx

mov bl, [es:di+1]
add ax, bx

mov bl, [es:di+320]
add ax, bx

shr ax, 2

mov [es:di], al

inc di

loop @1

<<<-END->>>
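
The displacements 65216 and 65535 in the listing above look odd at first, but they are just -320 and -1 written as unsigned 16-bit numbers: DI is a 16-bit register, so the address wraps modulo 65536. A small Python check of that arithmetic follows (`MASK` and `effective_offset` are names invented for this sketch, and segment handling is ignored):

```python
MASK = 0xFFFF  # 16-bit offsets wrap modulo 65536


def effective_offset(di: int, disp: int) -> int:
    """Offset formed by [di+disp] with 16-bit wraparound."""
    return (di + disp) & MASK


for di in (320, 1000, 63999):
    assert effective_offset(di, 65216) == di - 320   # pixel above
    assert effective_offset(di, 65535) == di - 1     # pixel to the left
    assert effective_offset(di, 1) == di + 1         # pixel to the right
    assert effective_offset(di, 320) == di + 320     # pixel below

print(65536 - 320, 65536 - 1)
```

So the four loads read the same pixels as the original SUB/ADD sequence, without ever modifying DI between them.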

George.

Edited by - GEo on 3/14/00 5:28:23 AM

This topic is closed to new replies.
