
Assembly in MDS (to include in C/C++)

Started by May 04, 2000 04:01 PM
5 comments, last by baskuenen 24 years, 7 months ago
Hey fellow coders,

In a moment I'm starting optimization of my 3D engine. My current version is 100% C/C++, and I'm satisfied with the current algorithms. I've already created a custom build in MDS and am going to use the .obj created. This should all work fine.

The real questions are:
- Can I use my old TASM compiler? (MASM is not my favorite)
- Does anyone have experience with this? What's your advice?
- Can I use __fastcall?
- Can I use all registers?
- What's the state of the FPU, and does it change after function calls?
- May I change the state of the FPU? (In multitasked Windows)
- Or is it better to simply use inline asm?

Please help, any advice appreciated,
Bas.
> - Can I use my old TASM compiler? (MASM is not my favorite)
TASM shouldn't be a problem.. compile to COFF .obj and link them in.. I don't have experience in linking in separate .asm files using MSVC.. i used to do this back in the DOS days, but then i realized that the stack frames tend to defeat the purpose of using assembly, at least in my case... so i inline it instead. anyway..

> - Can I use __fastcall?
i *believe* (i could be wrong here) that __fastcall doesn't always result in the arguments getting passed in registers; it's merely a suggestion for the compiler. in watcom, you just defined an extern using #pragma aux and specified which registers to use for the parameters, but i'm not really sure how to force it in MSVC. if you do figure it out, by all means, let me know.

> - Can I use all registers?

you should be able to.

> - What's the state of the FPU, and does it change after function calls?

when inlining asm functions or using _asm within a function, the parameters are passed on the FPU stack starting at the head, i.e. the first floating parameter is in st(0). how it changes after the function call depends on your function.

> - May I change the state of the FPU? (In multitasked Windows)

yes.

> - Or is it better to simply use inline asm?

well, don't take this as gospel, but i prefer to use only inline asm. the reason for that is that the only time you really need hand optimized assembly is within extremely time critical loops, and if you have to actually perform function calls at that point, in my opinion, that defeats the purpose.

-goltrpoat



--
Float like a butterfly, bite like a crocodile.

Thanx for responding!

Ok, you've convinced me, inline is the way to go!
Then there's this problem I just can't figure out:

I want to unroll my inner loop (the draw-poly inner loop), and am planning to use something like this:

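// One entry per label, indexed by width: JumpTbl[w] has to point at jump<w>.
void *JumpTbl[321] =
{ jump0, jump1, jump2, ... , jump318, jump319, jump320
};

__asm {
mov ecx,iWidth
mov edi,pBuffer
mov eax,iValue
mov ebx,iValuePlus
jmp JumpTbl[ecx*4]   ; enter the unrolled run at label jump<iWidth>

jump320:
mov 320[edi], ah     ; store the current shade, then step it
add eax,ebx

jump319:
mov 319[edi], ah
add eax,ebx

jump318:
mov 318[edi], ah
add eax,ebx

...

jump1:
mov 1[edi], ah
add eax,ebx

jump0:
mov [edi], ah        ; last pixel, no step needed
}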


Ah well, looks nice, but how do I fill my jump table with these asm labels? Is void * OK?

Is there a setting for the MSVC compiler to go over the source a second time?

Thanks ahead,
Bas.

Edited by - baskuenen on 5/4/00 5:05:14 PM
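
One way to sidestep the label-address question is to let the compiler build the table. The sketch below is not the inline-asm approach from the post; it swaps in a plain C switch with fall-through, which compilers typically turn into a jump table for dense case values. The function name fill_span, the MAX_SPAN constant and the 8-step length are made up for illustration (the post uses 320), and the shade is taken from bits 8..15 of the value to mimic the store of ah.

```c
#include <stdint.h>

/* Hypothetical sketch: MAX_SPAN is 7 to keep the listing short;
   the thread's version would go up to 320. */
enum { MAX_SPAN = 7 };

/* Writes count + 1 bytes, from buffer[count] down to buffer[0].
   Each case plays the role of one asm label: store bits 8..15 of
   value (the "ah" of the thread), add the step, fall through. */
void fill_span(uint8_t *buffer, int count, uint32_t value, uint32_t step)
{
    switch (count) {
    case 7: buffer[7] = (uint8_t)(value >> 8); value += step; /* fall through */
    case 6: buffer[6] = (uint8_t)(value >> 8); value += step; /* fall through */
    case 5: buffer[5] = (uint8_t)(value >> 8); value += step; /* fall through */
    case 4: buffer[4] = (uint8_t)(value >> 8); value += step; /* fall through */
    case 3: buffer[3] = (uint8_t)(value >> 8); value += step; /* fall through */
    case 2: buffer[2] = (uint8_t)(value >> 8); value += step; /* fall through */
    case 1: buffer[1] = (uint8_t)(value >> 8); value += step; /* fall through */
    case 0: buffer[0] = (uint8_t)(value >> 8);
            break;
    default:
            /* count outside 0..MAX_SPAN: write nothing */
            break;
    }
}
```

Calling fill_span(buf, 3, 0x0100, 0x0100) would write 1, 2, 3, 4 into buf[3] down to buf[0] and leave the rest of the buffer alone. Whether the compiler's output matches hand-written asm is something to check in the generated listing rather than assume.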
You probably would say: "fill this table at runtime with a function defined after the variable declaration and function".

I know, but I'm a lazy and stubborn person.
Isn't there another way to use these labels as static data, instead of filling this table at runtime?

This problem will probably come back, if nobody has the answer.

Any MSVC gurus out there?
Please respond,

Bas.
How do you know that unrolling the loop will even help? Aren't you worried about wrecking the cache with a big table like that?
I know what you mean,

With the Pentium's dual pipelines, two instructions can pair and execute in a single clock cycle!

mov 320[edi],ah
add eax,ebx

Takes only one cycle to process.
If you put it in a loop you get something like this:

pixel_loop:
mov 320[edi],ah
add eax,ebx
dec ecx
jnz pixel_loop

and this will take 2 cycles per pixel.
So there's no question which one's faster. But you probably already knew this.


The unrolled code is small enough for the code cache to live with (screen width 320).

But you've got a good point about the jump table and the cache.
Don't take this the wrong way, but at the same time I'm writing to (and later on reading from) a huge screen buffer. Isn't this even worse for the data cache?

I hope I'm right. My assembly is probably getting a bit rusty. Maybe it's a good idea to look up the old optimization docs.

If you're interested: some time ago I received some official Intel Pentium optimization docs.
You can find them on my homepage in the resources (see link on top). Some files are duplicated and maybe it's a bit messy, but that's the way I received them.
Hope you like it.

Thanx for the response,
Bas.
oh geez.. you'll probably have to figure out the first offset manually and fill the table with offset+i*(whatever is the length of mov 320[edi],ah/add eax,ebx), can't really think of any other way to do it
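
The offset arithmetic described here can be sketched in plain C. This is only an illustration of the "first offset + i * length" idea, not working jump-table code: the name fill_jump_offsets and its parameters are hypothetical, and it assumes every unrolled mov/add pair encodes to the same number of bytes. That assumption needs checking against the assembler's listing, since an assembler may pick a shorter displacement encoding for small offsets.

```c
#include <stddef.h>

/* Hypothetical helper: table[w] receives the byte offset of the
   unrolled step for width w. The step for width 'top' is assumed to
   sit at first_offset, with each following step step_bytes further on
   (ASSUMPTION: all steps have the same encoded length). */
void fill_jump_offsets(size_t *table, size_t top,
                       size_t first_offset, size_t step_bytes)
{
    for (size_t w = 0; w <= top; ++w) {
        /* Steps are laid out from 'top' down to 0, so lower widths
           sit further from the start of the unrolled code. */
        table[w] = first_offset + (top - w) * step_bytes;
    }
}
```

For example, with top = 3, first_offset = 100 and step_bytes = 8, the table would hold 124, 116, 108, 100 for widths 0 through 3.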

-goltrpoat


Edited by - goltrpoat on 5/5/00 3:40:18 AM
--Float like a butterfly, bite like a crocodile.
