
Optimization ideas?

Started by October 29, 2000 08:07 PM
19 comments, last by tcs 24 years, 2 months ago
Ok, sorry to say this but you are completely wrong here.

quote: Original post by Jumpster

    // Old version...
    cTexel[0] = (unsigned char) (imageData[iIndex + 0] * fLowTexWeight);
    cTexel[1] = (unsigned char) (imageData[iIndex + 1] * fLowTexWeight);
    cTexel[2] = (unsigned char) (imageData[iIndex + 2] * fLowTexWeight);

    // New version...
    cTexel[0] = (unsigned char) (imageData[iIndex++] * fLowTexWeight);
    cTexel[1] = (unsigned char) (imageData[iIndex++] * fLowTexWeight);
    cTexel[2] = (unsigned char) (imageData[iIndex++] * fLowTexWeight);


I am not sure, but I believe this will also help you out a little bit. If I am not mistaken, the asm code for
iIndex+0 equates to something like:

    mov eax, [iIndex]
    add eax, 0
    ...
    mov eax, [iIndex]
    add eax, 1
    ...
    mov eax, [iIndex]
    add eax, 2
    ...


It seems to me, although I have not checked this out yet, that the iIndex++ would translate to:

    inc [iIndex]
    ...
    inc [iIndex]
    ...
    inc [iIndex]


and provide the same results. Again, I am not sure; this is just what I think would happen.

At any rate, it doesn't hurt to try it, right?


The first one will translate into statements of the type:

mov esi,imageData
mov edi,cTexel
mov eax,[esi]
mov ebx,[esi+1]
...
mov [edi],eax
mov [edi+1],ebx
...

That's kinda fast, and a good compiler will get it to fill both Pentium pipes (note: string instructions can be used to do this, but they are slower). Also, when moving bytes it will probably move them in 32-bit chunks if possible.

whereas the second variant would go something like:
mov esi,imageData
mov edi,cTexel
mov eax,[esi]
inc esi
mov [edi],eax
inc edi
...repeat...
Uh oh! This ain't good... added complexity, and it's harder to fill both pipes. Holy macaroni, this backfired.
Note though: the offset or index can sometimes add one cycle to the addressing count, which is the same cost as that of inc reg, but when you know the number of items being moved the first version is actually better.
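
For anyone who wants to check this instead of guessing, here is a minimal sketch of the two variants as standalone functions. The function names and the simplified signatures are made up for illustration; only the three assignments come from the thread. Both produce the same texel, so the only open question is which one your compiler schedules better, and that is best answered by timing both or reading the generated assembly.

```cpp
#include <cassert>
#include <cstddef>

// Variant A: fixed base index plus constant offsets (the "old version").
// The three loads are independent of each other.
void scaleTexelIndexed(const unsigned char* imageData, std::size_t iIndex,
                       float fWeight, unsigned char cTexel[3])
{
    assert(fWeight >= 0.0f && fWeight <= 1.0f); // keeps results within 0..255
    cTexel[0] = static_cast<unsigned char>(imageData[iIndex + 0] * fWeight);
    cTexel[1] = static_cast<unsigned char>(imageData[iIndex + 1] * fWeight);
    cTexel[2] = static_cast<unsigned char>(imageData[iIndex + 2] * fWeight);
}

// Variant B: post-increment (the "new version").
// Each load depends on the index update made by the previous statement.
void scaleTexelIncrement(const unsigned char* imageData, std::size_t iIndex,
                         float fWeight, unsigned char cTexel[3])
{
    assert(fWeight >= 0.0f && fWeight <= 1.0f);
    cTexel[0] = static_cast<unsigned char>(imageData[iIndex++] * fWeight);
    cTexel[1] = static_cast<unsigned char>(imageData[iIndex++] * fWeight);
    cTexel[2] = static_cast<unsigned char>(imageData[iIndex++] * fWeight);
}
```

Call both with the same image pointer, index and weight, then compare the two output texels before you compare timings.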

cheers!
HardDrop - hard link shell extension. "Tread softly because you tread on my dreams" - Yeats
quote: Original post by DigitalDelusion

Ok, sorry to say this but you are completely wrong here.



Uh... Ok. So I am wrong. It's a good thing I didn't *insist* that it would be faster. I just thought that is what would happen. Thanks for the clarification.

Regards,
Jumpster
Semper Fi
I won't guarantee it is right, but it is at least close.

    for (i = 0; i < iTexSize; i++)
    {
        fTexIndex = pHeight->m_Array[i][j] / 255.0f * iMaxTexture;
        iHighTex = (int) ceil(fTexIndex);
        iLowTex = (int) floor(fTexIndex);
        fLowTexWeight  = iHighTex  - fTexIndex;
        fHighTexWeight = fTexIndex - iLowTex;
        if (iHighTex > iMaxTexture - 1)
            iHighTex = iMaxTexture - 1;
        if (iLowTex > iMaxTexture - 1)
            iLowTex = iMaxTexture - 1;
        if (fHighTexWeight == 0.0f && fLowTexWeight == 0.0f)
        {
            fHighTexWeight = 0.5f;
            fLowTexWeight = 0.5f;
        }
        <TextureDataType> *ptTextureData = pTextureData + 3;
        for (j = 0; j < iTexSize; j++)
        {
            cTGATexture *ptLowTex  = cTGATexture + iLowTex,
                        *ptHighTex = cTGATexture + iHighTex;
            iWidth  = ptLowTex->GetImageWidth();
            iHeight = ptLowTex->GetImageHeight();
            iLowIndex = (int) ((j % iWidth) * iWidth + (i % iHeight)) << 1;
            iLowIndex += iLowIndex;
            iWidth  = ptHighTex->GetImageWidth();
            iHeight = ptHighTex->GetImageHeight();
            iHighIndex = (int) ((j % iWidth) * iWidth + (i % iHeight)) << 1;
            iHighIndex += iHighIndex;
            <ImageDataType> *ptLowData  = ptLowTex->GetImageData()  + iLowIndex,
                            *ptHighData = ptHighTex->GetImageData() + iHighIndex;
            *(ptTextureData++) = (unsigned char)((*(ptLowData++)  * fLowTexWeight) +
                                                 (*(ptHighData++) * fHighTexWeight));
            *(ptTextureData++) = (unsigned char)((*(ptLowData++)  * fLowTexWeight) +
                                                 (*(ptHighData++) * fHighTexWeight));
            *(ptTextureData++) = (unsigned char)((*(ptLowData++)  * fLowTexWeight) +
                                                 (*(ptHighData++) * fHighTexWeight));
            ptTextureData += iTexSize * 3;
        }
    }
Keys to success: Ability, ambition and opportunity.
Ok, so it wasn't that close. Aside from some potential differences in rounding and truncation, there is the small problem of fTexIndex being calculated off j, not i, so moving it out of the inner loop isn't valid. Considering the number of calculations based off j, I would be tempted to move j to the outer loop and loop on i in the inner loop. I believe the only error then would be how ptTextureData is stepped. Well, aside from any other errors I missed, but hey, when you take advice off the internet...
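
To make the loop swap concrete, here is a small sketch under simplifying assumptions: the height map is a flat n*n array read as height[i * n + j], the per-texel work is replaced by a hypothetical stand-in that just copies the height value into three channels, and the destination layout is the (j * iTexSize + i) * 3 indexing used elsewhere in this thread. None of the function names come from the original code. With j as the outer loop and i as the inner loop, the destination pointer simply advances three bytes per texel, with no extra stride to add.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: i outer, j inner, destination addressed by index.
std::vector<unsigned char> fillIndexed(const std::vector<unsigned char>& height,
                                       std::size_t n)
{
    assert(height.size() == n * n);
    std::vector<unsigned char> dst(n * n * 3);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t c = 0; c < 3; ++c)
                dst[(j * n + i) * 3 + c] = height[i * n + j];
    return dst;
}

// Swapped version: j outer, i inner, destination written sequentially.
std::vector<unsigned char> fillSequential(const std::vector<unsigned char>& height,
                                          std::size_t n)
{
    assert(height.size() == n * n);
    std::vector<unsigned char> dst(n * n * 3);
    unsigned char* p = dst.data();
    for (std::size_t j = 0; j < n; ++j)       // anything that depends only on j can be hoisted here
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            const unsigned char h = height[i * n + j];
            *p++ = h;                          // three channels, written back to back
            *p++ = h;
            *p++ = h;
        }
    }
    return dst;
}
```

The trade-off is that the height map is now read with a stride of n, so whether this wins overall depends on which array is bigger and how it is laid out; comparing both versions on real data is the only way to know.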
Ok, I've made an honest try, but not having the full source kinda gave me a headache because of the impossibility of compiling it to see if I broke it or not. So this is highly speculative.
But something like this would work, I guess...

    // Fill the texture with a combination of all landscape textures. Make a height based
    // per-texel choice of the source texture
    //ok, pass one: move everything that's not j dependent out of the inner loop.
    //pass two: move everything non i dependent out of the second loop.
    float iTex2 = 1.0 / iTexSize * iTexSize * 100.0f;
    int cTGAiLowTexTexWidth = cTGATextures[iLowTex].GetImageWidth();
    int iHighTexWidth = cTGATextures[iHighTex].GetImageWidth();
    unsigned char *iLowTexData = cTGATextures[iLowTex].GetImageData(); //i assume this returns a pointer.
    unsigned char *iHighTexData = cTGATextures[iHighTex].GetImageData();
    for (i=0; i<iTexSize; i++)
    {
        int cTGATexThings = cTGAiLowTexTexWidth + (i % cTGATextures[iLowTex].GetImageHeight()); //wtf is this?
        int iHighTexThing = i % iHighTexWidthHeight;
        // Calculate the two textures that are blended together
        iHighTex = (int) ceil(fTexIndex);
        iLowTex = (int) floor(fTexIndex);
        // Don't allow that we exceed the maximum texture count
        if (iHighTex > iMaxTexture - 1)
            iHighTex = iMaxTexture - 1;
        if (iLowTex > iMaxTexture - 1)
            iLowTex = iMaxTexture - 1;
        // Calculate the weights of each texture
        fHighTexWeight = fTexIndex - (float) floor(fTexIndex);
        fLowTexWeight = (float) ceil(fTexIndex) - fTexIndex;
        // Necessary to avoid black textures when we directly hit a
        // texture index
        if (fHighTexWeight == 0.0f && fLowTexWeight == 0.0f)
        {
            fHighTexWeight = 0.5f;
            fLowTexWeight = 0.5f;
        }
        for (j=0; j<iTexSize; j++)
        {
            // Update the progress window
            CProgressWindow::SetProgress((unsigned int) ((i * (float) iTexSize + j) * iTex2));
            // Calculate the "average" texture index
            fTexIndex = pHeight->m_Array[i][j] / 255.0f * iMaxTexture;
            // Calculate the texel offset in the lower texture array
            iIndex = (int) ((j % cTGAiLowTexTexWidth) * cTGATexThings) * 3;
            // Add the lower texture
            cTexel[0] = (unsigned char)(iLowTexData[iIndex + 0] * fLowTexWeight);
            cTexel[1] = (unsigned char)(iLowTexData[iIndex + 1] * fLowTexWeight);
            cTexel[2] = (unsigned char)(iLowTexData[iIndex + 2] * fLowTexWeight);
            // Calculate the texel offset in the higher texture array
            iIndex = (int) ((j % iHighTexWidthWidth) * iHighTexWidthWidth + iHighTexThing) * 3;
            // Add the higher texture
            cTexel[0] += (unsigned char)(iHighTexData[iIndex + 0] * fHighTexWeight);
            cTexel[1] += (unsigned char)(iHighTexData[iIndex + 1] * fHighTexWeight);
            cTexel[2] += (unsigned char)(iHighTexData[iIndex + 2] * fHighTexWeight);
            // Copy the texel to its destination
            memcpy(&pTextureData[(j * iTexSize + i) * 3], cTexel, 3);
        }
    }


Hope I got that at least half right.
As they say, two wrongs don't make a right; it usually takes three or more.
quote: Original post by Jumpster

Uh... Ok. So I am wrong. It's a good thing I didn't *insist* that it would be faster. I just thought that is what would happen. Thanks for the clarification.

Regards,
Jumpster


Uhm, I'm not really sure if this is sarcasm in the air or what.
I'm truly sorry if you feel like I tried to stomp you; that really wasn't the intention.

cheers!
Actually, that was not sarcasm. I was simply admitting to my mistake as pointed out by you.

Regards,
Jumpster
How about this:

    inline int __stdcall _round( float  x )
    {
      int  t;
      __asm  fld   x
      __asm  fistp t
      return t;
    }

    // float -> BYTE
    inline BYTE __stdcall _round_u8( float  x )
    {
      float  t = x + (float)0xC00000;
      return *(BYTE*)&t;
    }

    // floor for (x >= 0) && (x < 2^31)
    inline int __stdcall _floor_u( float  x )
    {
      DWORD  e = (0x7F + 31) - ((*(DWORD*)&x & 0x7F800000) >> 23);
      DWORD  m = 0x80000000 | (*(DWORD*)&x << 8);
      return (m >> e) & -(e<32);
    }

    //=================================================================
    float  fTexIndexScale = (iMaxTexture-1) / 255.0f;
    BYTE  *pDst = pTextureData; // pTextureData[(j * iTexSize + i) * 3]
    for(int j=0; j<iTexSize; j++)
    {
      for(int i=0; i<iTexSize; i++)
      {
        // Calculate the "average" texture index
        float  fTexIndex = pHeight->m_Array[ i ][ j ] * fTexIndexScale;

        // Calculate the two textures that are blended together
        int    iLowTex = _floor_u(fTexIndex);
        // or this one (choose the fastest):
        // int    iLowTex = _round( fTexIndex - 0.5f );
        float  fHighTexWeight = fTexIndex - iLowTex;

        // Don't allow that we exceed the maximum texture count
        int  iOverflowMask = -(DWORD(iLowTex) < DWORD(iMaxTexture));
        // VC6.0 can optimize it into (in pseudo-asm)
        // cmp  iLowTex, iMaxTexture
        // sbb  eax, eax // iOverflowMask = eax = (DWORD(iLowTex) < DWORD(iMaxTexture)) ? -1 : 0;
        *(int*)&fHighTexWeight &= iOverflowMask; // if(!iOverflowMask) fHighTexWeight = 0.0f;
        iLowTex -= 1 + iOverflowMask; // if(!iOverflowMask) iLowTex--;
        int  iHighTex = iLowTex + 1;

        int  iLowTexWidth   = cTGATextures[iLowTex ].GetImageWidth ();
        int  iLowTexHeight  = cTGATextures[iLowTex ].GetImageHeight();
        int  iHighTexWidth  = cTGATextures[iHighTex].GetImageWidth ();
        int  iHighTexHeight = cTGATextures[iHighTex].GetImageHeight();

        // Calculate the texel offset in the lower texture array
        // Calculate the texel offset in the higher texture array
        // for positive integer numbers:
        // if b==2^n  ->  a%b == a&(b-1)
        int  iIndexL = (j & (iLowTexWidth  - 1)) * iLowTexWidth  + (i & (iLowTexHeight  - 1));
        int  iIndexH = (j & (iHighTexWidth - 1)) * iHighTexWidth + (i & (iHighTexHeight - 1));
        iIndexL += 2*iIndexL;
        iIndexH += 2*iIndexH;
        BYTE *pLowTexData  = cTGATextures[iLowTex ].GetImageData() + iIndexL;
        BYTE *pHighTexData = cTGATextures[iHighTex].GetImageData() + iIndexH;

        // Copy the texel to its destination
        int  iLow0 = pLowTexData[0];
        int  iLow1 = pLowTexData[1];
        int  iLow2 = pLowTexData[2];
        // _round() or _round_u8() (choose the fastest)
        pDst[0] = iLow0 + _round(fHighTexWeight * (pHighTexData[0] - iLow0));
        pDst[1] = iLow1 + _round(fHighTexWeight * (pHighTexData[1] - iLow1));
        pDst[2] = iLow2 + _round(fHighTexWeight * (pHighTexData[2] - iLow2));
        pDst += 3;
      }
    }
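
Two of the tricks in that listing are easy to try in isolation. The sketch below is portable C++ and not the code above: wrapPow2 and blendChannel are hypothetical helper names, and std::lround stands in for the inline-asm _round (it rounds halfway cases away from zero, which may differ from fistp in exact ties). The first helper shows the a % b == a & (b - 1) identity, which only holds for non-negative a and power-of-two b. The second shows why one multiply per channel is enough: low * (1 - w) + high * w is the same as low + w * (high - low).

```cpp
#include <cassert>
#include <cmath>

// a % b for non-negative a when b is a power of two.
inline int wrapPow2(int a, int b)
{
    assert(a >= 0 && b > 0 && (b & (b - 1)) == 0);
    return a & (b - 1);
}

// Blend two channel values with a single multiply.
// w is the weight of 'high', in the range 0..1.
inline unsigned char blendChannel(int low, int high, float w)
{
    assert(w >= 0.0f && w <= 1.0f);
    assert(low >= 0 && low <= 255 && high >= 0 && high <= 255);
    return static_cast<unsigned char>(low + std::lround(w * (high - low)));
}
```

If the texture sizes are not guaranteed to be powers of two, keep the plain % operator; the mask version silently gives wrong offsets in that case.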
I'm not going to try and come up with an optimized algorithm; enough people have thrown in their $0.02 on that. What really jumped out at me from looking at the code was that there was a lot of casting going on. I don't know what kind of perf hit you're taking by doing this, but it seems to me that casting would introduce a lot of extra processing. If you're going through this loop 65025 times, that adds up! My suggestion would be to overload your functions to give you the data type you need rather than casting each iteration.
There is no spoon.
Thanks for all your help, guys! Casting was a BIG issue in all my routines! I got this fast float/int casting asm code from nvidia.com, and it gave me a 33% speed boost. I never knew it was that expensive ;-)
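
The nvidia code itself is not shown in this thread, so the following is only a sketch of the general idea in portable C++, with roundToInt as a made-up name. The usual explanation for the cost is that a plain (int) cast has to truncate, and on the x87 FPU compilers of that era did this by switching the rounding mode around the store. std::lrint converts using whatever rounding mode is currently active (round-to-nearest by default), so no mode switch is needed. The catch is that the result is rounded, not truncated, so it is not a drop-in replacement wherever the code relies on truncation.

```cpp
#include <cassert>
#include <cmath>

// Convert float to int using the current rounding mode
// (round-to-nearest unless the program has changed it).
inline int roundToInt(float x)
{
    assert(x > -2147483648.0f && x < 2147483648.0f); // must fit in a 32-bit int
    return static_cast<int>(std::lrint(x));
}
```

For example, roundToInt(2.75f) gives 3 where (int) 2.75f gives 2, so weights and indices computed this way can differ by one from the cast-based version.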

Oh, and I'm pretty familiar with threads. But if I see that I have to handle critical sections and such stuff just to update my progress bar... Better to just move that crap out of the inner loop and you're fine; 2048 calls to the bar are not very expensive. I'm really not that "oh it's MFC, let's better say it's crappy before I try to understand it" guy. I've written lots of multithreaded winsock servers on such stuff, but I just think a thread won't help much. Threads slow down a program; they can just make it more responsive for the user, and easier to manage for the programmer. Since we're talking about a loading screen of a game, there's no user input to process. Except a cancel button, but that could be handled through a return value of SetProgress. I think I could update my progress bar in the loop.
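
A rough sketch of that plan, with everything in it hypothetical: ProgressFn stands in for whatever CProgressWindow::SetProgress ends up looking like, and its boolean return value (false meaning the user pressed cancel) is an assumption, not the real MFC-side signature. The progress call sits in the outer loop, so a size-by-size texture triggers only size updates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical progress callback: receives a percentage,
// returns false if the user asked to cancel.
using ProgressFn = bool (*)(unsigned int percent);

// Processes a size x size texture and reports progress once per row.
// Returns the number of rows completed before finishing or cancelling.
std::size_t processRows(std::size_t size, ProgressFn setProgress)
{
    assert(setProgress != nullptr);
    std::size_t rowsDone = 0;
    for (std::size_t i = 0; i < size; ++i)
    {
        for (std::size_t j = 0; j < size; ++j)
        {
            // per-texel work goes here; no progress call in this loop
        }
        ++rowsDone;
        const unsigned int percent =
            static_cast<unsigned int>(rowsDone * 100 / size);
        if (!setProgress(percent))
            break; // cancel requested: stop after the current row
    }
    return rowsDone;
}
```

Cancelling is only noticed at row boundaries, which should be fine as long as a single row does not take long to process.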

Oh, and Stoffel: I want to apologize for this mindless rant. I just thought it would be somehow funny or whatever. It obviously wasn't; I hope you aren't upset anymore!

Ok, I will now step through all your code and try your suggestions; some are very cool!


Tim

--------------------------
glvelocity.gamedev.net
www.gamedev.net/hosted/glvelocity

This topic is closed to new replies.
