Trying to find bottlenecks in my renderer

Started by December 07, 2017 07:55 AM
37 comments, last by Matias Goldberg 7 years, 1 month ago
On 12/21/2017 at 7:51 AM, Eternal said:

You're never using the results of your matrix multiplication, so the entire benchmarking loop will be optimized away in release mode.

Good catch if correct, I hadn't thought about that.

On 12/20/2017 at 2:06 PM, Matias Goldberg said:

Taking 40ms for doing 40k matrix multiplications per frame for a single core sounds about correct.

I calculated about 24ms worst case, and about 9.6ms for a decent implementation for 40k transforms.

edit - for a 2.5ghz oooe processor... assuming a tight loop.

16 hours ago, Matias Goldberg said:

This will yield much better performance. Even then, it's not ideal, because accessing a different matrix every 6 threads in a wavefront will lead to bank conflicts.

Aren't bank conflicts for LDS and (in some implementations) register access?  I don't think constant buffers are stored in LDS.

edit - I would say memory coalescing was relevant but a matrix currently takes up a whole cache line.

 

-potential energy is easily made kinetic-

16 hours ago, Matias Goldberg said:

Instancing will not lead to good performance, as each sprite will very likely be given its own wavefront unless you're lucky (on an AMD GPU, you'll be using 9.4% of processing capacity while the rest is wasted!)

Oh, and I've never confirmed this, but I read somewhere that some GPUs can pack a wavefront/warp with work from different instances.  They also said that while instanced work can be packed, execute/draw indirect currently isn't packed into a single wavefront/warp.


1 hour ago, Infinisearch said:

Oh, and I've never confirmed this, but I read somewhere that some GPUs can pack a wavefront/warp with work from different instances.  They also said that while instanced work can be packed, execute/draw indirect currently isn't packed into a single wavefront/warp.

I got confirmation from an AMD Driver engineer himself. Yes, it's true. However don't count on it. The driver can only merge your instancing into the same wavefront if several conditions are met. I don't know the exact conditions, but they're all HW limitation related. i.e. if the driver cannot 100% guarantee the GPU can always merge your instances without rendering artifacts, then it won't (even if it were completely safe given the data you're going to be feeding, but the driver doesn't know that a priori, or it would take a considerable amount of CPU cycles to determine so).

3 hours ago, Infinisearch said:

Aren't bank conflicts for LDS and (in some implementations) register access?  I don't think constant buffers are stored in LDS.

edit - I would say memory coalescing was relevant but a matrix currently takes up a whole cache line.

When it comes to AMD, access to global memory may have channel and bank conflict issues. NVIDIA implements it as a huge register file, so there's always a reason...

On 12/21/2017 at 5:10 AM, noodleBowl said:

since each sprite has its own model matrix

Do you really need a whole matrix for each sprite? Perhaps you could limit it to just the position? Then you would save a lot of space and increase the maximum number of instances.

2 hours ago, Matias Goldberg said:

When it comes to AMD, access to global memory may have channel and bank conflict issues. NVIDIA implements it as a huge register file, so there's always a reason...

Channel conflicts, I suppose, are possible.  But again, what are the bank conflict issues you speak of? If memory accesses in different threads of a wavefront hit the same cache line, they are coalesced; if not, they are serialized.  Unless you are saying the L1 is banked, I don't see any room for bank conflicts.  And as far as NVIDIA goes, what exactly are you saying gets implemented as a huge register file?  Could you include a source if possible?


On 12/21/2017 at 7:51 AM, Eternal said:

You're never using the results of your matrix multiplication, so the entire benchmarking loop will be optimized away in release mode.

I'm not sure if this is the problem. I might still be doing it wrong, but I went back and attempted to use the result, and I'm still getting 0ms :/



//Setup code for the data is here

QueryPerformanceCounter(&startTime);
for (int i = 0; i < INT_MAX; ++i)
{
	for (int j = 0; j < 4; ++j)
	{
		// smMat (input matrix), smVec (input vector), smRes (output result)
		simdMul(smMat, smVec, smRes);
	}
}
QueryPerformanceCounter(&endTime);
std::cout << "New RAW SIMD Solution TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000000) / (double)frq.QuadPart) + "micro" << std::endl;

//Attempt to use the result
float rData[4];
_mm_store_ps(rData, smRes.data);

 

 

On 12/21/2017 at 7:06 PM, Matias Goldberg said:

This will yield much better performance. Even then, it's not ideal, because accessing a different matrix every 6 threads in a wavefront will lead to bank conflicts.

A more optimal path would be to update the vertices using a compute shader that processes all 6 vertices in the same thread, thus each thread in a wavefront will access a different bank (i.e. one thread per sprite).

Can you explain what a wavefront, warp, and bank conflict is? I never really heard of these terms before

From what I understand, a bank conflict is where I'm trying to reuse the same memory bank that I'm already working with (this is what I gather from the link you provided). As for a wavefront, I believe this is a grouping of threads that are executed on the GPU. I'm not sure what a warp is. I'm also not sure what the significance of a wavefront is.
 

20 hours ago, Zaoshi Kaba said:

Do you really need a whole matrix for each sprite? Perhaps you could limit it to just the position? Then you would save a lot of space and increase the maximum number of instances.

Honestly, no, not really. I was only using the matrices because I wanted to be able to scale and rotate my sprites along with the normal position transforms, BUT I can definitely do all of those things without a matrix.

I went back and reworked it to not use matrices at all. Using the same testing conditions, I'm maxing out at ~175K 64x64 sprites when running in release mode. When I say maxing out, I mean that I'm maintaining right at or just above 60 fps. My SpriteRenderer::render() method takes ~9.5ms on average in these conditions. This average is based on the time taken by 1000 SpriteRenderer::render() executions.

I'm not really sure how I feel about this, because a part of me says "good, way better!" and the other part says "it could be better...". But I feel like in order to make it better I need to start exploring different methods, such as the ones described by @Matias Goldberg and others. I should also look into multithreading, since I'm definitely only using a single thread.

 

8 minutes ago, noodleBowl said:

I'm not really sure how I feel about this, because a part of me says "good, way better!" and the other part says "it could be better...".

Are you still constructing vertices on CPU? If yes, try moving that to GPU using instancing.

One way to do that would be to create:

  1. vertex buffer that contains only 4 vertices (there's a way to achieve same without vertex buffer at all, but let's keep it simple for now)
  2. instancing buffer that contains position, scale (optional), rotation (optional), texcoord

Then you'd just set them both and render N number of instances of 4 vertices and use information from instancing buffer to transform vertices into sprites.
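The per-instance data in the list above could be laid out roughly like this. It is a CPU-side C++ sketch of the math a vertex shader would apply to each of the 4 shared vertices, not code from the thread's renderer: the names `Vec2`, `SpriteInstance`, `kUnitQuad` and `transformCorner` are hypothetical, the quad is assumed to be a unit square centered on the origin, and the texcoord field is left out for brevity.

```cpp
#include <cmath>

struct Vec2 { float x; float y; };

// One entry of the instancing buffer (hypothetical layout).
struct SpriteInstance {
    Vec2 position;   // sprite center in world/screen units
    Vec2 scale;      // sprite width and height
    float rotation;  // radians, counter-clockwise
};

// The 4 shared vertices: a unit quad centered on the origin,
// listed in triangle-strip order.
constexpr Vec2 kUnitQuad[4] = {
    {-0.5f, -0.5f}, {0.5f, -0.5f}, {-0.5f, 0.5f}, {0.5f, 0.5f}};

// Scale, then rotate, then translate one corner of the quad.
// This mirrors what the vertex shader would do per vertex and instance.
inline Vec2 transformCorner(const SpriteInstance& s, int corner) {
    const Vec2 local{kUnitQuad[corner].x * s.scale.x,
                     kUnitQuad[corner].y * s.scale.y};
    const float c = std::cos(s.rotation);
    const float sn = std::sin(s.rotation);
    return {local.x * c - local.y * sn + s.position.x,
            local.x * sn + local.y * c + s.position.y};
}
```

With this layout each instance costs 5 floats instead of the 16 of a full 4x4 matrix, which is the saving the suggestion is after.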

33 minutes ago, noodleBowl said:

I'm not sure if this is the problem. I might still be doing it wrong, but I went back and attempted to use the result, and I'm still getting 0ms :/




//Setup code for the data is here

QueryPerformanceCounter(&startTime);
for (int i = 0; i < INT_MAX; ++i)
{
	for (int j = 0; j < 4; ++j)
	{
		// smMat (input matrix), smVec (input vector), smRes (output result)
		simdMul(smMat, smVec, smRes);
	}
}
QueryPerformanceCounter(&endTime);
std::cout << "New RAW SIMD Solution TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000000) / (double)frq.QuadPart) + "micro" << std::endl;

//Attempt to use the result
float rData[4];
_mm_store_ps(rData, smRes.data);

 

 

You're still not using the result.

printf() rData[0] through rData[3] so it is used.

And source your input from something unknown, like argv from main or by reading from a file; otherwise the optimizer may evaluate the code at compile time and hardcode everything (since everything could otherwise be resolved at compile time rather than calculated at runtime).
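A minimal sketch of a benchmark that follows both pieces of advice might look like the following. It is not the poster's code: `mulMatVec` is a plain scalar stand-in for `simdMul`, `std::chrono::steady_clock` replaces `QueryPerformanceCounter` for portability, and the names `Vec4`, `Mat4`, `BenchResult` and `runBenchmark` are invented for illustration. The `seed` is meant to come from runtime input, and each iteration feeds its output into the next, so the loop cannot simply be dropped or folded into a constant.

```cpp
#include <array>
#include <chrono>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;  // row-major

// Scalar stand-in for the thread's simdMul (hypothetical).
inline Vec4 mulMatVec(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            r[row] += m[row][col] * v[col];
    return r;
}

struct BenchResult {
    Vec4 value;           // final vector; the caller must print or otherwise use it
    double microseconds;  // elapsed wall-clock time of the loop
};

// `seed` should come from something the compiler cannot see,
// e.g. argv or a file, so the inputs are unknown at compile time.
BenchResult runBenchmark(float seed, int iterations) {
    Mat4 m{};
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0f;  // start from identity
    m[0][3] = seed;                              // runtime-only term
    Vec4 v{seed, 2.0f, 3.0f, 1.0f};

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        v = mulMatVec(m, v);  // each result is the next input
    const auto end = std::chrono::steady_clock::now();

    return {v, std::chrono::duration<double, std::micro>(end - start).count()};
}
```

From `main`, something like `runBenchmark(std::strtof(argv[1], nullptr), 1000000)` followed by a `printf` of all four components of `value` and of `microseconds` covers both points: the input is unknown at compile time and the result is observably used.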

33 minutes ago, noodleBowl said:

Can you explain what a wavefront, warp, and bank conflict is? I never really heard of these terms before

Sure. I can save you some time by telling you that warp and wavefront are synonyms. A warp is what NVIDIA's marketing dept calls them; a wavefront is what AMD's marketing dept calls them.

You can also check my Where do I start Graphics Programming post for resources. Particularly the "Very technical about GPUs." section. You may find the one that says "Latency hiding in GCN" relevant.

As for bank conflicts... you got it quite right. Memory is subdivided into banks. If all threads in a wavefront access the same bank, all is OK. If each thread accesses a different bank, all is OK. But if only some of the threads access the same bank, then things get slower.
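To make the idea concrete, here is a toy model of banked memory. Everything in it is an assumption for illustration: 32 banks, each 4 bytes wide, with threads that read the very same word being served together. Real GPUs differ in bank count, width and which memories are banked at all, so treat `conflictDegree` and `stridedAddresses` as hypothetical helpers rather than a description of any specific hardware.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr std::uint32_t kNumBanks = 32;       // assumed bank count
constexpr std::uint32_t kBankWidthBytes = 4;  // assumed bank width

// Returns how many passes the slowest bank needs: the largest number of
// *distinct* words that map to a single bank. 1 means conflict-free.
int conflictDegree(const std::vector<std::uint32_t>& byteAddresses) {
    std::vector<std::set<std::uint32_t>> wordsPerBank(kNumBanks);
    for (std::uint32_t addr : byteAddresses) {
        const std::uint32_t word = addr / kBankWidthBytes;
        wordsPerBank[word % kNumBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank)
        worst = std::max(worst, words.size());
    return static_cast<int>(worst);
}

// Address of each thread when thread i reads byte offset i * strideBytes.
std::vector<std::uint32_t> stridedAddresses(int threads, std::uint32_t strideBytes) {
    std::vector<std::uint32_t> out;
    for (int i = 0; i < threads; ++i)
        out.push_back(static_cast<std::uint32_t>(i) * strideBytes);
    return out;
}
```

In this model, 32 threads reading consecutive 4-byte words touch 32 different banks (degree 1), all threads reading the same word are also fine (degree 1), while a stride of 8 bytes puts two different words on each used bank (degree 2) and a stride of 128 bytes puts all 32 reads on one bank (degree 32).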

33 minutes ago, noodleBowl said:

I'm not really sure how I feel about this, because a part of me says "good, way better!" and the other part says "it could be better...". But I feel like in order to make it better I need to start exploring different methods, such as the ones described by @Matias Goldberg and others. I should also look into multithreading, since I'm definitely only using a single thread.

I was just trying to point out there's always a more efficient way to do things.

But you're doing a videogame. At some point you have to say "STOP!" to yourself, or else you'll end up in an endless spiral of constantly reworking things out and never finishing your game. There's always a better way. You need to know when something is good enough.

I suggest you go for the vertex shader option I gave you (inPos * worldMatrix[vertexId / 6u]). If that gives you enough performance for what you need, move on.

Trying Compute Shader approaches also restricts your target (e.g. older GPUs won't be able to run it) which is a terrible idea for a 2D game.
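For reference, the indexing behind `worldMatrix[vertexId / 6u]` can be sketched on the CPU side like this. It assumes each sprite is drawn as two triangles (6 vertices) laid out consecutively in one draw call; the names `VertexSlot`, `slotForVertex` and `kQuadCornerForLocalVertex`, and the particular corner ordering, are hypothetical and not taken from the thread.

```cpp
#include <cstdint>

constexpr std::uint32_t kVertsPerSprite = 6;  // two triangles per sprite

struct VertexSlot {
    std::uint32_t sprite;       // index into the per-sprite matrix array
    std::uint32_t localVertex;  // 0..5 within the sprite's two triangles
};

// Integer division picks the sprite; the remainder picks the vertex in it.
constexpr VertexSlot slotForVertex(std::uint32_t vertexId) {
    return {vertexId / kVertsPerSprite, vertexId % kVertsPerSprite};
}

// One possible mapping from the 6 triangle vertices to the 4 quad corners
// (triangles 0-1-2 and 2-1-3); the actual winding is an assumption.
constexpr std::uint32_t kQuadCornerForLocalVertex[kVertsPerSprite] = {0, 1, 2, 2, 1, 3};
```

So vertex 13 belongs to sprite 2 and is the second vertex of that sprite, and all 6 vertices of a sprite read the same matrix.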

This topic is closed to new replies.
