Trying to find bottlenecks in my renderer

noodleBowl · 2017-12-23T16:53:42

I just finished the first iteration of my sprite renderer and I'm questioning its performance. Currently, I am trying to render 10K 64x64 textured sprites in an 800x600 window. These sprites all use the same texture, vertex shader, and pixel shader, so there are basically no state changes. The sprite renderer itself is dynamic: it maps with D3D11_MAP_WRITE_NO_OVERWRITE, then with D3D11_MAP_WRITE_DISCARD when the vertex buffer is full. The buffer is large enough to hold all 10K sprites and draw them in a single draw call. Cutting the buffer size down so that it only fits 1000 sprites before a draw call is executed does not seem to matter or improve performance.

When I clock the time it takes to complete the render method for my sprite renderer (the only renderer that is running) I'm getting about 40ms. Aside from adjusting the size of the vertex buffer, I have tried using a 1x1 texture and making the window smaller (640x480) as a quick and dirty check to see if the GPU was the bottleneck, but I still get 40ms in both of those cases.

I'm kind of at a loss. What are some of the ways that I could figure out where my bottleneck is? I feel like only being able to render 10K sprites is really low, but I'm not sure. I don't know if I coded a poor renderer and there is a bottleneck somewhere, or if I'm being limited by my hardware.

Just some other info:

Dev PC specs:
GPU: Intel HD Graphics 4600 / Nvidia GTX 850M (Nvidia is set to be the preferred GPU in the Nvidia control panel. Vsync is set to off)
CPU: Intel Core i7-4710HQ @ 2.5GHz

Renderer:

//The renderer has a working depth buffer

//Sprites have matrices that are precomputed. These pretransformed vertices are placed into the buffer
Matrix4 model = sprite->getModelMatrix();
verts[0].position = model * verts[0].position;
verts[1].position = model * verts[1].position;
verts[2].position = model * verts[2].position;
verts[3].position = model * verts[3].position;
verts[4].position = model * verts[4].position;
verts[5].position = model * verts[5].position;

//Vertex buffer is flagged for dynamic use
vertexBuffer = BufferModule::createVertexBuffer(D3D11_USAGE_DYNAMIC, D3D11_CPU_ACCESS_WRITE, sizeof(SpriteVertex) * MAX_VERTEX_COUNT_FOR_BUFFER);

//The vertex buffer is mapped to when adding a sprite to the buffer
//vertexBufferMapType could be D3D11_MAP_WRITE_NO_OVERWRITE or D3D11_MAP_WRITE_DISCARD depending on the data already in the vertex buffer
D3D11_MAPPED_SUBRESOURCE resource = vertexBuffer->map(vertexBufferMapType);
memcpy(((SpriteVertex*)resource.pData) + vertexCountInBuffer, verts, BYTES_PER_SPRITE);
vertexBuffer->unmap();

//The constant buffer used for the MVP matrix is updated once per draw call
D3D11_MAPPED_SUBRESOURCE resource = mvpConstBuffer->map(D3D11_MAP_WRITE_DISCARD);
memcpy(resource.pData, projectionMatrix.getData(), sizeof(Matrix4));
mvpConstBuffer->unmap();

Vertex / Pixel Shader:

cbuffer mvpBuffer : register(b0)
{
	matrix mvp;
}

struct VertexInput
{
	float4 position : POSITION;
	float2 texCoords : TEXCOORD0;
	float4 color : COLOR;
};

struct PixelInput
{
	float4 position : SV_POSITION;
	float2 texCoords : TEXCOORD0;
	float4 color : COLOR;
};

PixelInput VSMain(VertexInput input)
{
	input.position.w = 1.0f;

	PixelInput output;
	output.position = mul(mvp, input.position);
	output.texCoords = input.texCoords;
	output.color = input.color;
	return output;
}

Texture2D shaderTexture;
SamplerState samplerType;

float4 PSMain(PixelInput input) : SV_TARGET
{
	float4 textureColor = shaderTexture.Sample(samplerType, input.texCoords);
	return textureColor;
}

If any more info is needed feel free to ask. I would really like to know how I can improve this, assuming I'm not hardware limited.
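For readers unfamiliar with the NO_OVERWRITE / DISCARD pattern mentioned in the post, here is a minimal sketch of the map-type decision on its own. It assumes a simple linear cursor over the vertex buffer; MapType, DynamicBufferCursor and reserve are hypothetical names standing in for the D3D11 flags and the poster's renderer state, so this is not the poster's actual code.

```cpp
#include <cstddef>

// Hypothetical stand-ins for D3D11_MAP_WRITE_NO_OVERWRITE and
// D3D11_MAP_WRITE_DISCARD, so the sketch compiles without the Windows SDK.
enum class MapType { NoOverwrite, Discard };

// Tracks where the next sprite's vertices go inside the dynamic buffer.
struct DynamicBufferCursor {
    std::size_t capacity;     // total vertices the buffer can hold
    std::size_t writeOffset;  // next free vertex slot
};

struct Reservation {
    MapType mapType;          // how the buffer should be mapped for this write
    std::size_t firstVertex;  // where the caller should copy its vertices
};

// Reserves `count` vertices (assumed to be <= capacity).
// Writing at offset 0 uses Discard so the driver can hand out fresh memory;
// appending behind data that may still be in flight uses NoOverwrite.
Reservation reserve(DynamicBufferCursor& cursor, std::size_t count) {
    const bool wraps = cursor.writeOffset + count > cursor.capacity;
    if (wraps || cursor.writeOffset == 0) {
        cursor.writeOffset = count;
        return Reservation{MapType::Discard, 0};
    }
    const std::size_t first = cursor.writeOffset;
    cursor.writeOffset += count;
    return Reservation{MapType::NoOverwrite, first};
}
```

In a real renderer the draw call for the pending vertices would be issued before the wrap, since Discard invalidates the previous contents from the CPU's point of view.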

Graphics and GPU Programming Programming DX11 C++

Started by noodleBowl December 07, 2017 07:55 AM

37 comments, last by Matias Goldberg 7 years, 1 month ago

Infinisearch

3,058

December 22, 2017 02:11 PM

On 12/21/2017 at 7:51 AM, Eternal said:
You're never using the results of your matrix multiplication, so the entire benchmarking loop will be optimized away in release mode.

Good catch if correct, I hadn't thought about that.

On 12/20/2017 at 2:06 PM, Matias Goldberg said:
Taking 40ms for doing 40k matrix multiplications per frame for a single core sounds about correct.

I calculated about 24ms worst case, and about 9.6ms for a decent implementation for 40k transforms.

edit - for a 2.5 GHz out-of-order (OoOE) processor, assuming a tight loop.

16 hours ago, Matias Goldberg said:
This will yield much better performance. Even then, it's not ideal, because accessing a different matrix every 6 threads in a wavefront will lead to bank conflicts.

Aren't bank conflicts for LDS and (in some implementations) register access? I don't think constant buffers are stored in LDS.

edit - I would say memory coalescing was relevant but a matrix currently takes up a whole cache line.

-potential energy is easily made kinetic-

Infinisearch

3,058

December 22, 2017 04:45 PM

16 hours ago, Matias Goldberg said:
Instancing will not lead to good performance, as each sprite will very likely be given its own wavefront unless you're lucky (on an AMD GPU, you'll be using 9.4% of processing capacity while the rest is wasted!)

Oh, and I've never confirmed this, but I read somewhere that some GPUs can pack a wavefront/warp with work from different instances. They also said that while work can be packed with instancing, execute/draw indirect currently isn't packed into a single wavefront/warp.

-potential energy is easily made kinetic-

Matias Goldberg

9,638

December 22, 2017 05:59 PM

1 hour ago, Infinisearch said:
Oh, and I've never confirmed this, but I read somewhere that some GPUs can pack a wavefront/warp with work from different instances. They also said that while work can be packed with instancing, execute/draw indirect currently isn't packed into a single wavefront/warp.

I got confirmation from an AMD Driver engineer himself. Yes, it's true. However don't count on it. The driver can only merge your instancing into the same wavefront if several conditions are met. I don't know the exact conditions, but they're all HW limitation related. i.e. if the driver cannot 100% guarantee the GPU can always merge your instances without rendering artifacts, then it won't (even if it were completely safe given the data you're going to be feeding, but the driver doesn't know that a priori, or it would take a considerable amount of CPU cycles to determine so).

3 hours ago, Infinisearch said:
Aren't bank conflicts for LDS and (in some implementations) register access? I don't think constant buffers are stored in LDS.
edit - I would say memory coalescing was relevant but a matrix currently takes up a whole cache line.

When it comes to AMD, access to global memory may have channel and bank conflict issues. NVIDIA implements it as a huge register file, so there's always a reason...

Twitter: @matiasgoldberg

Distant Souls · Alliance AirWar · My Free Royalty-Free Music Library

Zaoshi Kaba

8,470

December 22, 2017 07:42 PM

On 12/21/2017 at 5:10 AM, noodleBowl said:
since each sprite has its own model matrix

Do you really need a whole matrix for each sprite? Perhaps you could limit it to just the position? Then you would save a lot of space and increase the maximum number of instances.

Infinisearch

3,058

December 22, 2017 08:51 PM

2 hours ago, Matias Goldberg said:
When it comes to AMD, access to global memory may have channel and bank conflict issues. NVIDIA implements it as a huge register file, so there's always a reason...

Channel conflicts I suppose are possible. But again, what are the bank conflict issues you speak of? If memory accesses in different threads of a wavefront hit the same cache line they are coalesced; if not, they are serialized. Unless you are saying the L1 is banked, I don't see any room for bank conflicts. And as far as Nvidia goes, what exactly are you saying gets implemented as a huge register file? Could you include a source if possible?

-potential energy is easily made kinetic-

noodleBowl

Author

718

December 23, 2017 04:04 PM

On 12/21/2017 at 7:51 AM, Eternal said:
You're never using the results of your matrix multiplication, so the entire benchmarking loop will be optimized away in release mode.

I'm not sure if this is the problem. I might still be doing it wrong, but I went back and attempted to use the result, and I'm still getting 0ms

//Setup code for the data is here

QueryPerformanceCounter(&startTime);
for (int i = 0; i < INT_MAX; ++i)
{
	for (int j = 0; j < 4; ++j)
	{
		//smMat (input matrix), smVec (input vector), smRes (output result)
		simdMul(smMat, smVec, smRes);
	}
}
QueryPerformanceCounter(&endTime);
std::cout << "New RAW SIMD Solution TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000000) / (double)frq.QuadPart) + "micro" << std::endl;

//Attempt to use the result
alignas(16) float rData[4]; //_mm_store_ps requires a 16-byte aligned destination
_mm_store_ps(rData, smRes.data);

On 12/21/2017 at 7:06 PM, Matias Goldberg said:
This will yield much better performance. Even then, it's not ideal, because accessing a different matrix every 6 threads in a wavefront will lead to bank conflicts.
A more optimal path would be to update the vertices using a compute shader that processes all 6 vertices in the same thread, thus each thread in a wavefront will access a different bank (i.e. one thread per sprite).

Can you explain what a wavefront, a warp, and a bank conflict are? I've never really heard these terms before.

From what I understand, a bank conflict is where I'm trying to reuse the same memory bank that I'm already working with (this is what I gather from the link you provided). As for a wavefront, I believe this is a grouping of threads that are executed on the GPU. I'm not sure what a warp is. I'm also not sure what the significance of a wavefront is.

20 hours ago, Zaoshi Kaba said:
Do you really need a whole matrix for each sprite? Perhaps you could limit it to just the position? Then you would save a lot of space and increase the maximum number of instances.

Honestly, no, not really. I was only using the matrices because I wanted to be able to scale and rotate my sprites along with the normal position transforms, BUT I can definitely do all of those things without a matrix.

I went back and reworked it to not use matrices at all. Using the same testing conditions, I'm maxing out at ~175K 64x64 sprites when running in release mode. When I say maxing out, I mean that I'm maintaining right at/just above 60 fps. The time it takes for my SpriteRenderer::render() method is ~9.5ms on average in these conditions. This average is based on the time it takes for 1000 SpriteRenderer::render() executions.

I'm not really sure how I feel about this, because a part of me says "good, way better!" and the other part says "it could be better...". But I feel like in order to make it better I need to start exploring different methods, such as the ones described by @Matias Goldberg and others. I'm also looking into multithreading, since I'm definitely only using a single thread.

Zaoshi Kaba

8,470

December 23, 2017 04:17 PM

8 minutes ago, noodleBowl said:
I'm not really sure how I feel about this, because a part of me says "good, way better!" and the other part says "it could be better...".

Are you still constructing vertices on CPU? If yes, try moving that to GPU using instancing.

One way to do that would be to create:

- a vertex buffer that contains only 4 vertices (there's a way to achieve the same without a vertex buffer at all, but let's keep it simple for now)
- an instancing buffer that contains position, scale (optional), rotation (optional), texcoord

Then you'd just set them both and render N number of instances of 4 vertices and use information from instancing buffer to transform vertices into sprites.
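As a rough illustration of the layout described above, here is a CPU-side sketch of what the instanced vertex shader would compute for each corner. The struct and field names (SpriteInstance, uvMin, uvMax), the unit-quad corner order, and the choice of radians for rotation are assumptions made for this example, not something specified in the thread.

```cpp
#include <cmath>

struct Float2 { float x, y; };

// Hypothetical per-instance record: one of these per sprite in the
// instancing buffer.
struct SpriteInstance {
    Float2 position;  // sprite centre
    Float2 scale;     // width and height
    float rotation;   // radians
    Float2 uvMin;     // texcoord of the (-0.5, -0.5) corner
    Float2 uvMax;     // texcoord of the (+0.5, +0.5) corner
};

// The 4 shared vertices: a unit quad centred on the origin, ordered so they
// can be drawn as a triangle strip.
const Float2 kQuadCorners[4] = {
    {-0.5f, -0.5f}, {-0.5f, 0.5f}, {0.5f, -0.5f}, {0.5f, 0.5f}
};

// Scale, then rotate, then translate one shared corner by the instance data.
Float2 transformCorner(const SpriteInstance& inst, Float2 corner) {
    const float sx = corner.x * inst.scale.x;
    const float sy = corner.y * inst.scale.y;
    const float c = std::cos(inst.rotation);
    const float s = std::sin(inst.rotation);
    return { inst.position.x + sx * c - sy * s,
             inst.position.y + sx * s + sy * c };
}

// Interpolates the instance's texcoord rectangle across the quad.
Float2 cornerUV(const SpriteInstance& inst, Float2 corner) {
    return { inst.uvMin.x + (corner.x + 0.5f) * (inst.uvMax.x - inst.uvMin.x),
             inst.uvMin.y + (corner.y + 0.5f) * (inst.uvMax.y - inst.uvMin.y) };
}
```

The point of the design is that the CPU only writes one small SpriteInstance per sprite instead of six fully transformed vertices.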

Matias Goldberg

9,638

December 23, 2017 04:53 PM

33 minutes ago, noodleBowl said:

I'm not sure if this is the problem. I might still be doing it wrong, but I went back and attempted to use the result, and I'm still getting 0ms

//Setup code for the data is here

QueryPerformanceCounter(&startTime);
for (int i = 0; i < INT_MAX; ++i)
{
	for (int j = 0; j < 4; ++j)
	{
		//smMat (input matrix), smVec (input vector), smRes (output result)
		simdMul(smMat, smVec, smRes);
	}
}
QueryPerformanceCounter(&endTime);
std::cout << "New RAW SIMD Solution TIME: " + std::to_string((double)((endTime.QuadPart - startTime.QuadPart) * 1000000) / (double)frq.QuadPart) + "micro" << std::endl;

//Attempt to use the result
alignas(16) float rData[4]; //_mm_store_ps requires a 16-byte aligned destination
_mm_store_ps(rData, smRes.data);

You're still not using the result.

printf() rData[0] through rData[3] so it is used.

And source your input from something unknown, like argv from main or by reading from a file; otherwise the optimizer may evaluate the code at compile time and hardcode everything (since everything could be resolved at compile time rather than calculated at runtime).
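A minimal sketch of a benchmark shaped along those lines, under a few assumptions: a plain scalar mul stands in for the poster's simdMul, the matrix is built from a runtime seed, and each iteration feeds on the previous result so no iteration is independent of the others. All names here (Vec4, Mat4, runChain, timeChainMicros) are made up for the example.

```cpp
#include <chrono>

struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };  // row-major

// Scalar matrix * column-vector multiply (stand-in for a SIMD version).
Vec4 mul(const Mat4& a, const Vec4& v) {
    return { a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.z + a.m[0][3] * v.w,
             a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.z + a.m[1][3] * v.w,
             a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.z + a.m[2][3] * v.w,
             a.m[3][0] * v.x + a.m[3][1] * v.y + a.m[3][2] * v.z + a.m[3][3] * v.w };
}

// Each multiply consumes the previous result, forming a dependency chain.
Vec4 runChain(const Mat4& m, Vec4 v, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        v = mul(m, v);
    }
    return v;
}

// Times the chain and hands the final vector back through `out` so the
// caller can print it. `seed` should come from a runtime source.
double timeChainMicros(float seed, int iterations, Vec4& out) {
    const Mat4 m = {{ {seed, 0.0f, 0.0f, 0.0f},
                      {0.0f, seed, 0.0f, 0.0f},
                      {0.0f, 0.0f, seed, 0.0f},
                      {0.0f, 0.0f, 0.0f, 1.0f} }};
    const auto start = std::chrono::steady_clock::now();
    out = runChain(m, Vec4{1.0f, 2.0f, 3.0f, 1.0f}, iterations);
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count();
}
```

To follow the advice above, the caller would parse seed from argv[1] (for example with std::strtof) and printf all four components of out along with the elapsed time.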

33 minutes ago, noodleBowl said:
Can you explain what a wavefront, a warp, and a bank conflict are? I've never really heard these terms before.

Sure. I can save you some time by telling you warp and wavefront are synonyms. "Warp" is what NVIDIA's marketing dept calls them; "wavefront" is what AMD's marketing dept calls them.

You can also check my Where do I start Graphics Programming post for resources. Particularly the "Very technical about GPUs." section. You may find the one that says "Latency hiding in GCN" relevant.

As for bank conflicts... you got it quite right. Memory is subdivided into banks. If all threads in a wavefront access the same bank, all is ok. If each thread accesses a different bank, all is ok. But if some of the threads access the same bank, then things get slower.
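The rule above can be illustrated with a toy model. This is a simplified sketch of the banking scheme commonly documented for on-chip shared memory / LDS: consecutive 4-byte words map to consecutive banks, and threads reading the same word are served together. The bank count, the word size, and treating "same bank" as "same word" are assumptions of the model; real hardware differs, and whether this applies to constant or global memory is exactly what is debated earlier in the thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Returns how many serialized passes one wavefront-wide access needs in this
// model: the largest number of distinct words that fall into a single bank.
// 1 means conflict-free; an empty access list returns 0.
unsigned serializedPasses(const std::vector<std::uint32_t>& byteAddresses,
                          unsigned numBanks) {
    std::map<unsigned, std::set<std::uint32_t>> wordsPerBank;
    for (std::uint32_t addr : byteAddresses) {
        const std::uint32_t word = addr / 4;         // 4-byte words
        wordsPerBank[word % numBanks].insert(word);  // same word counts once
    }
    std::size_t worst = 0;
    for (const auto& entry : wordsPerBank) {
        worst = std::max(worst, entry.second.size());
    }
    return static_cast<unsigned>(worst);
}
```

With 32 threads and 32 banks, consecutive words (byte stride 4) and a single shared word both cost 1 pass, a 2-word stride costs 2 passes, and a 32-word stride puts every thread in bank 0 and costs 32 passes.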

33 minutes ago, noodleBowl said:
I'm not really sure how I feel about this, because a part of me says "good, way better!" and the other part says "it could be better...". But I feel like in order to make it better I need to start exploring different methods, such as the ones described by @Matias Goldberg and others. I'm also looking into multithreading, since I'm definitely only using a single thread.

I was just trying to point out there's always a more efficient way to do things.

But you're doing a videogame. At some point you have to say "STOP!" to yourself, or else you'll end up in an endless spiral of constantly reworking things out and never finishing your game. There's always a better way. You need to know when something is good enough.

I suggest you go for the vertex shader option I gave you (inPos * worldMatrix[vertexId / 6u]). If that gives you enough performance for what you need, move on.

Trying Compute Shader approaches also restricts your target (e.g. older GPUs won't be able to run it) which is a terrible idea for a 2D game.

Twitter: @matiasgoldberg

Distant Souls · Alliance AirWar · My Free Royalty-Free Music Library
