
Trying to find bottlenecks in my renderer

Started by December 07, 2017 07:55 AM
37 comments, last by Matias Goldberg 7 years, 1 month ago

I just finished up my 1st iteration of my sprite renderer and I'm sort of questioning its performance.

Currently, I am trying to render 10K 64x64 textured sprites in an 800x600 window. These sprites all use the same texture, vertex shader, and pixel shader, so there are basically no state changes. The sprite renderer itself is dynamic, mapping with D3D11_MAP_WRITE_NO_OVERWRITE and then D3D11_MAP_WRITE_DISCARD when the vertex buffer is full. The buffer is large enough to hold all 10K sprites and draw them in a single draw call. Cutting the buffer size down so it only fits 1000 sprites before a draw call is executed does not seem to matter / improve performance. When I clock the time it takes to complete the render method for my sprite renderer (the only renderer that is running) I'm getting about 40ms. Aside from trying to adjust the size of the vertex buffer, I have tried using a 1x1 texture and making the window smaller (640x480) as a quick and dirty check to see if the GPU was the bottleneck, but I still get 40ms in both of those cases.

I'm kind of at a loss. What are some of the ways that I could figure out where my bottleneck is?
I feel like only being able to render 10K sprites is really low, but I'm not sure. I can't tell whether I coded a poor renderer with a bottleneck somewhere or whether I'm being limited by my hardware.

Just some other info:


Dev PC specs:

GPU: Intel HD Graphics 4600 / Nvidia GTX 850M (Nvidia is set to be the preferred GPU in the Nvidia control panel. Vsync is set to off)
CPU: Intel Core i7-4710HQ @ 2.5GHz

Renderer:


//The renderer has a working depth buffer

//Sprites have matrices that are precomputed. These pretransformed vertices are placed into the buffer
Matrix4 model = sprite->getModelMatrix();
verts[0].position = model * verts[0].position;
verts[1].position = model * verts[1].position;
verts[2].position = model * verts[2].position;
verts[3].position = model * verts[3].position;
verts[4].position = model * verts[4].position;
verts[5].position = model * verts[5].position;

//Vertex buffer is flagged for dynamic use
vertexBuffer = BufferModule::createVertexBuffer(D3D11_USAGE_DYNAMIC, D3D11_CPU_ACCESS_WRITE, sizeof(SpriteVertex) * MAX_VERTEX_COUNT_FOR_BUFFER);

//The vertex buffer is mapped to when adding a sprite to the buffer
//vertexBufferMapType could be D3D11_MAP_WRITE_NO_OVERWRITE or D3D11_MAP_WRITE_DISCARD depending on the data already in the vertex buffer
D3D11_MAPPED_SUBRESOURCE resource = vertexBuffer->map(vertexBufferMapType); 
memcpy(((SpriteVertex*)resource.pData) + vertexCountInBuffer, verts, BYTES_PER_SPRITE);
vertexBuffer->unmap();

//The constant buffer used for the MVP matrix is updated once per draw call
D3D11_MAPPED_SUBRESOURCE resource = mvpConstBuffer->map(D3D11_MAP_WRITE_DISCARD);
memcpy(resource.pData, projectionMatrix.getData(), sizeof(Matrix4));
mvpConstBuffer->unmap();

Vertex / Pixel Shader:


cbuffer mvpBuffer : register(b0)
{
	matrix mvp;
}

struct VertexInput
{
	float4 position : POSITION;
	float2 texCoords : TEXCOORD0;
	float4 color : COLOR;
};

struct PixelInput
{
	float4 position : SV_POSITION;
	float2 texCoords : TEXCOORD0;
	float4 color : COLOR;
};

PixelInput VSMain(VertexInput input)
{
	input.position.w = 1.0f;

	PixelInput output;
	output.position = mul(mvp, input.position);
	output.texCoords = input.texCoords;
	output.color = input.color;

	return output;
}

Texture2D shaderTexture;
SamplerState samplerType;
float4 PSMain(PixelInput input) : SV_TARGET
{
	float4 textureColor = shaderTexture.Sample(samplerType, input.texCoords);
	
	return textureColor;
}

 

If any more info is needed, feel free to ask. I would really like to know how I can improve this, assuming I'm not hardware limited.

Look at your CPU usage in the task manager. If the rendering thread is at 100%, then your renderer is the bottleneck. If your rendering thread is not at 100%, then your GPU is the bottleneck.

2 hours ago, noodleBowl said:

Nvidia is set to be the preferred GPU in the Nvidia control panel.

Add this to some non-empty .cpp file (if I put them in an empty .cpp file, it seems to be ignored) to automatically choose the dedicated GPU instead of the integrated one:


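#include <windows.h> // for DWORD

// Both exports must be non-zero for the drivers to prefer the dedicated GPU.
extern "C" {
    __declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;
}
extern "C" {
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;
}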

This will avoid changing the Nvidia control panel for all your different builds.

🧙

What happens if you don't update sprite vertices per frame? (I assume uploading that much data is the bottleneck; you may consider uploading the transforms instead, which would be 4 values per sprite instead of 6 * 4.)

Edit: Additionally you probably should use double buffering or a ring buffer to allow some frames of latency for the GPU, if you don't already.

I tried something similar with Vulkan and Fury GPU:

Render 2 million textured boxes, vertex.w = integer index to pick the proper 4*4 matrix from a regular buffer (not uniform as usual) -> 80 fps.

I do not remember if this number was with or without per-frame upload, probably without, but the upload was definitely the bottleneck, especially because I did not use double buffering IIRC.
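For D3D11, one common way to get the ring-buffer behaviour suggested above is to append into a dynamic buffer with NO_OVERWRITE and only switch to DISCARD when the next write would not fit. Below is a minimal sketch of just the bookkeeping, assuming a hypothetical `DynamicBufferCursor` helper and a stand-in `MapMode` enum in place of the real `D3D11_MAP` values so it compiles without the Windows SDK. It is not the original poster's code.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for D3D11_MAP_WRITE_NO_OVERWRITE / D3D11_MAP_WRITE_DISCARD
// (assumption: the real code would pass the D3D11 enum values to Map).
enum class MapMode { NoOverwrite, Discard };

struct Allocation {
    std::size_t firstVertex; // offset to write at, in vertices
    MapMode mode;            // how the buffer should be mapped for this write
};

// Hypothetical helper that hands out regions of one dynamic vertex buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityInVertices)
        : capacity_(capacityInVertices) {}

    Allocation allocate(std::size_t vertexCount) {
        assert(vertexCount <= capacity_);
        if (cursor_ + vertexCount > capacity_) {
            // Not enough room left: start over at the beginning and ask for
            // a fresh copy of the buffer so in-flight data is not touched.
            cursor_ = vertexCount;
            return Allocation{0, MapMode::Discard};
        }
        // Room left: append behind data the GPU may still be reading.
        Allocation result{cursor_, MapMode::NoOverwrite};
        cursor_ += vertexCount;
        return result;
    }

private:
    std::size_t capacity_;
    std::size_t cursor_ = 0;
};
```

With something like this, each batch asks for a region once, maps once with the returned mode, and draws from `firstVertex`, so nothing already submitted gets overwritten.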

 

6 hours ago, noodleBowl said:

input.position.w = 1.0f;

Just use a float3 Position instead of float4 Position, you will get the w coordinate of 1.0f for free. Furthermore, it does not make sense to use a float4 and to immediately overwrite the w coordinate. Just use an explicit float3 to inform the compiler.

6 hours ago, noodleBowl said:

Matrix4 model = sprite->getModelMatrix();
verts[0].position = model * verts[0].position;
verts[1].position = model * verts[1].position;
verts[2].position = model * verts[2].position;
verts[3].position = model * verts[3].position;
verts[4].position = model * verts[4].position;
verts[5].position = model * verts[5].position;

A sprite is basically a quad consisting of two triangles. You can reuse the position of the shared vertices. This will reduce the number of matrix multiplications by 1/3.
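To illustrate the shared-vertex idea, here is a rough sketch that transforms only the four corners of a quad and describes the two triangles with indices. `Vec2`, `Transform2D`, `transformQuad`, and `buildQuadIndices` are hypothetical names made up for this example (a small 2D affine transform stands in for the thread's `Matrix4`), and the winding order shown is an assumption that depends on your rasterizer state.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };

// Hypothetical 2D affine transform standing in for the model matrix:
// result = (a*x + b*y + tx, c*x + d*y + ty).
struct Transform2D {
    float a, b, c, d, tx, ty;
    Vec2 apply(Vec2 p) const {
        return { a * p.x + b * p.y + tx, c * p.x + d * p.y + ty };
    }
};

// Transforms the 4 unique corners of a w-by-h quad: 4 multiplies, not 6.
std::array<Vec2, 4> transformQuad(const Transform2D& model, float w, float h) {
    const std::array<Vec2, 4> corners = {{
        {0.0f, 0.0f}, {w, 0.0f}, {w, h}, {0.0f, h}
    }};
    std::array<Vec2, 4> out{};
    for (std::size_t i = 0; i < corners.size(); ++i) {
        out[i] = model.apply(corners[i]);
    }
    return out;
}

// Builds a static index list: two triangles (0,1,2) and (0,2,3) per quad.
// 16-bit indices can address 65536 vertices, i.e. up to 16384 quads.
std::vector<std::uint16_t> buildQuadIndices(std::size_t quadCount) {
    assert(quadCount * 4 <= 65536);
    const std::uint16_t pattern[6] = {0, 1, 2, 0, 2, 3};
    std::vector<std::uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const auto base = static_cast<std::uint16_t>(q * 4);
        for (std::uint16_t offset : pattern) {
            indices.push_back(static_cast<std::uint16_t>(base + offset));
        }
    }
    return indices;
}
```

The index list never changes, so it can be built once at startup and only the four vertices per sprite need to be written each frame.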

6 hours ago, noodleBowl said:

When I clock the time it takes to complete the render method for my sprite renderer (the only renderer that is running) I'm getting about 40ms. Aside from trying to adjust the size of the vertex buffer, I have tried using 1x1 texture and making the window smaller (640x480) as quick and dirty check to see if the GPU was the bottleneck, but I still get 40ms with both of those cases.

If you skip the draw, do you still have +- 40ms? If this is the case, skip the map/unmaps as well. If you still have 40ms, your CPU is definitely the culprit (and not the code that you are showing).

🧙

7 hours ago, Michael Aganier said:

Look at your CPU usage in the task manager. If the rendering thread is at 100%, then your renderer is the bottleneck. If your rendering thread is not at 100%, then your GPU is the bottleneck.

It's almost 2018. Update your Windows 10 and your Task Manager will be able to show GPU usage % and GPU memory usage.

13 hours ago, noodleBowl said:

When I clock the time it takes to complete the render method for my sprite renderer (the only renderer that is running) I'm getting about 40ms. 

Is that just the map, memcpy, unmap shown above? Or does it involve drawing / Present too? 

Add more detail to the timing - see if you can find which specific function is using most of that time. Also measure how long Present is taking. 
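One way to get that finer-grained timing is a small scoped timer that accumulates milliseconds per named section. This is only a sketch using `std::chrono::steady_clock`; `SectionTimes` and `ScopedTimer` are hypothetical names, not anything from the poster's engine.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Accumulates elapsed milliseconds per named section.
class SectionTimes {
public:
    void add(const std::string& name, double ms) { totals_[name] += ms; }

    // Returns 0.0 for a section that was never timed.
    double total(const std::string& name) const {
        auto it = totals_.find(name);
        return it == totals_.end() ? 0.0 : it->second;
    }

private:
    std::map<std::string, double> totals_;
};

// Measures from construction to destruction and reports to a SectionTimes.
class ScopedTimer {
public:
    ScopedTimer(SectionTimes& sink, std::string name)
        : sink_(sink),
          name_(std::move(name)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start_;
        sink_.add(name_, elapsed.count());
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    SectionTimes& sink_;
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping the per-sprite setup, the map/memcpy/unmap, the Draw call, and Present in separate scopes (for example `{ ScopedTimer t(times, "map"); ... }`) and logging the totals once per frame should show which part the time actually goes to.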

Thanks for all the responses! Tried to cover everything, let me know if I missed something

20 hours ago, Michael Aganier said:

Look at your CPU usage in the task manager. If the rendering thread is at 100%, then your renderer is the bottleneck. If your rendering thread is not at 100%, then your GPU is the bottleneck.

12 hours ago, Zaoshi Kaba said:

It's almost 2018. Update your Windows 10 and your Task Manager will be able to show GPU usage % and GPU memory usage.

Not sure how helpful this is, but looking at my task manager it says:


CPU: ~21% (Amount used by my application. Not total CPU usage)
GPU 0 [Intel HD Graphics]: ~11%
GPU 1 [NVidia GeForce GTX 850M]: ~18%

This is rendering 10K sprites with a 64x64 texture in a 800x600 window

 

14 hours ago, JoeJ said:

What happens if you don't update sprite vertices per frame? (I assume uploading that much data is the bottleneck, you may consider uploading the transforms instead, which would be 4 values per sprite instead 6 * 4.)

So I don't think this is exactly what you mean, but speaking from a map/unmap standpoint, if I move things around and only map once per draw call, my time goes down to 25ms. To do this I created an intermediate array that is the same size as my vertex buffer. I place my sprite data into this intermediate array, and when I need to draw I just do a memcpy straight into the vertex buffer.


//Created at Sprite Renderer init
vertices = new SpriteVertex[MAX_VERTEX_COUNT_FOR_BUFFER];

//Inside of my function that flushes the buffer
resource = vertexBuffer->map(vertexBufferMapType);
memcpy(resource.pData, vertices, vertexCountInBuffer * sizeof(SpriteVertex));
vertexBuffer->unmap();

graphicsDevice->getDeviceContext()->Draw(vertexCountToDraw, vertexCountDrawnOffset);

 

14 hours ago, matt77hias said:

Just use a float3 Position instead of float4 Position, you will get the w coordinate of 1.0f for free. Furthermore, it does not make sense to use a float4 and to immediately overwrite the w coordinate. Just use an explicit float3 to inform the compiler.

Currently my SpriteVertex class is using a float3 for the position on the CPU side.


class SpriteVertex
{

public:
	SpriteVertex();
	SpriteVertex(Vector3 position, Vector2 texCoords, Color color);
	~SpriteVertex();
	Vector3 position;
	Vector2 texCoords;
	Color color;
};

On the shader side I have it as float4 because of the MVP matrix. Changing the position to float3 (shader side) makes the window just show red. I assume I'm super zoomed into the sprites or something. I removed the unneeded input.position.w = 1.0f though.

14 hours ago, matt77hias said:

A sprite is basically a quad consisting of two triangles. You can reuse the position of the shared vertices. This will reduce the number of matrix multiplications by 1/3.

Currently I have no index buffer set up, so I will have to go back and try this out. I do believe this would help a little bit at the very least, because you are right, I would do fewer matrix calculations this way.

14 hours ago, matt77hias said:

If you skip the draw, do you still have +- 40ms? If this is the case, skip the map/unmaps as well. If you still have 40ms, your CPU is definitely the culprit (and not the code that you are showing).

So if I comment out the Draw call I still have ~40ms. If I also take out the map/unmap calls I get around ~36ms. So there is a minor difference, but I'm starting to think my CPU is the issue.
 

8 hours ago, Hodgman said:

Is that just the map, memcpy, unmap shown above? Or does it involve drawing / Present too? 

Add more detail to the timing - see if you can find which specific function is using most of that time. Also measure how long Present is taking.

The 40ms time is just the cost of doing the render, so this is just the Draw and map/unmap calls. When I time this function, I do it like so:


void SpriteRenderer::render(double deltaTime)
{
	//Get the start time
	QueryPerformanceCounter(&startTime);

	renderStart(); //Setup/reset since other renderers may have run. Only this renderer is running
	sortRenderList(); //This is only done once. On the first frame. Only sorting by texture too
	
	Sprite* sprite = nullptr;
	for (std::vector<Sprite*>::iterator i = renderList.begin(); i != renderList.end(); ++i)
	{
		sprite = (*i);
		if (sprite->isVisible() == false)
			continue;

		//Put the sprite into the buffer. This is where the map/unmap calls are
		addToVertexBuffer(sprite);
	}

	//Draw the sprites that were placed in the buffer. Draw call is here
	flushVertexBuffer();
      
	//Get the end time and calculate how long it took
	QueryPerformanceCounter(&endTime);
	Logger::info("RENDER TIME: " + std::to_string(((endTime.QuadPart - startTime.QuadPart) * 1000) / frq.QuadPart));

}

void SpriteRenderer::addToVertexBuffer(Sprite* sprite)
{
	Texture* spriteTexture = sprite->getTexture();
	if (spriteTexture != boundTexture)
	{
		flushVertexBuffer();
		bindTexture(spriteTexture);
	}

	if (vertexCountInBuffer == MAX_VERTEX_COUNT_FOR_BUFFER)
	{
		flushVertexBuffer();
		vertexCountInBuffer = 0;
		vertexCountDrawnOffset = 0;
		vertexBufferMapType = D3D11_MAP_WRITE_DISCARD;
	}

  	/* Code to setup the sprite. Vertex transform, flipping, applying texture clip rect, etc */
  
  	//Put the sprite in the buffer
	D3D11_MAPPED_SUBRESOURCE resource = vertexBuffer->map(vertexBufferMapType);
	memcpy(((SpriteVertex*)resource.pData) + vertexCountInBuffer, verts, BYTES_PER_SPRITE);
	vertexBuffer->unmap();

	vertexCountToDraw += VERTEX_PER_QUAD;
	vertexCountInBuffer += VERTEX_PER_QUAD;
	vertexBufferMapType = D3D11_MAP_WRITE_NO_OVERWRITE;
}

void SpriteRenderer::renderStart()
{
	graphicsDevice = GraphicsDeviceModule::getGraphicsDevice();
	graphicsDevice->getDeviceContext()->VSSetShader(defaultVertexShader->getShader(), 0, 0);
	graphicsDevice->getDeviceContext()->VSSetConstantBuffers(0, 1, mvpConstBuffer->getBuffer());
	graphicsDevice->getDeviceContext()->PSSetShader(defaultPixelShader->getShader(), 0, 0);
	graphicsDevice->getDeviceContext()->IASetInputLayout(inputLayout->getInputLayout());
	graphicsDevice->getDeviceContext()->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
	graphicsDevice->getDeviceContext()->IASetVertexBuffers(0, 1, vertexBuffer->getBuffer(), &STRIDE_PER_VERTEX, &VERTEX_BUFFER_OFFSET);

	boundTexture = nullptr;
}

Now the one thing I'm not sure about is whether, when I time like the above (using QueryPerformanceCounter), I am really timing my method calls or just timing how long they take to return. This probably makes more sense with something like timing Present:


QueryPerformanceCounter(&startTime);
GraphicsDeviceModule::getGraphicsDevice()->present();
QueryPerformanceCounter(&endTime);
Logger::info("PRESENT TIME: " + std::to_string(((endTime.QuadPart - startTime.QuadPart) * 1000) / frq.QuadPart));

Did I just time how long it really takes to present everything to the screen, or did I just time how long it took to post the command to the GPU? I think I'm timing how long it takes to return, since my time comes back as 0ms.
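Separately from whether Present returns before the GPU has finished, part of that 0ms reading could simply come from the logging expression: it divides integers, so anything under 1ms is truncated to 0. A small sketch of the difference follows, with hypothetical helper names and plain 64-bit integers standing in for the `QuadPart` values.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the integer expression used in the logging code:
// ((end - start) * 1000) / frequency. Sub-millisecond results become 0.
std::int64_t ticksToWholeMilliseconds(std::int64_t startTicks,
                                      std::int64_t endTicks,
                                      std::int64_t ticksPerSecond) {
    assert(ticksPerSecond > 0);
    return ((endTicks - startTicks) * 1000) / ticksPerSecond;
}

// Same conversion in floating point, which keeps the fractional part.
double ticksToMilliseconds(std::int64_t startTicks,
                           std::int64_t endTicks,
                           std::int64_t ticksPerSecond) {
    assert(ticksPerSecond > 0);
    return static_cast<double>(endTicks - startTicks) * 1000.0 /
           static_cast<double>(ticksPerSecond);
}
```

Logging the floating-point value would at least show whether the call takes, say, half a millisecond or something much closer to zero.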

1 hour ago, noodleBowl said:

Not sure how helpful this is, but looking at my task manager its says:



CPU: ~21% (Amount used by my application. Not total CPU usage)
GPU 0 [Intel HD Graphics]: ~11%
GPU 1 [NVidia GeForce GTX 850M]: ~18%

You have to look at individual cores. If your total CPU is at 21%, one of the cores might be running at 100%, and that single core is what pushes the average up.

The GPU usage is not important because we are not measuring the performance of the GPU. We are measuring the performance of your renderer to prepare instructions on the CPU.

Knowing the usage of the rendering thread is important because if it is at 100%, it means that the GPU is waiting for more instructions because you're not sending them fast enough or in such a way that the GPU can parallelize them. If this is the case, you have a 100% confirmation that the problem is your renderer.

This is an answer to your first question:

23 hours ago, noodleBowl said:

What are some of the ways that I could figure out where my bottleneck is?

 

BTW if you use Visual Studio, you can use the built-in profiler. This will give you a rough idea of the methods taking most of the time. Furthermore, it does not use your timer, so you can rule out the issues you suspect with your timer.

🧙

