Introduction
In modern projects, to produce a nice looking scene the engine will render thousands of different objects: characters, buildings, landscape, effects and more. Of course, there are several ways to render geometry on the screen. In this article, we consider how to do that effectively, measure and compare the cost of rendering API calls - note however that only the CPU load it measured, which may not be 100% accurate but can give an idea of the relative costs. Consider the cost of API calls:
- state changes (frame buffers, vertex buffers, shaders, constants, textures)
- different types of geometry instancing and compare their performance
- several practical examples of how one should optimize geometry render in projects.
I will cover only the OpenGL API. I will not describe details, parameters and variations of each API call. There are reference books and manuals for this purpose. Computer configuration for all tests: Intel Core i5-4460 3.2GHz., Radeon R9 380.
Measurements:- The right way to correct performance measurements is to catch the overall frame timing.
Even if there are some extra things like clearing back buffer & fbo binding.
Exact equation: AVG_TEST_TIME = (CURRENT_TIME_AFTER_N_ITERATIONS - TIME_WHEN_WE_STARTED_FIRST_ITERATION) / NUMBER_OF_ITERATIONS
- In all calculations, time is in ms.
- Perform one test several times and calculate average time. NUMBER_OF_ITERATIONS should be pretty large: 500-1000. Otherwise we get too large difference from launch to launch.
- Measuring instancing time we must sure that GPU is not bottleneck. As we want to know just CPU time. And how it scales when use different amount of instances.
States changing
We want to see 'reach' picture on the screen, a lot of unique objects with a lot of details. For this purpose engine takes all visible objects in camera, sets their parameters (vertex buffers, shaders, material parameters, textures, etc.) and send them to render. All these actions performed with special API commands. Let's consider them, make some tests to understand how to organize the rendering process optimally.
Let's measure the cost of different OpenGL calls: dip (draw index primitive), change of shaders, vertex buffers, textures, shader parameters.
Dips
Dip (draw indexed primitive) -- command to GPU to render a bunch of geometry, more often triangles. Off course first we need to tell - what geometry we want to show, with what shader, set some options. But dip renders geometry; all other commands just describe parameters of what we want to show. The dip's price usually includes all related state changes - not only one command. Of course, all depends on the amount of state changes. First, consider the simplest case - cost of one thousand simple dips, without state changes.
void simple_dips()
{
glBindVertexArray(ws_complex_geometry_vao_id); //what geometry to render
simple_geometry_shader.bind(); //with what shader
//a lot of simple dips
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i+1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES*sizeof(int))); //simple dip
}
For 1? dips we get 0.41 ms.
Frame buffer change
FBO (frame buffer object) -- is an object, which allows rendering image not to the screen, but to another surface, which lately one could use as texture in shaders. Fbo changes not so often as other elements, but at the same time, the change cost is quite expensive for the CPU.
void fbo_change_test()
{
//clear FBO
glViewport(0, 0, window_width, window_height);
glClearColor(0.0f / 255.0f, 0.0f / 255.0f, 0.0f / 255.0f, 0.0);
for (int i = 0; i < NUM_DIFFERENT_FBOS; i++)
{
glBindFramebuffer(GL_FRAMEBUFFER, fbo_buffer[i % NUM_DIFFERENT_FBOS]);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
}
//prepare dip
glBindVertexArray(ws_complex_geometry_vao_id);
simple_geometry_shader.bind();
//bind FBO, render one object... repeat N times
for (int i = 0; i < NUM_FBO_CHANGES; i++)
{
glBindFramebuffer(GL_FRAMEBUFFER, fbo_buffer[i % NUM_DIFFERENT_FBOS]); //bind fbo
glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}
glBindFramebuffer(GL_FRAMEBUFFER, 0); //set rendering to the screen
}
For 200 fbo changes we get 1.97 ms.
One needs to change FBO usually for post effects and different passes, like: reflections, rendering into cubemap, creating virtual textures, etc. Many things like virtual textures could be organized as atlases, to set FBO only once and change for example just viewport. Render in cubemap might be replaced on another technique. For example on dual paraboloid rendering. The matter of course, not only in FBO changes, but in the number of passes of scene rendering, material changes, etc. In general, the less state changes the better.
Shader changes
Shaders usually describe one of the scene's materials or effect techniques. The more materials, kinds of surfaces the more shaders. Several materials might vary slightly. These should be combined into one and switching between them make as condition in the shader, The number of materials directly influence on dips amount.
void shaders_change_test()
{
glBindVertexArray(ws_complex_geometry_vao_id);
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
{
simple_color_shader[i%NUM_DIFFERENT_SIMPLE_SHADERS].bind(); //bind certain shader
glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}
}
For 1000 shader changes we get 2.90 ms.
Changing shader here also includes transferring world-view-proj matrix as a parameter. Otherwise we could not render anything. Cost of parameters changing we measure in next step.
Shader parameters changing
Often materials make universal with a lot of options to get different kinds of materials. An easy way to make a variety of pictures, each character/object unique. We need somehow transfer to shader these parameters. This could be done with API commands glUniform*.
uniforms_changes_test_shader.bind();
glBindVertexArray(ws_complex_geometry_vao_id);
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
{
//set uniforms for this dip
for (int j = 0; j < NUM_UNIFORM_CHANGES_PER_DIP; j++)
glUniform4fv(ColorShader_uniformLocation[j], 1, randomColors[(i*NUM_UNIFORM_CHANGES_PER_DIP + j) % MAX_RANDOM_COLORS].x);
glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}
It is not optimal to set parameters individually for each instance/object. Usually all instance data might be packed into 1 large buffer and transferred to gpu with one command. It only remains for each object to set a shift - where it's data placed.
//copy data to ssbo buffer
glBindBuffer(GL_SHADER_STORAGE_BUFFER, instances_uniforms_ssbo);
float *gpu_data = (float*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, CURRENT_NUM_INSTANCES * NUM_UNIFORM_CHANGES_PER_DIP * sizeof(vec4), GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(gpu_data, all_instances_uniform_data[0], CURRENT_NUM_INSTANCES * NUM_UNIFORM_CHANGES_PER_DIP * sizeof(vec4)); //copy instances data
//bind for shader to 0 point (shader will read data from this link point)
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, instances_uniforms_ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
//render
uniforms_changes_ssbo_shader.bind();
glBindVertexArray(ws_complex_geometry_vao_id);
static int uniformsInstancing_data_varLocation = glGetUniformLocation(uniforms_changes_ssbo_shader.programm_id, instance_data_location);
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
{
//set parameter to sahder - where object data located
glUniform1i(uniformsInstancing_data_varLocation, i*NUM_UNIFORM_CHANGES_PER_DIP);
glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}
Uniforms changes test takes 1.27 ms. Same test for shader parameters transferring using SSBO takes 0.80 ms.
Using glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_WRITE_ONLY); causes CPU and GPU synchronization which should be avoided. One should use glMapBufferRange with flag GL_MAP_UNSYNCHRONIZED_BIT, to prevent synchronization. But programmer should guaranty that overwriting data arren't using by GPU right now. Otherwise we get bugs as we rewriting data which are reading by GPU now. To completely resolve this problem, use triple buffering. When we use current buffer for writing data, the rest 2 uses GPU. Plus there is more optimal mapping buffer method with flags GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT.
Changing vertex buffers
There are a lot of objects with different geometries in the scene. This geometry is usually placed in different vertex buffers. To render another object with different geometry - even with the same material - we need to change vertex buffer. There are techniques which allow effectively rendering different geometry with same material with only one dip: MultiDrawIndirect, Dynamic vertex pulling. Such geometry should be placed in one buffer.
void vbo_change_test()
{
simple_geometry_shader.bind();
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
{
glBindVertexArray(separate_geometry_vao_id[i % NUM_SIMPLE_VERTEX_BUFFERS]); //change vbo
glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}
}
For 1000 VBO changes we get 0.95 ms.
Textures changes
Textures give surfaces a detailed view. You can get a very large variety in the picture simply by changing the textures, blending different textures in the shader. Textures have to be changed frequently, but you can put them in the so-called texture array, to bind it only once for lots of dips and access to textures through an index in the shader. Same geometry with different textures might be rendered using instancing.
void textures_change_test()
{
glBindVertexArray(ws_complex_geometry_vao_id);
int counter = 0;
//switch between tests
if (test_type == ARRAY_OF_TEXTURES_TEST)
{
array_of_textures_shader.bind();
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
{
//bind textures for this dip
for (int j = 0; j < NUM_TEXTURES_IN_COMPLEX_MATERIAL; j++)
{
glActiveTexture(GL_TEXTURE0 + j);
glBindTexture(GL_TEXTURE_2D, array_of_textures[counter % TEX_ARRAY_SIZE]);
glBindSampler(j, Sampler_linear); counter++;
}
glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}
}
else if (test_type == TEXTURES_ARRAY_TEST)
{
//bind texture aray for all dips
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D_ARRAY, texture_array_id);
glBindSampler(0, Sampler_linear);
//variable to tell shader - what textures uses this dip
static int textureArray_usedTex_varLocation = glGetUniformLocation(textureArray_shader.programm_id, used_textures_i);
textureArray_shader.bind();
float used_textures_i[6];
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
{
//fill data - what textures uses this dip
for (int j = 0; j < 6; j++)
{
used_textures_i[j] = counter % TEX_ARRAY_SIZE;
counter++;
}
glUniform1fv(textureArray_usedTex_varLocation, 6, used_textures_i[0]); //transfer to shader, tell what textures this material uses
glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}
}
}
Simple textures changes test (for 1000 dips) takes 3.27 ms. But we make NUM_TEXTURES_IN_COMPLEX_MATERIAL textures changes per dip. We should take into account this later, calculating glBindTexture cost. Using texture array performing same test we get 0.87 ms.
Comparative estimation of state changes
Below is a table with the execution cost/time of all performed tests.
Table 1. State changes tests time
Test typeSIMPLE_DIPS_TEST0.41FBO_CHANGE_TEST 1.97SHADERS_CHANGE_TEST2.90UNIFORMS_SIMPLE_CHANGE_TEST1.27UNIFORMS_SSBO_CHANGE_TEST0.80VBO_CHANGE_TEST0.95 ARRAY_OF_TEXTURES_TEST3.27TEXTURES_ARRAY_TEST0.87Using this results we are able to calculate API call cost. Absolute cost per 1000 API calls. Relative cost calculate in relation to the simple dip call (glDrawRangeElements).
Table 2. API call cost (per 1k calls)
API callAbsolute costRelative cost %glBindFramebuffer 9.442314%glUseProgram2.49610%glBindVertexArray 0.54132%glBindTexture0.48116%glDrawRangeElements 0.41100%glUniform4fv0.0921%Of course, one should be very cautious to measurements as they will change depending on the version of the driver and hardware.
Instancing
Instancing invented to quickly render the same geometry with different parameters. Each object has a unique index according to which we can take desired for this object parameters in she shader, vary some options, etc. Main advantage of using instancing - we can greatly reduce the number of dips.
We can pack all instances parameters in the buffer, transfer them to GPU and make just one dip. Storing data in the buffer is a good optimization itself - we saving on what it is not necessary to constantly change the shader parameters. Also, if instance data do not change (for example we exactly know that it is static geometry), we don't need to transfer data to GPU every frame, actually just once at program/level start. In general, for optimal rendering we should first to pack all instances data to one buffer, transfer them to GPU with one command. For each dip, type og geometry - just set the offset where to find instances data for this dip. Using instance index (gl_InstanceID in OpenGL) we able to sample certain data for this instance/object.
There are a lot of ways to store data in OpenGL: vertex buffer (VBO), uniform buffer (UBO), texture buffer (TBO), shader storage buffer (SSBO), textures. There are various features for each buffer type. Consider that.
Texture instancing
All data stored in the texture. To effectively change data in texture one should use special structures - Pixel Buffer Object (PBO) which allow transferring data asynchronously from CPU to GPU. CPU does not wait until the data will be transferred and continues to work.
Creation code:
glGenBuffers(2, textureInstancingPBO);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[0]); //GL_STREAM_DRAW_ARB means that we will change data every frame
glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[1]);
glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
//create texture where we will store instances data on gpu
glGenTextures(1, textureInstancingDataTex);
glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_R, GL_REPEAT); //in each line we store NUM_INSTANCES_PER_LINE object's data. 128 in our case
//for each object we store PER_INSTANCE_DATA_VECTORS data-vectors. 2 in our case
//GL_RGBA32F, we have float32 data
//complex_mesh_instances_data source data of instances, if we are not going to update data in the texture
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, NUM_INSTANCES_PER_LINE * PER_INSTANCE_DATA_VECTORS, MAX_INSTANCES / NUM_INSTANCES_PER_LINE, 0, GL_RGBA, GL_FLOAT, &complex_mesh_instances_data[0]);
glBindTexture(GL_TEXTURE_2D, 0);
Texture update:
glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex);
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[current_frame_index]); // copy pixels from PBO to texture object
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, NUM_INSTANCES_PER_LINE * PER_INSTANCE_DATA_VECTORS, MAX_INSTANCES / NUM_INSTANCES_PER_LINE, GL_RGBA, GL_FLOAT, 0); // bind PBO to update pixel values
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[next_frame_index]);
//http://www.songho.ca/opengl/gl_pbo.html
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// until GPU to finish its job. To avoid waiting (idle), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
gpu_data = (float*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY_ARB);
if (gpu_data)
{
memcpy(gpu_data, complex_mesh_instances_data[0], INSTANCES_DATA_SIZE); // update data
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER); //release pointer to mapping buffer
}
Rendering using texture instancing:
//bind texture with instances data
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex);
glBindSampler(0, Sampler_nearest);
glBindVertexArray(geometry_vao_id); //what geometry to render
tex_instancing_shader.bind(); //with what shader
//tell shader texture with data located, what name it has
static GLint location = glGetUniformLocation(tex_instancing_shader.programm_id, s_texture_0);
if (location >= 0)
glUniform1i(location, 0); //render group of objects
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);
Vertex shader to access the data:
#version 150 core
in vec3 s_pos;
in vec3 s_normal;
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix;
uniform sampler2D s_texture_0; out vec2 uv; out vec3 instance_color; void main() { const vec2 texel_size = vec2(1.0 / 256.0, 1.0 / 16.0); const int objects_per_row = 128; const vec2 half_texel = vec2(0.5, 0.5); //calc texture coordinates - where our instance data located //gl_InstanceID % objects_per_row - index of object in the line //multiple by 2 as each object has 2 vectors of data //gl_InstanceID / objects_per_row - in what line our data located //multiple by texel_size gieves us 0..1 uv to sample from texture from interer texel id vec2 texel_uv = (vec2((gl_InstanceID % objects_per_row) * 2, floor(gl_InstanceID / objects_per_row)) + half_texel) * texel_size; vec4 instance_pos = textureLod(s_texture_0, texel_uv, 0); instance_color = textureLod(s_texture_0, texel_uv + vec2(texel_size.x, 0.0), 0).xyz; uv = s_uv; gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0); }
Instancing through vertex buffer
Idea is to keep instance data in separate vertex buffer and have an axes to them in shader through vertex attributes. Code of buffer creation with data itself is trivial. Our main task is to modify information about vertex for shader (vertex declaration, vdecl)
//...code of base vertex declaration creation
//special atributes binding
glBindBuffer(GL_ARRAY_BUFFER, all_instances_data_vbo); //size of per instance data (PER_INSTANCE_DATA_VECTORS = 2 - so we have to create 2 additional attributes to transfer data)
const int per_instance_data_size = sizeof(vec4) * PER_INSTANCE_DATA_VECTORS;
glEnableVertexAttribArray(4); //4th vertex attribute, has 4 floats, 0 data offset
glVertexAttribPointer((GLuint)4, 4, GL_FLOAT, GL_FALSE, per_instance_data_size, (GLvoid*)(0)); //tell that we will change this attribute per instance, not per vertex
glVertexAttribDivisor(4, 1);
glEnableVertexAttribArray(5); //5th vertex attribute, has 4 floats, sizeof(vec4) data offset
glVertexAttribPointer((GLuint)5, 4, GL_FLOAT, GL_FALSE, per_instance_data_size, (GLvoid*)(sizeof(vec4)));
//tell that we will change this attribute per instance, not per vertex
glVertexAttribDivisor(5, 1);
Rendering code:
vbo_instancing_shader.bind(); //our vertex buffer wit modified vertex declaration (vdecl)
glBindVertexArray(geometry_vao_vbo_instancing_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);
Vertex shader to access data:
#version 150 core
in vec3 s_pos;
in vec3 s_normal;
in vec2 s_uv;
in vec4 s_attribute_3; //some_data;
in vec4 s_attribute_4; //instance pos
in vec4 s_attribute_5; //instance color
uniform mat4 ModelViewProjectionMatrix;
out vec3 instance_color;
void main()
{
instance_color = s_attribute_5.xyz;
gl_Position = ModelViewProjectionMatrix * vec4(s_pos + s_attribute_4.xyz, 1.0);
}
Uniform buffer instancing, Texture buffer instancing, Shader Storage buffer instancing
These three methods are very similar to each other. They differ mostly by buffer type. Uniform buffer (UBO) characterized by small size, but it should theoretically be faster than the others. Texture buffer (TBO) has very big size. We able to store all scene instances data into it, skeletal transformations. Shader Storage Buffer (SSBO) has both properties - fast with a large size. Also, we can write data to it. The only thing - it is new extension, and the old hardware does not support it.
Uniform buffer
Creation code:
glGenBuffers(1, dips_uniform_buffer);
glBindBuffer(GL_UNIFORM_BUFFER, dips_uniform_buffer);
glBufferData(GL_UNIFORM_BUFFER, INSTANCES_DATA_SIZE, &complex_mesh_instances_data[0], GL_STATIC_DRAW);
//uniform_buffer_data
glBindBuffer(GL_UNIFORM_BUFFER, 0);
//bind iniform buffer with instances data to shader
ubo_instancing_shader.bind(true);
GLint instanceData_location3 = glGetUniformLocation(ubo_instancing_shader.programm_id, "instance_data");
//link to shader
glUniformBufferEXT(ubo_instancing_shader.programm_id, instanceData_location3, dips_uniform_buffer); //actually binding
Instancing vertex shader with uniform buffer:
#version 150 core
#extension GL_ARB_bindable_uniform : enable
#extension GL_EXT_gpu_shader4 : enable
in vec3 s_pos;
in vec3 s_normal;
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix;
bindable uniform vec4 instance_data[4096]; //our uniform with instances data
out vec3 instance_color;
void main()
{
vec4 instance_pos = instance_data[gl_InstanceID*2];
instance_color = instance_data[gl_InstanceID*2+1].xyz;
gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
}
Texture Buffer
Creation code:
tbo_instancing_shader.bind();
//bind to shader as special texture
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_BUFFER, dips_texture_buffer_tex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, dips_texture_buffer);
glBindVertexArray(geometry_vao_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);
Vertex shader:
#version 150 core
#extension GL_EXT_bindable_uniform : enable
#extension GL_EXT_gpu_shader4 : enable
in vec3 s_pos;
in vec3 s_normal;
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix;
uniform samplerBuffer s_texture_0; //our TBO texture bufer
out vec3 instance_color;
void main()
{
//sample data from TBO
vec4 instance_pos = texelFetch(s_texture_0, gl_InstanceID*2);
instance_color = texelFetch(s_texture_0, gl_InstanceID*2+1).xyz;
gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
}
SSBO
Creation code:
glGenBuffers(1, ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
glBufferData(GL_SHADER_STORAGE_BUFFER, INSTANCES_DATA_SIZE, complex_mesh_instances_data[0], GL_STATIC_DRAW);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0); // unbind
Render:
//bind ssbo_instances_data, link to shader at 0 binding point
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo_instances_data);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo_instances_data);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
ssbo_instancing_shader.bind();
glBindVertexArray(geometry_vao_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);
glBindVertexArray(0);
Vertex shader:
#version 430
#extension GL_ARB_shader_storage_buffer_object : require
in vec3 s_pos;
in vec3 s_normal;
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix;
//ssbo should be binded to 0
binding point layout(std430, binding = 0)
buffer ssboData { vec4 instance_data[4096]; };
out vec3 instance_color;
void main()
{
//gl_InstanceID is unique for each instance. So we able to set per instance data
vec4 instance_pos = instance_data[gl_InstanceID*2];
instance_color = instance_data[gl_InstanceID*2+1].xyz;
gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
}
Uniforms instancing
Pretty simple. We have ability to set with special commands (glUniform*) several vectors with data to shader. Maximum amount depends on video card. Get the maximum number possible by calling glGetIntegerv with GL_MAX_VERTEX_UNIFORM_VECTORS parameter. For R9 380 will return 4096. Minimum value is 256.
uniforms_instancing_shader.bind();
glBindVertexArray(geometry_vao_id);
//variable - where in shader our array of uniforms located. We will write data to this array
static int uniformsInstancing_data_varLocation = glGetUniformLocation(uniforms_instancing_shader.programm_id, instance_data);
//instances data might be written with just one call if there are enough vectors.
//Just for clarity, divide into groups, because usually much more there are much more data than available uniforms.
for (int i = 0; i < UNIFORMS_INSTANCING_NUM_GROUPS; i++)
{
//write data to uniforms
glUniform4fv(uniformsInstancing_data_varLocation, UNIFORMS_INSTANCING_MAX_CONSTANTS_FOR_INSTANCING, complex_mesh_instances_data[i*UNIFORMS_INSTANCING_MAX_CONSTANTS_FOR_INSTANCING].x);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, UNIFORMS_INSTANCING_OBJECTS_PER_DIP);
}
Multi draw indirect
Separately consider a command that allows drawing a huge number of dips for one call. This is a very useful command which allows rendering a group of instances with different geometry, even thousands of different groups with one command. As an input, it receives an array that describes the parameters of dips: the number of indexes, shifting in vertex buffers, amount of instances per group. The restriction is that the entire geometry should be placed in one vertex buffer and rendered with one shader. Additional plus is that we can fill information about dips for MultiDraw command on GPU side, which is very useful for GPU frustum culling for example.
//fill indirect buffer with dips information. Just simple array
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++)
{
multi_draw_indirect_buffer.vertexCount = BOX_NUM_INDICES;
multi_draw_indirect_buffer.instanceCount = 1;
multi_draw_indirect_buffer.firstVertex = i*BOX_NUM_INDICES;
multi_draw_indirect_buffer.baseVertex = 0;
multi_draw_indirect_buffer.baseInstance = 0;
}
glBindVertexArray(ws_complex_geometry_vao_id);
simple_geometry_shader.bind();
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (GLvoid*)multi_draw_indirect_buffer[0], //our information about dips
CURRENT_NUM_INSTANCES, //number of dips 0);
glMultiDrawElementsIndirect command performs several glDrawElementsInstancedIndirect in one call. There is an unpleasant feature in the behavior of this command. Each such group (glDrawElementsInstancedIndirect) will have independent gl_InstanceID, i.e. each time it drops to 0 with new Draw*. Which makes difficult to access required per instance data. This problem solves by modifying vertex declaration of each type of objects being sent to the renderer. You can read an article about it Surviving without gl_DrawID. It is worth noting that glMultiDrawElementsIndirect performed huge number of dips with a single command. You don't need to compare this command with the other types of instancing.
Performance comparison of different types of instancing
Table 3. Instancing tests performance. Number of iterations = 100, top-amount of instances. cpu time (gpu time)
Instancing type50100200UBO_INSTANCING 0.35 (0.10)0.37 (0.13)0.36 (0.24)TBO_INSTANCING0.72 (0.11)0.73 (0.13)0.73 (0.25) SSBO_INSTANCING0.37 (0.09)0.40 (0.13)0.38 (0.24)VBO_INSTANCING0.36 (0.09)0.37 (0.12)0.37 (0.24)TEXTURE_INSTANCING0.38 (0.10)0.39 (0.13)0.39 (0.24) UNIFORMS_INSTANCING0.41 (0.13)0.52 (0.27)0.74 (0.51)MULTI_DRAW_INDIRECT0.63 (0.53) 1.17 (1.01)2.10 (1.93)UBO, VBO, SSBO, TEXTURE instancing types have pretty much the same 'good' timing.
TBO instancing allows to store huge amount of information, but it is slow in comparison with UBO. If possible, you should use SSBO storage. It is fast, handy and has a huge size.
Texture instancing is also a good alternative to UBO. Supported by the old hardware, you can store any amount of information. A little uncomfortable to update.
Transfering data each frame through glUniform* obviously is the slowest instancing method.
glMultiDrawElementsIndirect in tests performed 5?, 10? ? 20? dips ! But we tested repetition of test. Such amount of dips might be done by just one call. The only thing - with so many dips, an array with dips description will be pretty huge (better to use GPU for to create description).
Recommendations for optimization and conclusions
In this paper we make an analysis of API calls, measured different types of instancing performance. In general, the less state switches, the better. Use the newest features the latest version of the API: textures array, SSBO, Draw Indirect, mapping buffers with GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT flags for fast data transferring. Recommendations:
- The less states changes the better. One should group objects by material.
- You may wrap state changes (textures, buffers, shaders and other states). Check if state really changed before API call because it is much slower than just flag/index checking.
- Unite geometry in one buffer.
- Use texture arrays.
- Store data in large buffers and textures
- Use as little shaders as possible. But too complicated/universal shader with many branches obviously will be a problem. Especially on older video cards, where branching is expensive.
- Use instancing
- Use Draw Indirect if it is possible and generate information about dips on GPU side.
Some general advice:
- It is necessary to calculate bottlenecks and optimize them first.
- You need to know what limits performance - CPU or GPU and optimize it.
- Don't make work twice. Reuse results of different passes, reuse previous frames result (reprojection techniques, sorting, tracing, anything).
- Difficult calculation might be precalculated
- The best optimization - not to do the work
- Use parallel calculations: split work into parts and do them on parallel threads.
Source code of all examples. [attachment=34883:GL_API_overhead.rar]
Links:
- Beyond Porting
- OpenGL documentation
- Instancing in OpenGL
- OpenGL Pixel Buffer Object (PBO)
- Buffer Object
- Drawing with OpenGL
- Shader Storage Buffer Object
- Shader Storage Buffers Objects
- hardware caps, stats (for GL_MAX_VERTEX_UNIFORM_COMPONENTS)
- Array Texture
- Textures Array example
- MultiDrawIndirect example
- The Road to One Million Draws
- MultiDrawIndirect, Surviving without gl_DrawID
- Humus demo about Instancing
Anatoliy Gerlits February 2017