Instancing for everyone
History
Instancing became a core feature of OpenGL with version 3.1 in 2009, through ARB_draw_instanced. At that time you could only use Texture Buffer Objects (TBOs) or Uniform Buffer Objects (UBOs) to deliver per-instance data to the shaders. A year later, in 2010, OpenGL 3.3 arrived and made the ARB_instanced_arrays extension a core feature. With this addition you could use actual Vertex Buffer Objects (VBOs) to deliver your data. Yay! There is a restriction, however: the specification only guarantees 16 vertex attributes for your vertex shader (query GL_MAX_VERTEX_ATTRIBS for the actual limit), which makes it 16 * vec4 = 64 float values. Also in 2010, ARB_draw_indirect (OpenGL 4.0) made it to the core. It enables you to pass the parameters of the draw functions indirectly, that is, from a buffer object instead of function arguments (glDrawArraysIndirect, glDrawElementsIndirect). In 2011 ARB_base_instance (OpenGL 4.2) became a core feature too. This allows you to specify an offset into the instance data, so you can draw a half-open range [x...y) of it. ARB_transform_feedback_instanced was added as well, which allows you to draw the data captured by transform feedback using instancing.
The Big Concept
So when is it appropriate to use instancing? When you would like to draw the same thing thousands of times. The reason is that a single draw call (glDraw*) costs a lot of CPU time, as the driver needs to do some checking and preparation (magic!) before the function call returns. On an average PC, about 2000 draw calls per frame is usually the most you can do without hurting your frame rate too badly (remember: you have at most 33 ms per frame!). So drawing something 1000 times would use up half of your draw calls, and that is bad. Instancing solves this by allowing you to tell the driver: 'Hey, I'd like to draw this piece of geometry 1000 times'. But you would wind up with 1000 objects in the same place, right? To solve this, you can pass data that is unique to each of the 1000 objects drawn. This is what I call 'Instance Data'. This is usually a (modelview) matrix, but for the sake of simplicity, I will only store one vec4 (position).
Algorithm overview
Normal rendering:
For each frame:
  For each object:
    -upload object specific data to uniforms (or UBOs)
    -render the object
Instancing:
For each frame:
  For each object:
    -prepare instance data, store it in a buffer (no need to do this each frame if the buffer is static)
  -upload that buffer to the GPU
  -render the objects using instancing, using the provided Instance Data
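To make the difference between the two loops concrete, here is a minimal sketch in plain C++ that only counts draw calls. DrawCallCounter, Vec4, render_normal and render_instanced are made-up names for illustration; the counter merely stands in for the driver and no OpenGL function is called.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec4 { float x, y, z, w; }; // hypothetical stand-in for a vec4 position

// Hypothetical stand-in for the driver: it only counts what it is asked to do.
struct DrawCallCounter
{
  std::size_t draw_calls = 0;
  std::size_t instances_drawn = 0;

  // plays the role of glDrawElements
  void draw() { ++draw_calls; ++instances_drawn; }
  // plays the role of glDrawElementsInstanced
  void draw_instanced( std::size_t n ) { ++draw_calls; instances_drawn += n; }
};

// Normal rendering: one draw call per object.
std::size_t render_normal( DrawCallCounter& gpu, const std::vector<Vec4>& objects )
{
  for( std::size_t i = 0; i < objects.size(); ++i )
  {
    // the object specific data would be uploaded to a uniform here
    gpu.draw();
  }
  return gpu.draw_calls;
}

// Instancing: prepare the instance data, then issue a single draw call.
std::size_t render_instanced( DrawCallCounter& gpu, const std::vector<Vec4>& objects )
{
  std::vector<Vec4> instance_data( objects.begin(), objects.end() );
  assert( instance_data.size() == objects.size() );
  // the buffer would be uploaded to the GPU here
  gpu.draw_instanced( instance_data.size() );
  return gpu.draw_calls;
}
```

For 1000 objects, render_normal reports 1000 draw calls while render_instanced reports 1, and both draw the same number of objects.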
You can clearly see that the number of draw calls is reduced from n to 1 (plus no uniform passing!).
The implementation
I'm going to use a small (~600 lines) framework I wrote for prototyping techniques. This allows me to hide irrelevant code. We are going to draw cubes. The first step to drawing cubes is to create a VBO that contains the vertex data.
GLuint box = frm.create_box(); //Vertex Array Object (VAO) of the box
Then we are going to create the VBO for the instance data: the positions of the cubes. To do this we need a buffer (memory) and a VBO.
First let's bind the fresh VAO
glBindVertexArray( box );
Then create the buffer
vector<vec4> positions; //one vec4 position per cube
positions.resize( size * size ); //make some space
Then create the VBO for this data
GLuint position_vbo;
glGenBuffers( 1, &position_vbo ); //gen vbo
glBindBuffer( GL_ARRAY_BUFFER, position_vbo ); //bind vbo
Here comes the interesting part: you need to tell the driver that you are going to use this VBO for instancing. To do this you need to tell it these things:
-which vertex attribute location will you use? (2)
-how many components does each piece of data have? (vec4, so 4)
-what type of data are you passing? (floats)
-is this data normalized? (they are positions, so probably no)
-how many bytes is each piece of data? (vec4, so 4 * sizeof( float ) )
-if this data consists of more than four components (like a mat4), then where is this specific data located (relative to the whole piece of data, in bytes)?
-is this data instanced?
All this in code:
GLuint location = 2;
GLint components = 4;
GLenum type = GL_FLOAT;
GLboolean normalized = GL_FALSE;
GLsizei datasize = sizeof( vec4 );
char* pointer = 0; //no other components
GLuint divisor = 1; //instanced
glEnableVertexAttribArray( location ); //tell the location
glVertexAttribPointer( location, components, type, normalized, datasize, pointer ); //tell other data
glVertexAttribDivisor( location, divisor ); //is it instanced?
If the data you would like to pass is a mat4, for example, you end up using 4 vertex attribute locations, because each location holds at most a vec4. This requires you to set up the VBO a bit differently: you have to tell where each column (vec4) of the matrix is located within each piece of data, in bytes. The offset is passed as a GLvoid*, which carries no element size (so no pointer arithmetic on it). Therefore you need to compute the offset in bytes and convert that to GLvoid*.
In code:
GLuint location = 2;
GLint components = 4;
GLenum type = GL_FLOAT;
GLboolean normalized = GL_FALSE;
GLsizei datasize = sizeof( mat4 );
char* pointer = 0;
GLuint divisor = 1;
/**
Matrix:
float mat[16] =
{
1, 0, 0, 0, //first column: location at 0 + 0 * sizeof( vec4 ) bytes into the matrix
0, 1, 0, 0, //second column: location at 0 + 1 * sizeof( vec4 ) bytes into the matrix
0, 0, 1, 0, //third column: location at 0 + 2 * sizeof( vec4 ) bytes into the matrix
0, 0, 0, 1 //fourth column: location at 0 + 3 * sizeof( vec4 ) bytes into the matrix
};
/**/
//you need to do everything for each vertex attribute location
for( int c = 0; c < 4; ++c )
{
glEnableVertexAttribArray( location + c ); //location of each column
glVertexAttribPointer( location + c, components, type, normalized, datasize, pointer + c * sizeof( vec4 ) ); //tell other data
glVertexAttribDivisor( location + c, divisor ); //is it instanced?
}
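The byte offsets used in the loop above can be checked without any OpenGL at all. The following is a small sketch, assuming a tightly packed column-major matrix of 16 floats; Vec4, Mat4, column_offset and instance_stride are hypothetical names, not part of the framework.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the math library types.
struct Vec4 { float x, y, z, w; };
struct Mat4 { Vec4 columns[4]; }; // column-major: 4 columns of 4 floats

static_assert( sizeof( Mat4 ) == 16 * sizeof( float ), "Mat4 must be tightly packed" );

// Byte offset of column c inside one matrix. This is the value that gets
// converted to a pointer and passed as the last argument of glVertexAttribPointer.
constexpr std::size_t column_offset( std::size_t c )
{
  return c * sizeof( Vec4 );
}

// Byte distance between the matrices of two consecutive instances (the stride).
constexpr std::size_t instance_stride()
{
  return sizeof( Mat4 );
}
```

With 4-byte floats the four columns start at 0, 16, 32 and 48 bytes, and the next instance's matrix starts 64 bytes later, which is why the stride is sizeof( mat4 ) for all four locations.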
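The divisor tells the driver whether the data is instanced. If the divisor is 0 (the default), the data is not instanced: the attribute advances once per vertex. If it is 1, the attribute advances once per instance. For any value n > 1, the attribute advances once every n instances, so the element that is read is the instance index divided by n (rounded down). The value of gl_InstanceID in the vertex shader is not changed by the divisor.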
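As a rough model of how the divisor selects data, the sketch below computes which element of the attribute array gets read. attribute_element is a made-up helper, and it is simplified: it ignores base vertex offsets, and base_instance models the offset added by the *BaseInstance draw calls from OpenGL 4.2 (0 for every other draw call).

```cpp
#include <cstdint>

// Which element of the attribute array a given vertex of a given instance reads.
// divisor == 0: regular per-vertex attribute, indexed by the vertex index.
// divisor >= 1: the element advances once every `divisor` instances,
//               integer division rounds down.
constexpr std::uint32_t attribute_element( std::uint32_t vertex, std::uint32_t instance,
                                           std::uint32_t divisor, std::uint32_t base_instance = 0 )
{
  return divisor == 0 ? vertex : instance / divisor + base_instance;
}
```

So with a divisor of 2, instances 6 and 7 both read element 3, while with a divisor of 1 every instance reads its own element.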
Next you need to load up the shaders. I'm using a super-simple deferred shader for the sake of maximizing shading efficiency, and making these shaders simple.
Vertex shader:
#version 330 core
uniform mat4 mvp; //modelviewprojection matrix
uniform mat3 normal_mat;
layout(location=0) in vec4 in_vertex; //cube vertex position
layout(location=1) in vec3 in_normal; //cube face normal
layout(location=2) in vec4 pos; //instance data, unique to each object (instance)
out vec3 normal;
void main()
{
normal = normal_mat * in_normal;
gl_Position = mvp * vec4(in_vertex.xyz + pos.xyz, 1); //write to the depth buffer
}
Pixel shader:
#version 330 core
in vec3 normal;
layout(location=0) out vec4 color; //normals go here
void main()
{
color = vec4(normal * 0.5 + 0.5, 1);
}
Loading the shaders
GLuint gbuffer_instanced_shader = 0;
frm.load_shader( gbuffer_instanced_shader, GL_VERTEX_SHADER, "../shaders/instancing2/gbuffer_instanced.vs" );
frm.load_shader( gbuffer_instanced_shader, GL_FRAGMENT_SHADER, "../shaders/instancing2/gbuffer.ps" );
GLint gbuffer_instanced_mvp_mat_loc = glGetUniformLocation( gbuffer_instanced_shader, "mvp" );
GLint gbuffer_instanced_normal_mat_loc = glGetUniformLocation( gbuffer_instanced_shader, "normal_mat" );
Finally all you need to do is render the cubes. Usually this would look something like this:
//regular rendering
glBindVertexArray( box );
for( int c = 0; c < size; ++c )
{
for( int d = 0; d < size; ++d )
{
glUniform4f( gbuffer_pos_loc, c * 3 - size, -2 + 0.5 * sin( radians( ( c + d + 1 )* timer.getElapsedTime().asSeconds() ) ), -d * 3, 0 ); //this gives it some ocean-like movement
glDrawElements( GL_TRIANGLES, 36, GL_UNSIGNED_INT, 0 ); //two triangles per face, that is 6 * 6 = 36 vertices
}
}
For instancing, however, you need to update the instance buffer first and then issue a single draw call. It looks like this:
//instanced rendering
glBindVertexArray( box );
//store positions in the buffer
for( int c = 0; c < size; ++c )
{
for( int d = 0; d < size; ++d )
{
positions[c * size + d] = vec4( c * 3 - size, -2 + 0.5 * sin( radians( ( c + d + 1 )* timer.getElapsedTime().asSeconds() ) ), -d * 3, 0 );
}
}
//upload the instance data
glBindBuffer( GL_ARRAY_BUFFER, position_vbo ); //bind vbo
//you need to upload sizeof( vec4 ) * number_of_cubes bytes, DYNAMIC_DRAW because it is updated per frame
glBufferData( GL_ARRAY_BUFFER, sizeof( vec4 ) * positions.size(), &positions[0][0], GL_DYNAMIC_DRAW );
glDrawElementsInstanced( GL_TRIANGLES, 36, GL_UNSIGNED_INT, 0, positions.size() );
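The CPU part of this step, filling the instance buffer, can be pulled out into a function and tested on its own. This is only a sketch: Vec4, to_radians and fill_positions are hypothetical names standing in for the math library's vec4 and radians(), and the elapsed time is passed in as a plain float instead of being read from the timer.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec4 { float x, y, z, w; }; // hypothetical stand-in for vec4

// Degrees to radians, playing the role of radians().
float to_radians( float degrees )
{
  return degrees * 3.14159265358979f / 180.0f;
}

// Fill the instance buffer for a size * size grid of cubes at time `seconds`.
// Cube (c, d) is stored at index c * size + d, matching the loop above.
void fill_positions( std::vector<Vec4>& positions, int size, float seconds )
{
  assert( size >= 0 );
  positions.resize( static_cast<std::size_t>( size ) * size );
  for( int c = 0; c < size; ++c )
  {
    for( int d = 0; d < size; ++d )
    {
      // the sine gives the grid its ocean-like movement
      float height = -2.0f + 0.5f * std::sin( to_radians( ( c + d + 1 ) * seconds ) );
      positions[static_cast<std::size_t>( c ) * size + d] =
        Vec4{ c * 3.0f - size, height, -d * 3.0f, 0.0f };
    }
  }
}
```

The filled vector would then be handed to glBufferData exactly as above, for example through positions.data().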
This is it. The rest of the code is setting up the deferred shader, and some controls that should be pretty straightforward.
Interesting Points
Interestingly, doing the simple sin() on the CPU to update the positions became the bottleneck after ~1,000,000 cubes. If I used a matrix, then the matrix multiplication became an issue after ~160,000 cubes. This means that even when doing instancing you still need to be clever about the CPU side (doing the matrix multiplications using SIMD instructions, or in the shaders). After all, updating positions for lots of data is a data parallel task, which is the kind of work the GPU usually likes.
Conclusion
Instancing is very important to make sure draw calls are not a bottleneck. I hope more and more people will end up using it in the future.
Additional resources:
-project source
controls: WASD, space to toggle between instancing (green) and normal rendering (red)
building: use cmake to generate project (set CMAKE_BUILD_TYPE to "Release")
https://docs.google.com/file/d/0B33Sh832pOdObExOLTRCRF9QWU0/edit?usp=sharing
-OpenGL history
http://www.opengl.org/wiki/History_of_OpenGL
-Instancing on the OpenGL wiki
http://www.opengl.org/wiki/Vertex_Rendering#Instancing
http://www.opengl.org/wiki/Vertex_Specification#Instanced_arrays
http://www.opengl.org/wiki/Vertex_Rendering#Transform_feedback_rendering
-related tutorials I found
http://ogldev.atspace.co.uk/www/tutorial33/tutorial33.html
http://sol.gfxile.net/instancing.html
-instance culling using transform feedback
http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/