Hello everyone,
I'm working on a rasterizer that renders triangles in 2D space, all additively (no depth testing). I'm looking for performance optimization tips of all kinds. I don't have a lot of experience with OpenCL, memory management, etc., so I'm very happy to learn about all the ways to improve it.
This is for my lens flare tool, so it's essentially rendering bokeh shapes we call "ghosts". I've attached pseudocode of the kernel and a visual representation below.
The rasterizer is written in OpenCL and renders a set of 2D grids additively. Let's assume I have 100 ghosts to render * 5 wavelengths, which means 500 grids to render. To get a smooth interpolation between the 5 wavelengths of each ghost, I generate "fake" grids on the fly that sit between two adjacent wavelengths and render those as well. This leaves me with something like 5 wavelengths * 10 interpolation steps = 50 grids to render per ghost. One optimization step that already happens is that bin_queues get generated before rendering: the image is broken into bins/tiles and, for each tile, a list of which primitives are visible in it is built, so only those get rendered. (Following this: https://github.com/a2flo/oclraster)
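For context, the binning pass conceptually does something like the sketch below. It's not my actual code (the real version follows the oclraster approach linked above), and the one-flag-per-bin-per-quad layout is only for illustration, but it shows where the bin_queues come from:

// Simplified binning sketch (illustration only, not my real code):
// for each quad, flag every 64x64 bin its screen-space bounding box touches,
// so the per-pixel kernel can skip quads that never reach its bin.
// bin_flags is assumed to be zeroed before this kernel runs.
__kernel void bin_quads(__global const float2 *quad_verts,  // 4 vertices per quad
                        __global uchar *bin_flags,          // [bin][quad] visibility flags
                        const int quad_count,
                        const int bins_x,
                        const int bins_y)
{
    int quad_id = get_global_id(0);
    if (quad_id >= quad_count) return;

    float2 v0 = quad_verts[quad_id * 4 + 0];
    float2 v1 = quad_verts[quad_id * 4 + 1];
    float2 v2 = quad_verts[quad_id * 4 + 2];
    float2 v3 = quad_verts[quad_id * 4 + 3];

    // screen-space bounding box of the quad
    float2 lo = fmin(fmin(v0, v1), fmin(v2, v3));
    float2 hi = fmax(fmax(v0, v1), fmax(v2, v3));

    // clamp to the grid of 64x64 pixel bins
    int bx0 = clamp((int)floor(lo.x) / 64, 0, bins_x - 1);
    int by0 = clamp((int)floor(lo.y) / 64, 0, bins_y - 1);
    int bx1 = clamp((int)floor(hi.x) / 64, 0, bins_x - 1);
    int by1 = clamp((int)floor(hi.y) / 64, 0, bins_y - 1);

    // mark the quad visible in every bin the bounding box overlaps
    for (int by = by0; by <= by1; by++)
        for (int bx = bx0; bx <= bx1; bx++)
            bin_flags[(by * bins_x + bx) * quad_count + quad_id] = 1;
}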
I have a couple of thoughts on how this could be optimized, and I'm curious whether anyone has other ideas or feedback on them:
1. Figure out ways to reduce global memory reads. I already tried copying the bin_queues to local memory without any speed gain. I have a feeling I need to somehow limit the vertex reads, but I'm not quite sure how to do that yet. I can't read all the vertices into local memory as there's not enough space for that (the first sketch below this list shows the direction I've been thinking in).
2. Split different ghosts across different work items. Right now I'm looping through ghosts, wavelengths and substeps within a single work item, which lets me write to each pixel only once, but there might be room for better parallelization. If I split up the ghosts, I have to somehow write to the same pixel from multiple work items. Is there an easy way to go about that? Would it even help? (The second sketch below this list is the only approach I've come up with so far.)
3. Optimize the algorithm for rendering a smooth transition between wavelengths. Currently I'm simply interpolating linearly between two wavelengths, creating 10 (substep) "fake" grids that I render. Maybe there's a better way to blend between two wavelengths.
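For item 1 above, the direction I've been thinking in is to have each work-group stage the current batch's vertices in local memory once, instead of every pixel re-reading them from global memory. This is an untested sketch: BATCH_SIZE, WAVELENGTH_COUNT and the flat vertex layout are placeholders for my own data, and I'm not sure one batch even fits in local memory.

#define BATCH_SIZE       256
#define WAVELENGTH_COUNT 5
#define BATCH_VERTS      (BATCH_SIZE * 4 * WAVELENGTH_COUNT)

// at kernel scope: one batch's worth of vertices for all wavelengths
// (~40 KB if Vertex is just a float2, so this is already tight)
__local Vertex local_verts[BATCH_VERTS];

// inside the batch loop, *before* the "skip empty bin" continue,
// because every work item in the group has to reach the barrier:
int lid   = get_local_id(1) * get_local_size(0) + get_local_id(0);
int lsize = get_local_size(0) * get_local_size(1);
for (int i = lid; i < BATCH_VERTS; i += lsize)
    local_verts[i] = vertexes[batch_id * BATCH_VERTS + i];   // assumes a flat per-batch layout
barrier(CLK_LOCAL_MEM_FENCE);

// ... the wavelength / substep loops then read local_verts[] instead of vertexes[]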
Again, any advice on any part of the rasterizer is welcome!
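For item 2 above, the only way I can think of to accumulate into the same pixel from multiple work items is emulating a float atomic add with atomic_cmpxchg, since OpenCL 1.x has no native float atomics. That would mean accumulating into a plain float buffer instead of writing to the image directly, plus a small resolve pass at the end. atomic_add_float and the accum buffer are just hypothetical names, and I haven't benchmarked any of this:

// Emulated float atomic add: CAS-loop on the bit pattern of the target value.
// Lets several work items (e.g. one per ghost) add into the same pixel safely,
// at the cost of contention on that memory location.
inline void atomic_add_float(volatile __global float *addr, float val)
{
    union { uint u; float f; } prev, next;
    do {
        prev.f = *addr;
        next.f = prev.f + val;
    } while (atomic_cmpxchg((volatile __global uint *)addr,
                            prev.u, next.u) != prev.u);
}

// usage from a work item that owns one (pixel, ghost) pair:
//   atomic_add_float(&accum[(y * width + x) * 3 + 0], xyz.x * intensity);
//   atomic_add_float(&accum[(y * width + x) * 3 + 1], xyz.y * intensity);
//   atomic_add_float(&accum[(y * width + x) * 3 + 2], xyz.z * intensity);

My worry is that contention on hot pixels (the center of the flare) would eat whatever the extra parallelism gains, which is part of why I'm asking whether this direction is even worth pursuing.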
__kernel void rasterizer(/* kernel arguments omitted in this pseudocode */)
{
    // the kernel runs once per pixel
    int x = get_global_id(0);
    int y = get_global_id(1);
    int2 pos = (int2)(x, y);

    // accumulated pixel color
    float4 rgba = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

    // loop through batches. 1 batch = 256 quads.
    for (int batch_id = 0; batch_id < batch_count; batch_id++) {
        // each pixel lies in a 64x64 pixel bin / tile.
        // each bin has a bin_queue that holds the visibility of all quads for that bin.
        // this was one step I took towards optimizing:
        // if none of the 256 quads of this batch are visible in this bin, skip the batch
        if (bin_queues[batch_id] is empty) continue;

        // loop through every quad in this batch
        for (int batch_prim_id = 0; batch_prim_id < batch_primitive_count; batch_prim_id++) {
            // if the quad is not visible in this bin, skip it
            if (!bin_queues[batch_id][batch_prim_id]) continue;

            // loop through all wavelengths (one quad per wavelength)
            for (int wavelength_id = 0; wavelength_id < wavelength_count; wavelength_id++) {
                // load the 4 vertices (j = 0,1,2,3) of this quad for the current wavelength...
                Vertex v1[j] = vertexes[wavelength_id];
                // ...and for the next wavelength. In the actual code only one set is
                // loaded per iteration; the other is reused from the previous wavelength.
                Vertex v2[j] = vertexes[wavelength_id + 1];

                for (int i = 0; i < wavelength_sub_count; i++) {
                    // generate a "fake" quad for every substep by linearly interpolating
                    // between the current wavelength's vertices and the next one's
                    float a = (float)i / wavelength_sub_count;
                    Vertex v[j] = mix(v1[j], v2[j], a);

                    // check if the fake quad covers the current pixel
                    if (intersect_quad(pos, v[0], v[1], v[2], v[3])) {
                        // get barycentric coordinates and evaluate the intensity at this pixel
                        float4 weights = compute_barycentric_quad(pos, v[0], v[1], v[2], v[3]);
                        float intensity = fragment_shader(weights, v[0], v[1], v[2], v[3]);

                        // sample xyz at the interpolated wavelength and accumulate
                        float spectrum_pos = (wavelength_id + a) / (float)(wavelength_count - 1);
                        float3 xyz = read_imagef(light_spectrum, sampler, (float2)(spectrum_pos, 0.0f)).xyz;
                        rgba.xyz += xyz * intensity;
                    }
                }
            }
        }
    }

    // write the final accumulated rgba to the pixel
    write_imagef(image, pos, rgba);
}