Hello everyone,
I'm working on a rasterizer that renders triangles in 2D space, all additively (no depth testing). I'm looking for performance optimization tips of all kinds. I don't have a lot of experience with OpenCL, memory management, etc., so I'm very happy to learn about all the ways to improve it.
This is for my lens flare tool, so it's essentially rendering bokeh shapes we call "ghosts". I've attached pseudocode of the kernel and a visual representation below.
The rasterizer is written in OpenCL and renders a set of 2D grids additively. Let's assume I have 100 ghosts to render * 5 wavelengths, which means 500 grids to render. To get a smooth interpolation between the 5 wavelengths of each ghost, I generate "fake" grids on the fly that sit between two adjacent wavelengths and render those as well. This leaves me with something like 5 wavelengths * 10 interpolation steps = 50 grids to render per ghost. One optimization step that already happens is that bin_queues get generated before rendering: the image is broken into bins/tiles and, for each tile, a list of which primitives are visible in it is built, so only those get rendered. (Following this: https://github.com/a2flo/oclraster)
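For context, the binning pass conceptually does something like the sketch below. It's not my actual code (the real version follows the oclraster approach linked above), and the one-flag-per-bin-per-quad layout is only for illustration, but it shows where the bin_queues come from:

// Simplified binning sketch (illustration only, not my real code):
// for each quad, flag every 64x64 bin its screen-space bounding box touches,
// so the per-pixel kernel can skip quads that never reach its bin.
// bin_flags is assumed to be zeroed before this kernel runs.
__kernel void bin_quads(__global const float2 *quad_verts,  // 4 vertices per quad
                        __global uchar *bin_flags,          // [bin][quad] visibility flags
                        const int quad_count,
                        const int bins_x,
                        const int bins_y)
{
    int quad_id = get_global_id(0);
    if (quad_id >= quad_count) return;

    float2 v0 = quad_verts[quad_id * 4 + 0];
    float2 v1 = quad_verts[quad_id * 4 + 1];
    float2 v2 = quad_verts[quad_id * 4 + 2];
    float2 v3 = quad_verts[quad_id * 4 + 3];

    // screen-space bounding box of the quad
    float2 lo = fmin(fmin(v0, v1), fmin(v2, v3));
    float2 hi = fmax(fmax(v0, v1), fmax(v2, v3));

    // clamp to the grid of 64x64 pixel bins
    int bx0 = clamp((int)floor(lo.x) / 64, 0, bins_x - 1);
    int by0 = clamp((int)floor(lo.y) / 64, 0, bins_y - 1);
    int bx1 = clamp((int)floor(hi.x) / 64, 0, bins_x - 1);
    int by1 = clamp((int)floor(hi.y) / 64, 0, bins_y - 1);

    // mark the quad visible in every bin the bounding box overlaps
    for (int by = by0; by <= by1; by++)
        for (int bx = bx0; bx <= bx1; bx++)
            bin_flags[(by * bins_x + bx) * quad_count + quad_id] = 1;
}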
I have a couple of thoughts on how this could be optimized, and I'm curious whether anyone has other ideas or feedback on them:
1. Figure out ways to reduce global memory reads. I already tried copying the bin_queues to local memory without any speed gain. I have a feeling I need to somehow limit the vertex reads, but I'm not quite sure how to do that yet. I can't read all the vertices into local memory as there's not enough space for that (the first sketch below this list shows the direction I've been thinking in).
2. Split different ghosts across different work items. Right now I'm looping through ghosts, wavelengths and substeps within a single work item, which lets me write to each pixel only once, but there might be room for better parallelization. If I split up the ghosts, I have to somehow write to the same pixel from multiple work items. Is there an easy way to go about that? Would it even help? (The second sketch below this list is the only approach I've come up with so far.)
3. Optimize the algorithm for rendering a smooth transition between wavelengths. Currently I'm simply interpolating linearly between two wavelengths, creating 10 (substep) "fake" grids that I render. Maybe there's a better way to blend between two wavelengths.
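For item 1 above, the direction I've been thinking in is to have each work-group stage the current batch's vertices in local memory once, instead of every pixel re-reading them from global memory. This is an untested sketch: BATCH_SIZE, WAVELENGTH_COUNT and the flat vertex layout are placeholders for my own data, and I'm not sure one batch even fits in local memory.

#define BATCH_SIZE       256
#define WAVELENGTH_COUNT 5
#define BATCH_VERTS      (BATCH_SIZE * 4 * WAVELENGTH_COUNT)

// at kernel scope: one batch's worth of vertices for all wavelengths
// (~40 KB if Vertex is just a float2, so this is already tight)
__local Vertex local_verts[BATCH_VERTS];

// inside the batch loop, *before* the "skip empty bin" continue,
// because every work item in the group has to reach the barrier:
int lid   = get_local_id(1) * get_local_size(0) + get_local_id(0);
int lsize = get_local_size(0) * get_local_size(1);
for (int i = lid; i < BATCH_VERTS; i += lsize)
    local_verts[i] = vertexes[batch_id * BATCH_VERTS + i];   // assumes a flat per-batch layout
barrier(CLK_LOCAL_MEM_FENCE);

// ... the wavelength / substep loops then read local_verts[] instead of vertexes[]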
Again, any advice on any part of the rasterizer is welcome!
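For item 2 above, the only way I can think of to accumulate into the same pixel from multiple work items is emulating a float atomic add with atomic_cmpxchg, since OpenCL 1.x has no native float atomics. That would mean accumulating into a plain float buffer instead of writing to the image directly, plus a small resolve pass at the end. atomic_add_float and the accum buffer are just hypothetical names, and I haven't benchmarked any of this:

// Emulated float atomic add: CAS-loop on the bit pattern of the target value.
// Lets several work items (e.g. one per ghost) add into the same pixel safely,
// at the cost of contention on that memory location.
inline void atomic_add_float(volatile __global float *addr, float val)
{
    union { uint u; float f; } prev, next;
    do {
        prev.f = *addr;
        next.f = prev.f + val;
    } while (atomic_cmpxchg((volatile __global uint *)addr,
                            prev.u, next.u) != prev.u);
}

// usage from a work item that owns one (pixel, ghost) pair:
//   atomic_add_float(&accum[(y * width + x) * 3 + 0], xyz.x * intensity);
//   atomic_add_float(&accum[(y * width + x) * 3 + 1], xyz.y * intensity);
//   atomic_add_float(&accum[(y * width + x) * 3 + 2], xyz.z * intensity);

My worry is that contention on hot pixels (the center of the flare) would eat whatever the extra parallelism gains, which is part of why I'm asking whether this direction is even worth pursuing.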
__kernel void rasterizer(/* kernel arguments omitted in this pseudocode */)
{
    // the kernel runs once per pixel
    int x = get_global_id(0);
    int y = get_global_id(1);
    int2 pos = (int2)(x, y);

    // accumulated pixel color
    float4 rgba = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

    // loop through batches. 1 batch = 256 quads.
    for (int batch_id = 0; batch_id < batch_count; batch_id++) {
        // each pixel lies in a 64x64 pixel bin / tile.
        // each bin has a bin_queue that holds the visibility of all quads for that bin.
        // this was one step I took towards optimizing:
        // if none of the 256 quads of this batch are visible in this bin, skip the batch
        if (bin_queues[batch_id] is empty) continue;

        // loop through every quad in this batch
        for (int batch_prim_id = 0; batch_prim_id < batch_primitive_count; batch_prim_id++) {
            // if the quad is not visible in this bin, skip it
            if (!bin_queues[batch_id][batch_prim_id]) continue;

            // loop through all wavelengths (one quad per wavelength)
            for (int wavelength_id = 0; wavelength_id < wavelength_count; wavelength_id++) {
                // load the 4 vertices (j = 0,1,2,3) of this quad for the current wavelength...
                Vertex v1[j] = vertexes[wavelength_id];
                // ...and for the next wavelength. In the actual code only one set is
                // loaded per iteration; the other is reused from the previous wavelength.
                Vertex v2[j] = vertexes[wavelength_id + 1];

                for (int i = 0; i < wavelength_sub_count; i++) {
                    // generate a "fake" quad for every substep by linearly interpolating
                    // between the current wavelength's vertices and the next one's
                    float a = (float)i / wavelength_sub_count;
                    Vertex v[j] = mix(v1[j], v2[j], a);

                    // check if the fake quad covers the current pixel
                    if (intersect_quad(pos, v[0], v[1], v[2], v[3])) {
                        // get barycentric coordinates and evaluate the intensity at this pixel
                        float4 weights = compute_barycentric_quad(pos, v[0], v[1], v[2], v[3]);
                        float intensity = fragment_shader(weights, v[0], v[1], v[2], v[3]);

                        // sample xyz at the interpolated wavelength and accumulate
                        float spectrum_pos = (wavelength_id + a) / (float)(wavelength_count - 1);
                        float3 xyz = read_imagef(light_spectrum, sampler, (float2)(spectrum_pos, 0.0f)).xyz;
                        rgba.xyz += xyz * intensity;
                    }
                }
            }
        }
    }

    // write the final accumulated rgba to the pixel
    write_imagef(image, pos, rgba);
}