For loop breaking performance?

Started by January 04, 2021 04:08 PM
10 comments, last by cignox1 3 years, 11 months ago

Hello all,

I've noticed some very weird behaviour in my code:

for(int y = start_y; y < end_y; ++y)
{
 for(int x = start_x; x < end_x; ++x)
 {
  Spectrum s;
  vector2<real> pixel_sample(real(x) / real(w), real(1) - (real(y) / real(h)));

  real total_weight = 0.0;

  for(int i = 0; i < aa_quality + 1; ++i)
  {
   vector2<real> sample = pixel_sample;

   if (aa_quality > 1)
   {
    //Randomly offsets the ray if aa > 1 sample per pixel
    vector2<real> d = (Sampler::SampleUniform2D() - vector2<real>(0.5, 0.5)) * (filter.GetWidth() * 2);
    sample += (d / vector2<real>(w, h));
   }

   Spectrum dofs;

   for(int j = 0; j < camera->GetSamplesPerPixel(); j++)
   {
    vector2<real> dof_sample = sample;

    ray<vector4<real>> rr = camera->GenerateRay(dof_sample);
    rr.maxdistance = Q_INF;

    SurfacePoint point;
    point.previousIOR = Spectrum(1.0);

    real w = filter.Evaluate(dof_sample); //Filter weight; shadows the image width w inside this loop
    total_weight += w;

    Spectrum val = qlRenderer->Trace(rr, point, 6);

    dofs += val * w;
   }

   s += dofs / 1;
  }
  s = s / total_weight;

  buffer->SetPixel(s, x, y, real(0));
 }
}

That's the main loop of my raytracer. It's just temporary code, not yet part of the core rendering framework.

I have a test scene that renders in 1 sec as long as aa is set to 1 sample per pixel (variable aa_quality). If I set it to 2, rendering time jumps up to 25 secs!

If I comment out the for statement, leaving only its body, I get back to 1 sec. If I hard-code “2” I get 25 secs.

As far as I can tell, there is no reason for it to behave this way: repeating the loop twice should lead to twice the rendering time.
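To check the "twice the samples, twice the time" expectation in isolation, a small timing harness can help. This is only a sketch: `fake_trace` is a hypothetical stand-in for `qlRenderer->Trace`, and `timed_render` and `RunResult` are names made up for illustration, not part of the raytracer above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the per-sample work (Trace() in the post).
// The result depends on every iteration, so the optimizer cannot drop the loop.
inline std::uint64_t fake_trace(std::uint64_t seed) {
    std::uint64_t v = seed;
    for (int k = 0; k < 2000; ++k) {
        v = v * 6364136223846793005ULL + 1442695040888963407ULL;
    }
    return v;
}

struct RunResult {
    double seconds;          // wall-clock time of the whole render
    std::uint64_t samples;   // number of samples actually traced
    std::uint64_t checksum;  // keeps the traced values observable
};

// Renders a w*h "image" with spp samples per pixel and reports the time taken.
inline RunResult timed_render(int w, int h, int spp) {
    assert(w > 0 && h > 0 && spp > 0);
    const auto t0 = std::chrono::steady_clock::now();
    std::uint64_t checksum = 0;
    std::uint64_t samples = 0;
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            for (int i = 0; i < spp; ++i) {
                checksum ^= fake_trace(std::uint64_t(y) * w + x + i);
                ++samples;
            }
        }
    }
    const auto t1 = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(t1 - t0).count(), samples, checksum};
}
```

Comparing `timed_render(w, h, 2).seconds` against `timed_render(w, h, 1).seconds` should give a ratio close to 2 for work that scales linearly. If the real loop shows 25x instead, the extra cost is probably coming from something other than the iteration count itself.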

Even stranger, if I force the innermost loop (for depth of field) to repeat twice (with the aa loop executing once):

for(int j = 0; j < 2; j++)

nothing unusual happens: I get twice the rendering time, even though the aa and dof loops are essentially the same thing. In both cases 2 rays are shot into the scene.

If that was something related to the optimizer, I would expect both loops to break it.

Any ideas? (I'm using the Visual C++ 2019 compiler.)

Thank you!

I've tested it a bit further and noticed that I still get the fast performance if I comment out the innermost loop. It really looks like Visual C++ is skipping some optimizations with 4 levels of nested loops, or I'm hitting some memory access issue (cache?) that really hurts performance.

Any idea? I could try to reduce the procedure to only 3 nested loops by grouping aa and dof together. At the end of the day, it might be a good idea anyway…
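Grouping the aa and dof loops could look roughly like the sketch below, where one flat index is split back into the two original indices. The names `split_sample_index` and `visit_flat` are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Maps a flat sample index k in [0, aa_samples * dof_samples) back to the
// (aa index, dof index) pair that the two nested loops would have visited.
inline std::pair<int, int> split_sample_index(int k, int dof_samples) {
    assert(dof_samples > 0 && k >= 0);
    return {k / dof_samples, k % dof_samples};
}

// Visits every (i, j) combination with a single loop, in the same order as
// for (i...) { for (j...) { ... } } would.
inline std::vector<std::pair<int, int>> visit_flat(int aa_samples, int dof_samples) {
    std::vector<std::pair<int, int>> order;
    const int total = aa_samples * dof_samples;
    for (int k = 0; k < total; ++k) {
        order.push_back(split_sample_index(k, dof_samples));
    }
    return order;
}
```

In the real loop, the aa jitter is computed once per aa sample, so a flattened version would only recompute the offset when the dof index returned by `split_sample_index` is 0.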

Never mind, still got the issue. Now my loop looks like:

 for(int y = start_y; y < end_y; ++y)
{
 for(int x = start_x; x < end_x; ++x)
 {
  Spectrum s;
  vector2<real> pixel_sample(real(x) / real(w), real(1) - (real(y) / real(h)));
  real total_weight = 0.0;
  for(int i = 0; i < 1/*pp_samples*/; ++i)
  {
   vector2<real> sample = pixel_sample;  

   if (aa_quality > 1) {
    vector2<real> d = (Sampler::SampleUniform2D() - vector2<real>(0.5, 0.5)) * (filter.GetWidth() * 2);
    sample += (d / vector2<real>(w, h)); //Randomly offsets the ray if aa > 1 sample per pixel
   }

   Spectrum dofs;
  /* for(int j = 0; j < 2; j++)  
   {*/
    vector2<real> dof_sample = sample;
    ray<vector4<real>> rr = camera->GenerateRay(dof_sample);
    rr.maxdistance = Q_INF;
   
    SurfacePoint point;
    point.previousIOR = Spectrum(1.0);

    real w = filter.Evaluate(dof_sample);
    total_weight += w;
   
    Spectrum val = qlRenderer->Trace(rr, point, 6);
     
    dofs += val * w;    
  // }
   s += dofs / 1;
  }
  s = s / total_weight;
 
  buffer->SetPixel(s, x, y, real(0));
 }
}

To get full speed, I must comment out the aa offset:

   if (aa_quality > 1) {
    vector2<real> d = (Sampler::SampleUniform2D() - vector2<real>(0.5, 0.5)) * (filter.GetWidth() * 2);
    sample += (d / vector2<real>(w, h)); //Randomly offsets the ray if aa > 1 sample per pixel
   }

If this code is enabled, my performance drops by 10x.

Edit: things get weirder by the second: so with 4 threads I render my test scene in 1 second. Great. With 2 threads: 2 seconds. Fine, as expected.

But what if I only use 1 thread? Render time: 30 secs. That's several times slower than the roughly 4 secs I would expect. And what if I enable the aforementioned strangely behaving code in the loop? 30 secs!! It looks like the aa code that used to break performance doesn't do much damage when running on a single thread.

I don't really understand what's going on here…
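One thing that might be worth ruling out, since the slowdown depends on the thread count and on whether `Sampler::SampleUniform2D()` is called: a random number generator whose state is shared between threads. The internals of `Sampler` are not shown in this thread, so this is purely an assumption. The sketch below shows what a per-thread generator could look like, with `sample_uniform_2d` as a hypothetical stand-in name.

```cpp
#include <cassert>
#include <random>
#include <utility>

// Hypothetical replacement for a sampler with shared state: each thread owns
// its own generator and distribution, so concurrent callers never touch the
// same memory and no lock is needed.
inline std::pair<double, double> sample_uniform_2d() {
    thread_local std::mt19937 rng{std::random_device{}()};
    thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
    const double u = dist(rng);
    const double v = dist(rng);
    assert(u >= 0.0 && u <= 1.0);
    assert(v >= 0.0 && v <= 1.0);
    return {u, v};
}
```

If swapping something like this in for the sampler call changes how render time scales with the thread count, that would point at contention on the sampler rather than at the loop nesting.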

Since you say Visual C++: do you test performance in Debug or Release mode, and which optimizations do you use?

Maybe the MSVC compiler is not clever enough, so try something different, LLVM/clang for example.

No idea either, but it reminds me of a recent similar experience.
I was working on a fluid simulator, and like you i used a templated math lib for vectors and matrices (glm). The inner loops also have nesting level of 4.

Later i replaced glm with what i usually use (Sony's SIMD lib that came with Bullet Physics). This code is 10 years old, supports only SSE and i never did an update. It is pre C++11 - neither numbers nor dimensions are templated.

The speedup i got was an unbelievable 10x. Sony's lib also has a version without SIMD intrinsics, which is only 10% slower. So that's not the reason, and auto vectorization seems to work well with MSVC 2019.

I suspected templates, because profiler showed many samples going into glm constructors. But i did not look at assembly to find the real reasons. Maybe nested loops affect template compiling too - did not think about that. I want to reduce loop level anyways so maybe i can still switch back to glm and see if this makes a difference…

Whatever - something seems very wrong with MSVC. Games usually use Clang, but i guess it's not easy to fully replace MSVC within VS. The installer has options for that (which i did not try yet), but i've heard this only replaces some kind of backend and more configuration is necessary.

Shaarigan said:

Since you say Visual C++: do you test performance in Debug or Release mode, and which optimizations do you use?

Maybe the MSVC compiler is not clever enough, so try something different, LLVM/clang for example.

Release, with O2 enabled. I think I'm going to try another compiler, if I work on this spare-time project further (I abandoned it 10 years ago and only went back to it in the last few days because I had some more spare time :D).

JoeJ said:
and like you i used a templated math lib for vectors and matrices

I'm using my own math lib. I don't think that's the issue, because millions of vector/matrix operations are being performed even with the “1 sec” version. But if I just add those 2 lines of code, suddenly render time increases by 5x…

And thank you both!

cignox1 said:
I think I'm going to try another compiler

Let us know your results, if you get at it… : )

Building boost proved to be kind of a pain (in fact, I had to manually rename the lib files because I could not make boost generate the clang-win libraries), but I was eventually able to compile with clang. It looks like the issue is less punishing now :D

My raytracer is still slow as hell, though. I'll investigate additional optimizations of the code (first guess: vectors/rays).

What does your profiler tell you? Most IDEs for Windows these days have profiling built in, so you can identify exactly which statement is taking the time. The code you posted has template types with overloaded operators, so likely one of them is the issue.

Once the profiler points out the slow ones, you can also pop open the disassembly to see exactly what is going on at every step. It could easily be that one of those seemingly simple overloaded operations is triggering hundreds of lines of complex operations, or even has some debug logging or something similar that is hidden from you but still enabled.
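If a profiler is not at hand, a rough manual alternative is to wrap the suspect statement in a scoped timer and accumulate the time it takes. This is only a sketch: `ScopedTimer` and `time_section` are hypothetical names, and the arithmetic inside the timed scope is a placeholder for the statement under suspicion (for example the aa offset).

```cpp
#include <cassert>
#include <chrono>

// Adds the wall-clock time spent inside a scope to a caller-owned total.
class ScopedTimer {
public:
    explicit ScopedTimer(double& total_seconds)
        : total_(total_seconds), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        const auto end = std::chrono::steady_clock::now();
        total_ += std::chrono::duration<double>(end - start_).count();
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    double& total_;
    std::chrono::steady_clock::time_point start_;
};

// Hypothetical example: time only one section of a loop body.
inline double time_section(int iterations) {
    assert(iterations >= 0);
    double section_seconds = 0.0;
    volatile double sink = 0.0;  // placeholder work the compiler must keep
    for (int i = 0; i < iterations; ++i) {
        {
            ScopedTimer t(section_seconds);
            sink = sink + 0.5 * i;  // the suspect statement would go here
        }
        // ... the rest of the loop body would run here, untimed ...
    }
    return section_seconds;
}
```

The timer itself has a cost, so for very short sections the measured total is mostly overhead. With several render threads, each thread would need its own total to avoid sharing the accumulator.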
