
Extremely slow in Debug: Converting between pixel formats for pixels held in std::vector

Started by December 02, 2018 10:53 AM
11 comments, last by Arjan B 6 years ago

I started working on a raytracer project (again) and ran into a problem when compiling a Debug configuration.

All I have at the moment is a set of pixels in float RGB format that I convert to unsigned char RGBA format that SFML wants. This happens once per frame, running at 200+ FPS in Release mode, but 1 FPS in debug mode. Please have a look at the attached profiling result.

It seems to spend almost all of its time in std::vector::push_back().

Is there any way to speed up this process? Could I create all elements in a batch and then start filling in values? Is there some handy use of std::transform that I could apply?

Thank you in advance!


std::vector<sf::Uint8> SfmlFilm::ToRgba(const std::vector<Spectrum>& image)
{
	std::vector<sf::Uint8> rgba;
	rgba.reserve(GetWidth() * GetHeight() * 4);
	for (auto spectrum : image)
	{
		const auto max = 255;
		rgba.push_back(static_cast<sf::Uint8>(spectrum.r * max));
		rgba.push_back(static_cast<sf::Uint8>(spectrum.g * max));
		rgba.push_back(static_cast<sf::Uint8>(spectrum.b * max));
		rgba.push_back(max);
	}
	return rgba;
}

 

[Attached image: profiling.PNG]

You can try compiling with _ITERATOR_DEBUG_LEVEL set to 0 (or 1), i.e.

/D_ITERATOR_DEBUG_LEVEL=0

 

This is one of the main reasons why MSVC can be abysmally slow in debug config, and while certainly handy, iterator debugging is not required in most cases.
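To confirm that the define actually reaches the code being compiled, a small check like the one below can help. This is only a sketch: IteratorDebugLevel and ReportIteratorDebugLevel are hypothetical helper names, not part of MSVC or SFML. Keep in mind that the value has to be the same in every translation unit and every library you link against, otherwise MSVC typically reports a mismatch at link time.

```cpp
#include <cstdio>
#include <vector>  // including any standard header lets MSVC's STL define _ITERATOR_DEBUG_LEVEL

// Hypothetical helper: returns the iterator debug level this translation
// unit was compiled with, or -1 when the macro is not defined at all
// (for example with a non-MSVC standard library).
constexpr int IteratorDebugLevel()
{
#if defined(_ITERATOR_DEBUG_LEVEL)
    return _ITERATOR_DEBUG_LEVEL;
#else
    return -1;
#endif
}

// Hypothetical helper: prints the level, e.g. once at program start-up.
void ReportIteratorDebugLevel()
{
    std::printf("_ITERATOR_DEBUG_LEVEL = %d\n", IteratorDebugLevel());
}
```

Calling ReportIteratorDebugLevel() at the top of main() in the Debug configuration shows whether the setting was picked up.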


Another thing to look for is conditional breakpoints, which can slow execution down a lot. I usually end up just modifying the code itself if I need one.

1 hour ago, Juliean said:

You can try compiling with _ITERATOR_DEBUG_LEVEL set to 0 (or 1), i.e.

/D_ITERATOR_DEBUG_LEVEL=0

 

This is one of the main reasons why MSVC can be abysmally slow in debug config, and while certainly handy, iterator debugging is not required in most cases.

Where do you set this option?

I've tried to add '_ITERATOR_DEBUG_LEVEL=0' to the preprocessor definitions of my project, but I only get compile errors.

... could save me so much time, I hope...

 

1 hour ago, JoeJ said:

Where do you set this option?

I've tried to add '_ITERATOR_DEBUG_LEVEL=0' to the preprocessor definitions of my project, but I only get compile errors.

... could save me so much time, I hope...

 

My guess is it's in the compile options in project settings, but I'm not at my computer at the moment.

Thanks for the suggestions!

Sadly, setting iterator debug level to 0 made no difference. Seems that range checking on the iterators is not the bottleneck.


It might literally be something in std::vector that's turned on during debugging. Did you try stepping through it and looking for #ifdef statements? This is one of the reasons I still use my old container classes. Stuff like this never happens, and if by chance it does, it's easy to figure out the problem.

Ok, my mistake was the wrong '=' here: _ITERATOR_DEBUG_LEVEL=0, haha :D

So I got it working, but it does not help (I tried it years ago, as I remember now).

In a release build my test takes 114 ms, and in a debug build it still takes 5600 ms.

Unlike OP, I almost never push. I resize the vectors to known requirements before filling them with data, so memory allocation is probably not the problem either.

It remains a mystery why the MSVC STL is unusable in debug builds. :(

 

Quote

Sadly, setting iterator debug level to 0 made no difference. Seems that range checking on the iterators is not the bottleneck.

Here are a few other things you could try, if you haven't already (if one of these ideas were successful, you'd probably want to check it in release mode as well to make sure it didn't perform worse than your current solution outside of debug mode):

- Try emplace_back() rather than push_back(). Given that the elements are primitives I wouldn't expect this to make any measurable difference, but it would be easy to try, and even just using a different function might tell you something about where the issue is.

- Instead of using reserve(), create the vector with the appropriate size from the outset and then set the elements using indexed access. This probably isn't a great solution because you pay for element initialization that you wouldn't pay for using other methods, but it still might be informative.

- Use a static vector and indexed access. If the size is always the same you can size it up front, else you could increase the size as needed. There might be an optimization opportunity there even in release mode because of per-instance costs associated with your current approach (such as memory churn and vector creation).

- If you can access the source code, look for other compiler switches that might be having an effect (I think Gnollrunner suggested this). Standard library code can be hard to read and analyze though, so this may not be straightforward.

- Do it 'manually' using a raw array with new/delete, just to see if getting std::vector out of the way indeed solves the problem. (I'm a fan of not reinventing the wheel and of the standard library, but if MSVC's implementation is insufficiently performant in debug mode, you may just have to work around it.)

Note that these are just some ideas off the top of my head, and they may be incorrect or misguided, or not lead to anything worthwhile.
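For what it's worth, here is a rough sketch of the "size it up front, then fill by index" and "reuse the buffer" ideas combined. Everything in it is an assumption made for illustration: Spectrum is a stand-in for the OP's type, std::uint8_t stands in for sf::Uint8, ToRgbaInto is a made-up name, and the colour components are assumed to lie in [0, 1]. Writing through the pointer from data() also sidesteps the checked operator[] in debug builds. Whether this is actually faster in the OP's setup would have to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the OP's Spectrum type; components assumed in [0, 1].
struct Spectrum { float r, g, b; };

// Converts float RGB to 8-bit RGBA, writing into a caller-owned buffer so
// the allocation can be reused from frame to frame.
void ToRgbaInto(const std::vector<Spectrum>& image, std::vector<std::uint8_t>& rgba)
{
    // Size the output once; resize() is cheap when the size does not change.
    rgba.resize(image.size() * 4);

    // Plain pointers avoid the per-call overhead of push_back and of the
    // checked operator[] in debug builds.
    const Spectrum* in = image.data();
    std::uint8_t* out = rgba.data();
    const float max = 255.0f;

    for (std::size_t i = 0; i < image.size(); ++i)
    {
        out[4 * i + 0] = static_cast<std::uint8_t>(in[i].r * max);
        out[4 * i + 1] = static_cast<std::uint8_t>(in[i].g * max);
        out[4 * i + 2] = static_cast<std::uint8_t>(in[i].b * max);
        out[4 * i + 3] = 255;  // fully opaque alpha
    }
}
```

The caller would keep the output vector as a member (or a static) and pass it in every frame, so that after the first frame no allocation happens at all.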

One more suggestion to help with your analysis: perform only one operation per statement in your loop body so you can see where the slowdown is. Instead of this


rgba.push_back(static_cast<sf::Uint8>(spectrum.r * max));

break it into three separate statements.


auto rtmp1 = spectrum.r * max;
sf::Uint8 rtmp2 = static_cast<sf::Uint8>(rtmp1);
rgba.push_back(rtmp2);

Then your timings will show where you should be concentrating your tuning efforts. Note that once you fix the problem you don't have to leave it like this; it's just an analysis technique.

Also, try making "max" an appropriate floating-point number to avoid so many type conversions.

Also, try making your loop variable an auto const& instead of a copy of the colour vector.  That's unlikely to improve performance much but might eliminate a memory access or three at low optimization levels.
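Put together, the last two suggestions might look something like the sketch below. It is not the OP's actual code: Spectrum is a stand-in struct, std::uint8_t replaces sf::Uint8 so the snippet compiles without SFML, ToRgbaByRef is a made-up name, and the components are assumed to lie in [0, 1].

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the OP's Spectrum type; components assumed in [0, 1].
struct Spectrum { float r, g, b; };

std::vector<std::uint8_t> ToRgbaByRef(const std::vector<Spectrum>& image)
{
    std::vector<std::uint8_t> rgba;
    rgba.reserve(image.size() * 4);

    // A float constant keeps the multiplication in float, with no
    // int-to-float conversion on every component.
    const float max = 255.0f;

    // Iterating by const reference avoids copying each Spectrum.
    for (const auto& spectrum : image)
    {
        rgba.push_back(static_cast<std::uint8_t>(spectrum.r * max));
        rgba.push_back(static_cast<std::uint8_t>(spectrum.g * max));
        rgba.push_back(static_cast<std::uint8_t>(spectrum.b * max));
        rgba.push_back(std::uint8_t{255});  // fully opaque alpha
    }
    return rgba;
}
```

This keeps push_back in place, so it only removes the per-pixel copy and the type conversions; it does not address any overhead inside std::vector itself.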

Stephen M. Webb
Professional Free Software Developer

This topic is closed to new replies.
