Advertisement

Performance issue when checking if bits are set or not

Started by June 17, 2021 10:55 AM
6 comments, last by hplus0603 3 years, 5 months ago

Hi guys,

I have an interesting one that is baffling me.

I have a 2D array of 'unsigned short' type that I am using for making basic 2D sprites (for custom hardware, so that's why I'm not using DirectX or OpenGL if anyone was wondering). Everything is working perfectly but I notice it takes a massive performance hit if the value being checked has many bits set (out of the 16 bits).

I am checking if the bits are set by doing the following.

#define CHECK_BIT(var,pos) (((var)>>(pos)) & 1)

A value in the array like below is very fast

0b1001100100000000

But if I have a value like the following, it noticeably slows down

0b1111111111111111

I am using raster bars on the hardware to get a rough indicator of time.

I'm not understanding why changing the value would cause a noticeable slowdown. There is no additional work being done in the program for whether the bit is set or not. If it's a 0 a GPIO pin gets fired 0, if it's a 1 a GPIO pin gets fired a 1.

Is there something extremely low level that I'm missing here?

Many thanks in advance ?

DividedByZero said:
verything is working perfectly but I notice it takes a massive performance hit if the value being checked has many bits set (out of the 16 bits).

You did not post how the check you do looks like.
Is it a loop to call your macro for each bit, or something else?
Is there work done only on set bits, and is it maybe included into your timing measurement?

Advertisement

The “for custom hardware” and “GPIO pin” stand out to me.

While big chips like you have in the desktop have hardware to handle the shifts easily, many small chips do not. Shift/rotate instructions are not universal. Learn to look at your disassembly when when performance is slow.

Your wording makes me guess you're working on an Arduino device.

If that's the case, most of the official Arduino chips must do a lot of work under the hood, and basically must unroll the loop. Shifting once may take 10 operations, shifting 15 times may take 150 operations. If you're shifting in a loop testing item 1, 2, 3, 4,… that's 1360 one-byte shift operations. It is faster to work directly with the bitmask in a loop rather than shift out the value each time.

Another concern is that time and system ticks are often a 64-bit value. The chip is not a 32-bit chip, so the system does all kinds of work to make the C++ code work. C++ says it's a 64-bit value, so it will go through all kinds of conversions to make sure your math is still valid.

Note that it isn't universal. If you're using Arduino but your actual chip is an ESP32 device, that has far more capabilities and can handle the operations easily. (As a side note, I've got six ESP32's on my desk right now…) Again, when it comes to performance you need to actually check the innermost details. Check the disassembly, measure timings, and sometimes even read hardware manuals to know what is going on.

I don't have an explanation, but I do suspect that it's worth benchmarking against

((var) & (1<<(pos)))

Depending on how your hardware does or doesn't implement bit shifting, it may be more efficient to shift a fixed value than an arbitrary value.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

So, you're checking if a bit is set, and then what? What are you doing with that information?

If you've got code in the form of:

if (CHECK_BIT(v, n)) {
  do_something();
}

…then it's almost certainly the do_something that is the performance problem. And the more bits are set, the more often the do_something code executes.

On the other hand, if your code looks like this:

if (CHECK_BIT(v, n)) {
  // Do nothing
}

…then the code should be completely eliminated by the optimizer, so check your optimizer settings.

That said, var & (1 << pos) is likely to be faster than (var >> pos) & 1, especially in cases where pos is known at compile-time.

Again, suspecting a small Arduino board based on the wording because we haven't heard otherwise…

The low-cost official Arduino devices use cheap ATmega processors and clones. Most are 8-bit microcontrollers that run at 16 MHz or less. A little more processing power than the original NES, but with 1% of the storage space and no video/sound chips.

Since the platform is open to extension on any microprocessor, many other devices have much better processors. I mentioned the ESP32 chips, a popular dual core 32-bit 240MHz processor, basically a SoC device similar in power to ~1998's desktop computers, yet still compatible with the Arduino libraries and tools.

a light breeze said:
That said, var & (1 << pos) is likely to be faster than (var >> pos) & 1, especially in cases where pos is known at compile-time.

The AVR instruction set has LSR and LSL, shifting right or left by a single bit at a time. It also operates on 8 bits at a time. So either way you go, they've still got to loop through the shift. If they're working on a 16-bit number that's two 8-bit LSL or LSR instructions plus carrying the value. A 32-bit number is about 10 instructions. A 64-bit number gives a notable hit on performance. Shifting by 8 at a time as a constant can get compiler optimizations, but not as a variable.

For anybody who does anything significant with the chips, it is well known that the bit shift operations if (data & (1 << n)) (either direction, right or left) slows down as n increases, and that performance drops tremendously as the variable jumps from 16, 32, or 64-bit values. Unless the device has better CPU instructions for the shift, it's going to be a loop with performance linear to n.

If you're looping across them all, it's far better to do a loop like this:

mask = 1; // whatever size you have, int8, int16, int32, int64, doesn't matter.
// inside the loop
{
if(var & mask) { ... }
mask<<1;
}

With that you only have one per loop, giving much better performance.

The second element was this:

DividedByZero said:
If it's a 0 a GPIO pin gets fired 0, if it's a 1 a GPIO pin gets fired a 1.

Setting pins is somewhat slow. It is slower to write a value than to drop power to zero. It does a lot of tests and checks on pin mode, and translation between logical pin number to physical device output method. Then it has to merge with other pins.

Many people who write for these devices will instead write directly to the port. It's easier to write PORTC=value or PORTD=value and set 8 pins at once. You need to know details of how ports are mapped.

On the official Arduino Uno device, digitalWrite for a value takes about 3 microseconds, clearing takes about 1 microsecond, but setting the port directly takes about 0.2 microseconds and can do a bank of them at once.

Advertisement

So either way you go, they've still got to loop through the shift.

The difference is that the compiler can constant fold the (1 << constant) value, but it cannot constant fold (var >> constant)

Also, avr-gcc is generating less and less optimal code with later versions – version 4(!) seems to be the best one, and version 5 is still OK, but using later versions will generally give you worse code, especially around function prologs.

That being said, AVRs are very much on the way out. The Cortex M-0 chips are faster AND cheaper, and the Teensy based development boards are much faster AND have a higher-quality software support library. Even “standard” things like digitalRead() are better on the Teensy side than on the Arduino side. And it's Cortex-M3/M4/M7, which is much faster. The Teensy 4.0 runs at 600 MHz!

enum Bool { True, False, FileNotFound };

This topic is closed to new replies.

Advertisement