
Thread synchronization timing precision

Started by September 16, 2021 04:50 AM
8 comments, last by All8Up 3 years, 1 month ago

I have a renderer that operates on a dedicated thread. The rendering frequency is independent from the world update frequency, so my framerate can be very high while the world updates at a steady 60 Hz.

I'm experiencing a lot of problems with the timing of the data coming from the main thread to the rendering thread. Each interval should be roughly 16.667 milliseconds, but I am seeing numbers like below. The renderer interpolates between the position of the previous two frames of data received, but with the timing imprecision I am seeing a lot of jerky motion.

How can I more precisely synchronize thread timing?

  • 14.607
  • 16.139
  • 14.886
  • 15.619
  • 15.339
  • 14.614
  • 17.032
  • 14.538
  • 15.772
  • 16.56
  • 30.511
  • 1.984
  • 30.747
  • 14.103
  • 16.712
  • 16.728
  • 13.97
  • 15.113
  • 16.71
  • 15.148
  • 14.758
  • 15.328
  • 16.726
  • 14.814
  • 15.491
  • 30.86
  • 15.431
  • 16.623
  • 14.311
  • 16.118
  • 14.889
  • 15.242
  • 16.588
  • 16.544
  • 15.248
  • 14.76
  • 15.093
  • 15.765
  • 15.345
  • 32.86
  • 30.106
  • 14.885
  • 14.996
  • 15.562

10x Faster Performance for VR: www.ultraengine.com

The problem was caused by the use of the Sleep() command in the main thread, which apparently has a very wide margin of error.

10x Faster Performance for VR: www.ultraengine.com


First, by the nature of the underlying OS itself, it might not be possible to always measure exactly 16.667 ms; as you can already see from that number, it is a repeating decimal that can't be hit exactly. If you involve Sleep(), there are scheduling delays coming from the OS, so timing gets even more imprecise. Finally, you're forgetting about OS messages arriving in your game, which can also interrupt the frame rate depending on how much the OS has to communicate to the program (mouse movements, for example).

The common solution for getting a factor you can use to animate things is either calculating a delta time every frame or using a fixed timestep. Both have been discussed a lot on the forum already, so you might want to have a look.

Sleep() is very jittery – you're probably better off using WaitForSingleObject() with a timeout. But even so, it's not highly precise. Windows doesn't have a good analog of Linux usleep().

For a game, your best bet is to turn on vsync, and then measure actual time elapsed, accumulate it into some counter, and step your simulation once per 16.66 milliseconds accumulated. For each step, subtract 16.66 milliseconds from the accumulator. Once every few thousand frames or so, you might end up stepping the simulation twice for a frame, or not stepping it at all, which in practice is totally acceptable.

You can also say that “if I've run for 10 frames and the accumulated time delta is less than 1 millisecond, set the accumulated delta down to 0,” which would let you run at the exact frame rate, as long as there are no big hitches.
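To make the accumulator approach concrete, here is a minimal sketch under the assumptions above; stepSimulation and renderFrame are hypothetical placeholders, and vsync on the present call is what actually paces the loop:

#include <chrono>

// Hypothetical placeholders for the engine's own functions.
void stepSimulation(double dtMs);
void renderFrame(double alpha);

void mainLoop()
{
    using clock = std::chrono::steady_clock;
    const double stepMs = 1000.0 / 60.0;   // 16.666... ms per simulation step

    double accumulatorMs = 0.0;
    int framesSinceReset = 0;
    auto previous = clock::now();

    for (;;)
    {
        const auto now = clock::now();
        accumulatorMs += std::chrono::duration<double, std::milli>(now - previous).count();
        previous = now;

        // Step zero, one, or occasionally two times per rendered frame.
        while (accumulatorMs >= stepMs)
        {
            stepSimulation(stepMs);
            accumulatorMs -= stepMs;
        }

        // Optional drift reset from the paragraph above: every handful of frames,
        // if the leftover is under 1 ms, snap it to zero so the loop locks onto
        // the display rate.
        if (++framesSinceReset >= 10)
        {
            if (accumulatorMs < 1.0)
                accumulatorMs = 0.0;
            framesSinceReset = 0;
        }

        // Interpolation factor in [0, 1); the vsynced present call throttles the loop.
        renderFrame(accumulatorMs / stepMs);
    }
}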

enum Bool { True, False, FileNotFound };

This is a very common Windows issue which can be solved in several ways. Depending on how accurate you wish things to be you can generally narrow things down to one of two different solutions.

Solution #1. At the startup of your application, call “timeBeginPeriod” with a value low enough to give you reasonable accuracy. Generally speaking, if you set the period to 1 you should see greatly increased accuracy. Your jitters should be reduced to about 10-15% and you probably won't see the huge spikes any longer. This is about as good as you are likely to get without using a significantly more involved solution like #2 below.
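For illustration, solution #1 boils down to something like this; runGame is a placeholder, and timeBeginPeriod/timeEndPeriod come from winmm:

#include <windows.h>
#include <timeapi.h>                  // timeBeginPeriod / timeEndPeriod
#pragma comment(lib, "winmm.lib")

void runGame();                       // placeholder for the real main loop

int main()
{
    // Ask the scheduler for 1 ms timer granularity for the lifetime of the process.
    timeBeginPeriod(1);

    runGame();

    // Calls must be paired with the same period value.
    timeEndPeriod(1);
    return 0;
}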

Solution #2. First, you have to understand what is causing the issue in a general manner. (I say general because the details are complicated and I'm going to simplify a LOT.) In Windows, operations such as sleep/mutex/etc. send threads into a priority queue ordered by the time they are supposed to wake up. This is fine, but the manner in which Windows checks for things to wake up is the source of the issues. There is a tick rate (like a game frame rate) at which something checks for items on the queue to wake, and this fixed rate is the source of the jitters. Assume you put a thread to sleep for 15ms and, at the exact moment you put it to sleep, the OS has just barely passed its last check; we'll say you are 1ms past the last tick. At the next tick your item will still have 1ms to wait before it should wake up, so because of the fixed rate you will not wake up for 14ms+15ms: 14ms until the next tick (where your item still has 1ms to wait) plus a full 15ms until the tick after that, when it is checked again.
So, with that information in hand and a bit of spelunking in Windows internals, you can build a considerably more accurate solution. Basically, when you would normally sleep, you subtract the tick rate from the wait time and, if the result is still more than 0, go ahead and sleep for that amount. Then you spin-wait on the remainder. It's ugly, but it works and gives you much more accurate timing.

Now, given the above, a few notes:

  • Do not call “timeBeginPeriod(1); Sleep(1); timeEndPeriod(1);” sequences. This drives Windows nutty as it rejiggers the internals constantly. The overhead is significant, and it also tends to screw up other programs/drivers which require high-precision timing. Unfortunately these calls are global and you are impacting the entire system; globals are evil, as we all know.
  • #1 will still give you jitters. You can get the error rates down to 4-5% by using a bisecting sleep. This is kinda like #2 but without the busy wait: you sleep the known “good” time, then sleep ½ the remainder, recheck your time, sleep ½ the new remainder, and so on until you're close enough or the next sleep would be ≤ 1ms (a rough sketch follows this list). I've rarely found any need to go with the full #2 solution in reality.
  • #2 should use the _mm_pause instruction. Unfortunately, in Intel's wisdom, its duration is no longer constant across CPUs. Specifically, newer Xeons use a different (longer, I believe) delay for the instruction, meaning you need to detect the CPU and revision to figure out whether you need a different loop size. If you care about this, it just adds some complexity.
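A rough sketch of the bisecting sleep from the second bullet, assuming timeBeginPeriod(1) is already in effect and using QueryPerformanceCounter for the time checks; for simplicity it halves the remainder from the start instead of sleeping a known-safe chunk first:

#include <windows.h>

// Wait roughly targetMs milliseconds by repeatedly sleeping half the remainder.
void bisectingSleep(double targetMs)
{
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    for (;;)
    {
        QueryPerformanceCounter(&now);
        const double elapsedMs =
            1000.0 * double(now.QuadPart - start.QuadPart) / double(freq.QuadPart);
        const double remainingMs = targetMs - elapsedMs;

        // Stop once the next sleep would be <= 1 ms; that's "close enough" here.
        if (remainingMs <= 2.0)
            break;

        Sleep(static_cast<DWORD>(remainingMs / 2));  // sleep half the remainder, re-check
    }
}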

Hope this helps.

What I am doing right now is just continually checking the time:

while (true)
{
    // Busy-wait: poll the clock until the desired interval has elapsed.
    lastupdatetime = double(Microsecs()) / 1000.0;
    if (lastupdatetime - tm >= delaytimed) break;
}

Of course this makes the CPU usage very high for that thread, but I don't see any other way to get accurate timing.

10x Faster Performance for VR: www.ultraengine.com


That is, of course, about as accurate as you can get in reality. But it is pretty brute force and, as you've found, harsh on CPU utilization. At an absolute minimum, you should call the CPU's pause instruction at least once within the loop to let the CPU know this is a spin wait. For a complete solution you want to do two things:

#1 At startup, perform a little timing operation which figures out the duration of the “_mm_pause” instruction. I typically do this by looking for the number of times the instruction must be called to take about 1000ns. I also repeat this until I have 5 good samples, meaning that if I see a few outliers, I throw them out and try again. An outlier would generally be anything more than 25% different from a prior sample: if it's higher, throw out the current sample; if lower, throw out the prior samples.
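A sketch of that calibration, assuming a QueryPerformanceCounter clock; the burst size and the “restart on outlier” rule are simplifications of the scheme described above:

#include <windows.h>
#include <immintrin.h>   // _mm_pause
#include <cmath>
#include <cstdint>
#include <vector>

// Estimate how many _mm_pause() calls take roughly 1000 ns on this machine.
uint64_t calibrateSpin1000ns()
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    const double ticksPerMicrosecond = double(freq.QuadPart) / 1.0e6;

    std::vector<double> samples;
    while (samples.size() < 5)
    {
        // Time a fixed burst of pauses against the high-performance counter.
        const uint64_t burst = 10000;
        LARGE_INTEGER start, end;
        QueryPerformanceCounter(&start);
        for (uint64_t i = 0; i < burst; ++i)
            _mm_pause();
        QueryPerformanceCounter(&end);

        const double ticksPerPause   = double(end.QuadPart - start.QuadPart) / double(burst);
        const double pausesPer1000ns = ticksPerMicrosecond / ticksPerPause;

        // Simplified outlier rule: if a sample is more than 25% away from the
        // previous one, discard everything and start sampling again.
        if (!samples.empty() &&
            std::fabs(pausesPer1000ns - samples.back()) > 0.25 * samples.back())
        {
            samples.clear();
        }
        samples.push_back(pausesPer1000ns);
    }

    // Average the good samples.
    double sum = 0.0;
    for (double s : samples)
        sum += s;
    return static_cast<uint64_t>(sum / double(samples.size()));
}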

#2 Change your code to a decay loop:

// Assumes the count from #1 is stored in Spin1000ns.
// delayTime is in HPC ticks; convert externally for your target.
void decayTimer(uint64_t delayTime) {
  uint64_t current = HPCGetCounter();
  uint64_t target = delayTime + current;

  while (current < target) {
    // Convert the remaining delta from HPC ticks to milliseconds.
    uint64_t ms = HPCToMs(target - current);
    if (ms > 2) {
      // Assuming you've called timeBeginPeriod(1) somewhere..
      // Due to all the things Windows manages, it is almost guaranteed this will run
      // longer than you want it to, so we sleep for "half" the time.  If you still
      // see occasional spikes, increase the (ms > 2) check; I believe I use 5.
      Sleep(ms / 2);
    } else {
      // Close to the target: burn roughly 1000 ns of pauses instead of sleeping.
      for (uint64_t i = 0; i < Spin1000ns; ++i) {
        _mm_pause();
      }
    }

    // Update current.
    current = HPCGetCounter();
  }
}

This should chop the CPU busy time down significantly without giving up timer accuracy. If you need greater accuracy, you can take this a bit further and use a decay variation of the “Spin1000ns” loop, i.e. start at 1000ns, then use 500ns, then 250 etc etc till close enough. (NOTE: Double check the above, I'm sure I probably added a nice bug for the reader to figure out…. :D )
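For what it's worth, a minimal sketch of that decaying-spin variation, reusing the Spin1000ns value and HPCGetCounter() helper from the snippet above (so the same caveats apply):

// Spin toward 'target' (in HPC ticks), halving the pause burst each pass so the
// final check lands closer to the target time.
void decaySpin(uint64_t target) {
  uint64_t burst = Spin1000ns;          // start with ~1000 ns worth of pauses
  while (HPCGetCounter() < target) {
    for (uint64_t i = 0; i < burst; ++i) {
      _mm_pause();
    }
    if (burst > 1) {
      burst /= 2;                       // 1000 ns, 500 ns, 250 ns, ...
    }
  }
}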

Thanks for the info!

10x Faster Performance for VR: www.ultraengine.com

Beware that _mm_pause may not take a fixed amount of time even on a single system.

The use of the instruction is mainly to tell the rest of the system that you're inside a spin loop, and thus waiting on some cache line synchronization. What the system actually does with that information may vary from “nothing” to “wait an exact number of cycles” to “depends on complex multi-socket architectural state.”

Whether a cache line is dirty, reserved, or unavailable may change the duration. Whether or not there is other bus traffic may change the duration. Which particular core and socket you're running on may change the duration.

If you want repeatable timing on Intel microarchitecture systems, use either the RDTSC instruction and pray that it's not varying because of power states, or use the HPET infrastructure, which is defined for microsecond-accurate cross-power-state cross-CPU timing just like this.

On modern versions of Windows, QueryPerformanceCounter() can work pretty well, for example. Or, more precisely: It's the best possible timer you can get, because the kernel will choose the best implementation for the available hardware. Use it!
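As an aside, the HPCGetCounter/HPCToMs helpers used in the earlier snippets could be thin wrappers over exactly that API; something like this (an editorial sketch, not code from the thread):

#include <windows.h>
#include <cstdint>

// Ticks per second of the high-performance counter; constant after boot.
static uint64_t HPCFrequency()
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    return static_cast<uint64_t>(freq.QuadPart);
}

// Current counter value in HPC ticks.
uint64_t HPCGetCounter()
{
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    return static_cast<uint64_t>(now.QuadPart);
}

// Convert a tick delta to whole milliseconds.
uint64_t HPCToMs(uint64_t ticks)
{
    return ticks * 1000ull / HPCFrequency();
}

// Convert milliseconds to HPC ticks (handy for the delayTime parameter above).
uint64_t HPCFromMs(uint64_t ms)
{
    return ms * HPCFrequency() / 1000ull;
}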

That being said, what kind of external hardware are you synchronizing with that makes you need this kind of resolution, rather than just relying on blocking hardware like a sound card buffer callback or a graphics card presentation event?

enum Bool { True, False, FileNotFound };

@hplus0603 All true, but keep in mind the snippet is basing all timing on the HPC; the pause is just there to make the loop less of a bane to the CPU. Given that on Skylake and almost all processors (AMD included) other than the more recent Xeons the instruction is about 140 cycles, it will vary time-wise based on core boost and all that other nonsense. But since the loop checks the HPC for timing and doesn't rely on the 1000ns approximation, it still comes out very accurate. The reason you want to use pause is not for timing; it's simply a case of reducing strain on the ALU that has to increment the loop counter and compare the value. As stupid simple as that sounds, a tight loop will saturate the ALU; with the pause instruction the check happens roughly 100x less often and doesn't blow out the ALU. A side effect is that the CPU “looks” idle in such a loop instead of doing stupid work at an incredible rate.

Also of note, just spinning on the HPC is MUCH worse than spin waiting with or without pause. HPC is not a free call and has some thread related overhead, so pausing as much as possible to reduce the HPC calls is desirable.

This topic is closed to new replies.
