
Hundreds of timers...

Started by June 23, 2016 10:06 PM
28 comments, last by hplus0603 8 years, 2 months ago

Hello.

Let's put a player in an MMO, fighting against something or somebody. He has maybe 2 or 3+ buffs, 2 or 3+ DoTs, 2 or 3+ actions waiting on a cooldown, and maybe 2 or 3+ more things waiting too. That's more or less 10 to 15 timers. Now put 300 or 500 players in the game and you get 5000+ timers!

What are the usual ways to manage that many timers?

Thank you!

Thorna

Stick 'em in a priority queue and pop them off the top as they fire.

5000 isn't too bad, btw.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]


I have to agree that 5000 isn't too bad. Besides the queue solution, you could store a 'tick count' variable wherever you store the buff data, indicating when the buff/DOT/etc. ends, and have a dedicated thread checking whether that duration has passed. I mean a single thread for all buff timers, btw, not a thread for each buff. Something like:


struct BuffData
{
       // Other buff state (owner, effect, stack count, ...)
       unsigned long long duration_end; // absolute tick at which the buff expires
};
 
int BuffTimerThread()
{
        while (true)
        {
                unsigned long long current_ticks = GetCurrentTicks(); // read the game clock however you do it
                for (BuffData* buff_data : active_buffs)              // whatever container holds the live buffs
                {
                        if (current_ticks >= buff_data->duration_end)
                        {
                                // Remove the buff, or push a job to some queue
                                // if you don't want to spend time here
                        }
                }
                // Sleep or yield here so the thread doesn't spin at 100% CPU
        }
        return 0;
}

This worked pretty decently in my case, but if you have a bunch of similar problems it might be a better idea to use a job queue, for simplicity and development speed.

You do not need a separate thread to check if one value is larger than another.

It's as simple as ApochPiQ says. Dealing with buff timers will be the least of your challenges if you're making an MMO.

5000 integer compares? If they're in a linear array, and on an x86 chip (with its out-of-order core and clock-speed multiplied integer comparisons) you could probably reach about 6 per clock cycle, or under 300 nanoseconds.

Implement it cleanly and you'll spend more time in cache misses looking up a single item in a spatial grid than you will traversing that entire collection linearly.

If the collection uses something like a priority queue kept inside a ring buffer, the time drops to just checking the first few elements of a linear array; in practice that means little more than the cost of loading those first few elements into memory. It is about as close to zero cost as you can get while still doing actual work.

Yes, a priority queue is absolutely the right way to do this.

Separately, never store something like "number of ticks until you're done," because that requires writing every tick to update. Writes to memory cost more than just reads, by a pretty large factor.
Instead, store either "time at which I'm done," or perhaps "time when I started, and how many ticks from start it will go away" (depending on what's easier for your display UI).
Given the "current tick" you can then calculate the current state. Compared to memory writes, additions and subtractions are free, and divisions are very cheap.
enum Bool { True, False, FileNotFound };

5000 integer compares? [...] you could probably reach about 6 per clock cycle, or under 300 nanoseconds. [...] It is about as close to a zero-cost as you can get with actual work.

Well... 300 ns is a lot cheaper than what I was afraid of!

Yes, a priority queue is absolutely the right way to do this. [...] store either "time at which I'm done," or perhaps "time when I started, and how many ticks from start it will go away"

Well, I asked the question after looking at quite a lot of code in many MMO emulators, and my feeling was that there were a lot of different ways to manage those timers. My first idea was to ask the OS for a timer each time I needed one, but that didn't seem right (just a feeling...).

In a huge C# project (DOL), there is a very complicated system of many lists and caches, plus a dedicated thread whose only job is to run through all (I said 5000...) "timer objects" and execute callbacks when needed.

In another C++ project (mangos), I didn't find any piece of code dedicated only to running the timers, just an "update" call propagated to all objects... But I don't feel comfortable with C++ and I may have missed something.

I'm going to spend some time on a priority queue.

Thank you.

Just because an MMO emulator, generally written by students with a lot of time but not much experience, does something doesn't mean that it's the right way :-)
A priority queue has O(log N) or better behavior for insert-new-element and find-next-element-to-run. Typically implemented as a heap.

Typically, the code to run your timer events looks like:


  timestamp_t now = current_timestamp();
  while (!queue.empty() && queue.leftmost_item().timestamp <= now) {
    timer t = queue.pop_leftmost();
    t.execute();
  }
It really is that simple. What will take time is the work actually done by each timer function.
If your timer drives everything (not necessarily a good idea), then you might want to make the timer's execute() queue work onto a pool of worker threads, but that's really only a win if the locking overhead is much less than the CPU cost of running all the events, which means that most events cannot have many dependencies on other parts of the system.
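
For a concrete (hypothetical) version of that loop in C++, here is a sketch using std::priority_queue as a min-heap on the fire time; TimerEvent, the callback type, and the time handling are assumptions made for the example, not code from any particular engine.

#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct TimerEvent
{
    uint64_t fire_at_ms;            // absolute time at which the event should run
    std::function<void()> execute;  // whatever work this timer performs
};

// std::priority_queue is a max-heap by default, so order "later first"
// to keep the earliest deadline on top.
struct LaterFirst
{
    bool operator()(const TimerEvent& a, const TimerEvent& b) const
    {
        return a.fire_at_ms > b.fire_at_ms;
    }
};

using TimerQueue = std::priority_queue<TimerEvent, std::vector<TimerEvent>, LaterFirst>;

void RunDueTimers(TimerQueue& queue, uint64_t now_ms)
{
    while (!queue.empty() && queue.top().fire_at_ms <= now_ms)
    {
        TimerEvent t = queue.top();  // copy out before popping
        queue.pop();
        t.execute();                 // may safely push new timers onto the queue
    }
}

Inserting a timer is just queue.push({fire_time, callback}), and RunDueTimers() gets called once per game tick with the current time.
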
enum Bool { True, False, FileNotFound };

5000 integer compares? If they're in a linear array, and on an x86 chip (with its out-of-order core and clock-speed multiplied integer comparisons) you could probably reach about 6 per clock cycle, or under 300 nanoseconds.

Most likely, it will need not only compares, but also conditional jumps, and each conditional jump is a potential pipeline stall, costing around 10-20 clocks each :-(.

Yes, a priority queue is absolutely the right way to do this.

Sure, the only question being "what kind of priority queue we're speaking about". If it is something like std::priority_queue<>, it will get a fair O(log(N)) share of pipeline stalls (and probably cache misses too) on inserts. Linear scan-based stuff will be technically O(N), but will be faster for small Ns due to better prefetching (for integers, below N=50 linear scans will be faster almost for sure, above N=500 - very unlikely, in between - It Depends). For OP's N=5000, I'd probably bet on something along the lines of std::priority_queue<>. Then, my Really Wild guesstimate (give or take an order of magnitude) on insert/extract time would be around 1000 clocks (13 jumps on memory with last 4-5 of them being really close and fitting within the same cache line, plus 13 conditional jumps with (?) half of them causing stalls, plus some other unspecified stuff which always happens). If somebody tests it - please share results as it is always interesting to see how much such estimates are off (though within an order of magnitude it should be fine, I hope) ;-).
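
Out of curiosity, here is a rough microbenchmark sketch one could use to test that guesstimate; it is entirely hypothetical scaffolding, and the numbers will vary a lot with compiler and CPU. It pushes and pops N = 5000 random deadlines through a std::priority_queue min-heap, then drains an unsorted vector of the same deadlines by repeated find-min + swap-remove (O(N^2) comparisons, i.e. the linear-scan approach taken to its worst case; a per-frame check would only scan once).

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <queue>
#include <random>
#include <vector>

int main()
{
    constexpr int N = 5000;
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<uint64_t> dist(0, 1000000);
    std::vector<uint64_t> deadlines(N);
    for (auto& d : deadlines) d = dist(rng);

    using clock = std::chrono::steady_clock;
    uint64_t sink = 0;  // keeps the compiler from optimizing the work away

    // Heap: N pushes followed by N pops (all timers inserted, then drained in deadline order).
    auto t0 = clock::now();
    std::priority_queue<uint64_t, std::vector<uint64_t>, std::greater<uint64_t>> pq;
    for (uint64_t d : deadlines) pq.push(d);
    while (!pq.empty()) { sink += pq.top(); pq.pop(); }
    auto t1 = clock::now();

    // Linear scan: find and swap-remove the minimum N times from an unsorted vector.
    std::vector<uint64_t> scan = deadlines;
    auto t2 = clock::now();
    while (!scan.empty())
    {
        auto it = std::min_element(scan.begin(), scan.end());
        sink += *it;
        *it = scan.back();  // swap-remove keeps the vector contiguous
        scan.pop_back();
    }
    auto t3 = clock::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::printf("heap: %lld us, linear scan: %lld us (sink=%llu)\n",
                (long long)us(t0, t1), (long long)us(t2, t3), (unsigned long long)sink);
    return 0;
}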


Writes to memory cost more than just reads, by a pretty large factor.

While I do wholeheartedly support the idea of using "happening at" semantics for "when it is going to happen" (instead of "happening in" semantics), I have to disagree about write costs. As a rule of thumb (though not without its exceptions), writes are _faster_ than reads. A read from main memory will take 100+ clocks (and L3/L2/L1 reads will take around 40/12/4 clocks respectively); writes on modern CPUs usually (give or take, and NOT when you're writing 100% of the time) take around 1 clock, because the write is just buffered in hardware and most of the time you don't care when it really reaches the cache, or whether it causes a cache eviction. Sure, there are exceptions (after all, it is possible to overload the memory bus, and there are other considerations like inter-core cache invalidation too), but overall, for a normal program it is reads which are expensive (though once again, I am NOT advocating for unnecessary writes).

Read from main memory will take 100+ clocks


{writes} ... are around 1 clock


Ah! You're looking at thread-blocking stall time. I'm looking at overall system throughput.

What will actually happen when you write is that a cache line will be allocated for the write (which needs bus arbitration on multi-CPU systems), and the bytes will be put in the right part of the cache line (or, for unaligned writes, cache lines). Then a read will be issued to fill the rest of the cache line. Then, once the cache line gets evicted, the memory bus will be used again to put the data back out.
At a minimum, this will take twice as long as reads. If you have data structures (or a heap) that alias cache lines to different pointers that different CPUs want to update, they will end up ping-ponging the cache line, costing more than 10x the "base" cost.
Similarly, unaligned writes that span cache line boundaries may cause significant additional slowdowns, depending on the specific (micro-)architecture. (Some RISC architectures disallow unaligned access for this reason.)

Btw: When a CPU thread stalls, waiting for a read to complete, a hyper-threaded CPU will switch to its next thread. Hopefully, that thread was previously stalled waiting for a cache line that has now loaded, so it can still do useful work.

The recommendation is still: if you can do a compare and a branch to avoid a write, that will improve your overall throughput if you are memory bound (which almost all workloads are).
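
As a trivial illustration of that pattern (the function and variable names here are made up for the example): skipping the store when the value is already correct avoids dirtying the cache line, so it never has to be written back.

// Compare-and-branch to avoid an unnecessary write.
void SetHealth(int& health, int new_health)
{
    if (health != new_health)   // read + predictable branch...
        health = new_health;    // ...store only when something actually changed
}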

I think we're somewhat below the useful level of abstraction for timers based on priority queues, though :-)
And, finally: If you "step" each entity in your world each game tick anyway, you might as well just have an array of current attached timers inline in each object, and compare the current game time to the time in those timers; they don't need a queue separate from the collection of all game entities.
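
A minimal sketch of that last idea (Entity, AttachedTimer, and OnTimerFired are invented for the example): each entity carries a small inline array of timers and checks them during its own per-tick update, so no separate timer queue is needed.

#include <cstdint>

struct AttachedTimer
{
    uint64_t fires_at = 0;  // absolute game time; 0 means "slot unused"
    int effect_id = 0;      // which buff/DOT/cooldown this slot represents
};

struct Entity
{
    static const int kMaxTimers = 16;
    AttachedTimer attached_timers[kMaxTimers];  // inline in the entity, no separate queue

    void Update(uint64_t game_time)
    {
        for (AttachedTimer& t : attached_timers)
        {
            if (t.fires_at != 0 && game_time >= t.fires_at)
            {
                OnTimerFired(t.effect_id);  // remove the buff, apply the DOT tick, etc.
                t.fires_at = 0;             // free the slot
            }
        }
        // ...rest of the per-tick entity update...
    }

    void OnTimerFired(int /*effect_id*/)
    {
        // Whatever the expiring timer is supposed to do goes here.
    }
};
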
enum Bool { True, False, FileNotFound };

This topic is closed to new replies.
