
Hundreds of timers...

Started by June 23, 2016 10:06 PM
28 comments, last by hplus0603 8 years, 2 months ago

Ah! You're looking at thread-blocking stall time. I'm looking at overall system throughput.

What will actually happen when you write is that a cache line will be allocated...

Yep, it is a loooong story. However, the writing thread still does NOT need to wait for the write to be completed.

The recommendation is still: If you can do a compare and a branch to avoid a write, that will improve your overall throughput if you are memory bound (which almost all workloads are.)
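
To illustrate (a made-up example; the point is only the compare-before-write pattern):

// Skip the store when the value is already correct: the read can be
// served from a clean/shared cache line, while an unconditional store
// dirties the line and forces it to be written back eventually.
inline void set_if_changed(int& slot, int value) {
    if (slot != value)   // compare + branch...
        slot = value;    // ...store only when actually needed
}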

Ahh, it depends on HOW your workload is "memory-bound". From what I've seen, real-world workloads (except for DBs) tend to be memory-LATENCY-bound (i.e. the time the thread spends waiting for a read), rather than memory-THROUGHPUT-bound. Even for DBs I am not 100% sure about it, since the FSB was replaced with NUMA ten years ago...

Alternatively, if we don't want to rely on "whatever I've seen", we can do some math. A Xeon E5 can pump up to 100GB/sec at 14 cores (http://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/10 ), or about 2.5 bytes/core/tick. While it is certainly possible to write a program pumping more, for any kind of real-world load (except, maybe, for a DB) such patterns will be quite unusual. Even more so for simulation games, which traditionally do a bit more than just memcpy :-).

And we're back to square one ;-). In other words, I contend that most of the loads out there are memory-LATENCY-bound, and therefore writes are cheap (and reads are expensive).

I think we're somewhat below the useful level of abstraction for timers based on priority queues, though :-)

As usual :-).

Because of the less critical timing of MMORPGs, you may be able to handle this kind of thing without a lot of the thread overhead incurred when discrete timers are used.

Possible way: a sliding-window 'time' bucket chain (with time increments of 1/5th or 1/10th of a second, or possibly even cruder intervals), where objects add activator tickets to the 'buckets' (linked lists) for event timings (debuff, start buff after warmup, etc.).

When the time becomes 'current', the tickets get processed and the FSM logic executes whatever is needed (including putting new tickets further down the timeline). If you have LONG delay times, they can be bucketed in a separate Long Time Chain with bigger time increments (like 15 seconds); when they get closer to the current time, the ticket moves into the Short Time Chain. A code sketch follows the list below.


- Fixed-length bucket chain (a ring array of pointers with the 'current' index moving); insertion is done at an offset from 'current'.

- Possibly some space allowed for 'past due' entries in case of server slowdown issues.

- The server now drives the clock advance, allowing constant, flexible timing control (instead of a multitude of real timers which would have to be 'patched' individually).

If there are special DEBUFF events affecting this whole game mechanic, you have complications anyway: the object would track its own list of 'In Effect' buff data, so removal from the master timer bucket chain is again an offset lookup, plus an extra linked-list search to pull the now-defunct ticket.

- This mechanism can be used as a general 'wakeup' system, though it may be advantageous NOT to generalize it, and instead split it into many separate timer chains for different flavors of use.

Also, if you can avoid threads and their overhead, do so: use fibers/microfibers (a software threading mechanism, versus system-resource 'hard' threads).

Again: MMORPG timings can be pretty loose (imprecise), giving you leeway to do less costly processing.
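
A rough sketch of that bucket chain, assuming a fixed ring of buckets one tick wide (all names made up; the Long Time Chain is omitted, so delays must fit within the ring):

#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <vector>

struct Ticket { std::function<void()> fire; };   // the 'activator ticket'

class TimerWheel {
public:
    explicit TimerWheel(size_t buckets) : buckets_(buckets) {}

    // Insertion is just an offset from 'current' (delayTicks must be
    // smaller than the ring size, or it would wrap onto a nearer slot).
    void add(uint32_t delayTicks, Ticket t) {
        buckets_[(current_ + delayTicks) % buckets_.size()].push_back(std::move(t));
    }

    // The server advances the clock one tick at a time and processes
    // whatever has become "current".
    void tick() {
        current_ = (current_ + 1) % buckets_.size();
        for (Ticket& t : buckets_[current_]) t.fire();
        buckets_[current_].clear();
    }

private:
    std::vector<std::list<Ticket>> buckets_;   // ring array of bucket lists
    size_t current_ = 0;
};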

---

As always: free pools for constantly used data structures/objects (like the 'tickets' mentioned above) greatly reduce alloc/dealloc overhead.
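
A minimal free-pool sketch along those lines (hypothetical names; recycled objects are handed back as-is, so callers reset them):

#include <vector>

template <typename T>
class FreePool {
public:
    T* acquire() {
        if (free_.empty()) return new T();   // grow on demand
        T* p = free_.back();
        free_.pop_back();
        return p;                            // recycled, not re-constructed
    }
    void release(T* p) { free_.push_back(p); }
    ~FreePool() { for (T* p : free_) delete p; }
private:
    std::vector<T*> free_;                   // the free list
};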

Overcomplicated and premature optimization.


The vast majority of Guild Wars 2 game timers run on a single priority queue. On any given map instance there are probably many tens of thousands of timers in the list, with resolutions down to 40ms (the game simulation runs at 25Hz). There is no relevant performance penalty to doing this.

The actual work associated with a given timer (doing damage, changing status effects on a character, causing movement, etc.) is almost always several orders of magnitude more expensive than even the most simple and naive implementation of a timer queue.
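
For reference, the simple-and-naive version of such a queue can be as small as this sketch (made-up types, not ArenaNet's actual code):

#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct Timer {
    uint64_t when;                 // simulation time, in 40ms ticks
    std::function<void()> fire;    // the (comparatively expensive) work
};

struct Later {                     // earliest 'when' ends up on top
    bool operator()(const Timer& a, const Timer& b) const {
        return a.when > b.when;
    }
};

using TimerQueue = std::priority_queue<Timer, std::vector<Timer>, Later>;

// Called once per simulation tick: pop and run everything that is due.
void run_due(TimerQueue& q, uint64_t now) {
    while (!q.empty() && q.top().when <= now) {
        Timer t = q.top();         // copy out before popping
        q.pop();
        t.fire();
    }
}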


Profile and measure before you optimize. It will save you a lot of pain and over-engineering on stuff that doesn't actually matter.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

It depends on the implementation as well... For example, in a .NET language, depending on how you handle your objects you can make terrible decisions and have lots of small objects (a separate one for each timed object, for example) being created every second, leading to a GC spike later on.

A simple resolution would be to create an object pool to store them, removing most of the GC funkiness .NET is known for.

But why are you worrying about this in advance? Even if you did have this issue, it seems like something extremely easy to refactor later.

Also, if you can avoid threads and their overhead, do so: use fibers/microfibers (a software threading mechanism, versus system-resource 'hard' threads).


The main cost of a system thread is the memory for the stack. You don't avoid that cost for a fiber. There are some other costs in pre-emptive threads, mainly having to do with the cost of kernel entry for a context switch, but the difference in cost between fibers and threads is not an order of magnitude (except perhaps in some very synthetic cases).
enum Bool { True, False, FileNotFound };

The main cost of a system thread is the memory for the stack. You don't avoid that cost for a fiber. There are some other costs in pre-emptive threads, mainly having to do with the cost of kernel entry for a context switch, but the difference in cost between fibers and threads is not an order of magnitude (except perhaps in some very synthetic cases).

I thought the key system-thread overhead was the locks and such around making a system-level call (and would memory (cache misses) be any more of a fiber issue than one caused by heavyweight thread switching?). The mechanisms (data) for 'fibers' don't have to be on the stack (I would have thought they most often get preallocated on the heap as a utility for other processes to use, and avoid a lot of locks because they control the processing flow in a linear/sequential fashion).

Perhaps my term 'fiber' (meaning just a flat software pseudo-thread library) isn't the same as the 'fiber' libraries that have been developed out there in app-land.

For me it was something I programmed myself, with a requirement that the 'fibers' have sleep points to allow lockless, context-switch-free multitasking within a single process.

Priority queue is the right thing, but it may be too much. A simple array (vector / deque) will do, too. Sorting an array is fast, especially sorting a mostly-sorted array. Removing elements from the beginning of an array is a memmove. This reduces inserting new timers to one write followed by one sort operation (one for all inserts, not one for each!), and removing old timers to a memmove, or if you use a deque, possibly something less costly. Checking timers is, depending on how you implement it, reduced to checking one integer, or to checking M integers where M is the number of timers that fire during this very simulation step (with M << N). This is very, very little work. Start at index zero, iterating linearly. Stop if(evt.time > current_time). Remove elements before current element. Done.
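
In code, that scheme might look roughly like this (Event and fire are placeholders):

#include <algorithm>
#include <vector>

struct Event { unsigned time; /* payload... */ };
void fire(const Event&) { /* ...the actual work... */ }

// Check timers: walk from index zero, stop at the first event that is
// not due yet, then remove everything that fired with one erase (memmove).
void run_due(std::vector<Event>& events, unsigned current_time) {
    auto it = events.begin();
    while (it != events.end() && it->time <= current_time)
        fire(*it++);
    events.erase(events.begin(), it);
}

// Insert timers: one push_back per event, then a single sort for the
// whole batch (cheap on a mostly-sorted array).
void add_batch(std::vector<Event>& events, const std::vector<Event>& batch) {
    events.insert(events.end(), batch.begin(), batch.end());
    std::sort(events.begin(), events.end(),
              [](const Event& a, const Event& b) { return a.time < b.time; });
}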

I thought the key system-thread overhead was the locks and such around making a system-level call


You don't need locks just because you have threads; it depends on your data structures.
Similarly, you don't avoid the need for locks when you use fibers, unless you apply very strict rules to when you may call fiber-switching functions. Which means things like, "don't do I/O in fibers," which ends up being too restrictive for most cases.
The benefit of fibers is that re-scheduling to another thread/fiber is cheaper in userspace than in kernel space. That's it. If you are drowning in context switches, then fibers may help. But if that's the case, you're probably threading your application too finely anyway.

The mechanisms (data) for 'fibers' don't have to be on the stack


You are in a fiber. You make a function call. Where does the state of the current function get saved? Where do locals for the new function you call get allocated?
Separately, that other function you call, may call a yielding function. Where does the OS save the locals of each function on the call stack while it switches to another fiber?
Of course fibers need stack space. You may allocate that from the heap or other unallocated memory regions for each fiber, just like the OS allocates stack space for threads from unallocated memory space.
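
For what it's worth, the Win32 fiber API makes this explicit: CreateFiber takes a stack size, and the OS allocates that stack for the fiber. A bare-bones (Windows-only) example:

#include <windows.h>

void* g_mainFiber;                       // fiber to switch back to

// Runs on the worker fiber's own stack, which CreateFiber allocated.
void WINAPI FiberMain(void*) {
    // Locals here live on the fiber's stack; SwitchToFiber preserves it.
    SwitchToFiber(g_mainFiber);
}

int main() {
    g_mainFiber = ConvertThreadToFiber(nullptr);  // current thread becomes a fiber
    void* worker = CreateFiber(64 * 1024,         // explicit stack size
                               FiberMain, nullptr);
    SwitchToFiber(worker);                        // run FiberMain until it yields
    DeleteFiber(worker);                          // frees that fiber's stack
}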
enum Bool { True, False, FileNotFound };
I love how this thread ended up in premature optimization land. I've seen the following naïve code in practice:


for all characters:
    for all timers:
        if the timer is ready:
            fire it
            mark it for removal
    remove all timers marked for removal
And it worked fine for thousands of characters. You can optimize it by keeping each timer list sorted by time so that you can stop early as soon as you know there's no point iterating over the rest. And obviously it's optimized further by putting all the timers in one sorted priority queue as suggested above. But you can do that if and when timers start showing up in your profiles.
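
For concreteness, that pseudocode translates to roughly the following C++ (types are hypothetical):

#include <algorithm>
#include <vector>

struct Timer {
    unsigned when = 0;
    bool fired = false;
    void fire() { fired = true; /* ...do the actual work... */ }
};
struct Character { std::vector<Timer> timers; };

// The naive loop: fire everything that is due, mark it, then sweep
// the marked timers out in one pass per character.
void update(std::vector<Character>& characters, unsigned now) {
    for (Character& c : characters) {
        for (Timer& t : c.timers)
            if (t.when <= now) t.fire();
        c.timers.erase(
            std::remove_if(c.timers.begin(), c.timers.end(),
                           [](const Timer& t) { return t.fired; }),
            c.timers.end());
    }
}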

To add to my previous "just use a vector or deque" post -- not like it really makes a difference since "some thousand" is a no-issue performance-wise -- you could sort by time in reverse order and iterate backwards... that way, removing all events that have already fired would be a single resize().

That would be one push_back (or emplace_back) per event to be added, followed by one std::sort if any events were added. On a deque, push_front might be slightly better since that arguably places new timers closer to where they belong (or not, who knows... and who cares :D ). Or using a deque at all might be 2x slower... I'd still go with a vector, no need to try and be smarter than necessary.

And it will be something like this for checking all events that fired, and removing them:


auto ev = vec.end();
while ((ev != vec.begin()) && ((ev - 1)->time < current_time))
    fire(*--ev);                   // fires the earliest events first (sorted descending)
vec.resize(ev - vec.begin());      // keep only the events that haven't fired

Sorting after new timers have been added is the only thing there that consumes noticeable time at all, but a single sort on some thousand things shouldn't be an issue either way for something that happens rarely. The thing that happens all the time (check whether any event fired) is dead cheap, and that's what counts.

Now, if that still isn't fast enough, I wonder if one could get away with inserting at the correct end, then using nth_element with some heuristic, and some extra checks just to be sure the partial sort was complete enough... Good luck finding someone who is willing to maintain that thing then, though. :D

O(M*N) approach

...

And it worked fine for thousands of characters

Sure enough, it does! No surprise you've seen such implementations. :lol:

