Advertisement

Berkeley Sockets and IOCP

Started by May 28, 2017 11:14 PM
13 comments, last by hplus0603 7 years, 5 months ago

Recently I asked a very seasoned developer on his personal recommendation for receiving packets, queuing, and processing them. To provide a little context, I asked:

Hi,


I've been studying Uru code. I had a question.

I'm looking for how packets are queued and eventually exposed to rest of game. Would you recommend running low level packet recv and send on a thread, while communicating to the rest of the game via event queue? In particular I wanted to see how synchronization primitives were used to make a higher level abstraction for game logic code to utilize.

I'm assuming there's actor model style encapsulation and messaging going on. But like we talked about its hard to see big picture by reading code due to distributed nature.

Thanks!
Randy

And got this response:

Assuming that you're going to be running many game contexts in one process, I have a complicated answer to your question, but the short answer is to receive on worker threads and queue messages onto a per-game queue, which is then scheduled to be run when the worker threads run out of input to receive. GetQueuedCompletionStatus with zero timeout works perfectly for this. There is probably a similar but harder way to do this with epoll, which I've read about but have avoided so far because it seems not as well designed as IOCP and kqueue. Packets should be sent directly from the game worker thread, not queued to be sent. This whole design in general avoids context switching.


The games themselves are essentially fat actors that run single threaded.

So it sounds like he's recommending to use a thread pool to receive packets as top priority. If no packets are ready, the worker threads do packet processing. Here packet processing is opening them up, decryption, or whatever else needed. Then they place these processed packets into a queue for the game thread to pick them up.

Sounds like a totally solid design. The worker threads are reused as much as possible. A more naive design would have jobs submit to a threadpool for packet receives, and separate jobs for packet processing. This involves potential context switches between jobs when the OS does scheduling. Another naive solution is to do packet receive + processing on one job, which does not prioritize picking up new packets, and will have likely context switches once each packet + process job is completed.

That's the context. My question is, are Berkeley sockets implemented as a slightly higher level layer compared to IOCP on Windows? Sort of like how fscanf does internal buffering of syscalls, is there a similar relation to raw Windows APIs for IOCP and the POSIX API?

I'm trying to gauge how high or low level my current POSIX implementation is. I plan on writing some dedicated game server code soon, but don't really want to pay for Windows servers, and was hoping to just keep my POSIX stuff as-is. However, also wanted to learn a little more about IOCP stuff. To understand perf tradeoffs, as I'll be paying for server costs myself.

Hmm you know I don't think I asked my question very well. I'm trying to gain insight on these different options for UDP in terms of collecting packets and handing them off to game instances:

  • recvfrom on custom thread pool, blocking
  • recvfrom on custom thread pool, polling with non-blocking
  • recvfrom without a pool, polling with non-blocking
  • IOCP for Windows
Advertisement
IOCP is a Windows concept, there is no connection to the POSIX standard or to Berkeley sockets at all.

The equivalent notions on other platforms are epoll and kqueue.

Polling a socket is generally wasteful and scales poorly, but it's acceptable in the 1-20 connections range. For larger connection counts or better latency, you are much better off having your reads block a dedicated thread that wakes up when the OS sends them data.

Whether it makes sense to use IOCP or the other solutions depends entirely on your scale needs. If you don't need hundreds of simultaneous connections with minimal per-packet latency, don't bother with complex approaches like IOCP.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Thanks for the reply. Here we see an option (option 3, under the section UDP: Only-One-Socket Problem).

Polling a socket is generally wasteful and scales poorly, but it's acceptable in the 1-20 connections range. For larger connection counts or better latency, you are much better off having your reads block a dedicated thread that wakes up when the OS sends them data.

Whether it makes sense to use IOCP or the other solutions depends entirely on your scale needs. If you don't need hundreds of simultaneous connections with minimal per-packet latency, don't bother with complex approaches like IOCP.

In the link it seems like its saying having reading from socket on a separate thread, and dispatching packets to a pool, is a pretty optimized approach. And to optimize beyond this would be to have multiple threads reading off a socket.

In this case wouldn't polling be a like a spinlock? And blocking be like signal/sleeping? Just asking for a little clarification on what you mean by "polling is generally wasteful".

Also I found it interesting that link describes a couple options and states they seemed roughly equal to IOCP in terms of practical performance. Thoughts?

Multiple threads per socket is not going to be an improvement, though. It just shifts the complexity around.

A socket is a logical unit of communication. If reading from it is done on N threads, and responding to that input is only done on, say, one thread, you have an N-to-1 bottleneck. If response is done on N threads, you have N threads of synchronization nightmares if any of the inputs affect each other - and in a game, that's almost a guarantee.

Polling is somewhat like spinning, yes, in that you're keeping a thread "hot" just to sit around and do no useful work. Blocking lets the thread go dormant until input is ready, much like using a signal or condition variable could wake a thread in general.


The advice you linked is fine for games < about 1000 concurrent connections per host. But my experience is that IOCP (and comparable setups) vastly outscales select() or other asynchronous/non-blocking approaches. The idea is to minimize the amount of time either waiting on a socket or waiting on a thread synchronization primitive.

Artificial benchmarks are one thing, but actual games that need >1k CCU on a single host are rare. I think this contributes to some of the mythology and controversy around asynchronicity tactics. It's hard to reach a scale where you can even measure the differences, but once you do, the differences add up fast.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

FWIW: IOCP and non-blocking sockets are quite different architectures, and actually aim different workloads. BTW - we're speaking about servers, right?

With IOCP, they maintain a thread pool for you, and process each incoming request in-whatever-thread-is-more-convenient-for-them. It means that IF you want to access some-state-shared-between-different-requests - you DO need an explicit mutex lock within your request handler. IOCP is handy - for those cases when, while processing your requests, you do NOT need to lock on a mutex (or when contention on the mutex/whatever-other-resource is negligible). IOCP is explicitly supported on Win-only, but it is possible to write pretty much the same thing on Linux too.

An alternative approach is to dedicate your sockets (like "all the 10 sockets for this MOBA match") to appropriate threads (often, though not strictly necessary - the very same threads which run your game loops) - and to use select()-like function (these days - poll()/epoll()/kqueue()/WaitForMultipleObjects() are superior to select(), but the difference is not interesting at this level) to get the packet as it arrives from any of these sockets (or if nothing arrives - you wait until your next network tick starts). In this case, you do NOT to bother about mutexes and can simply feed the packets from select()-like function into your game loop (it is ok, because everything stays within the same thread). As a rule of thumb - this Shared-Nothing architecture is _superior_ to IOCP-based one at least for games (when applying IOCP to games - it means pretty high contention on the mutexes, causing lots of unnecessary context switching).

An anecdotal evidence from the real world. Once upon a time, there was a game with hundreds of thousands of simultaneous players - and it had servers running on Windows, with ~30 non-blocking sockets per each thread (and using WaitForMultipleObjects() to get the packets), and with ~10K players/server. And at some point an experiment was made to rewrite it to IOCP (because everybody was running around saying "IOCP is THE thing, we're stupid not to use it"). As a result - after spending those several weeks necessary to do it, IOCP did work, but performance difference, while being pretty small (~5%), was in favor of the _original_ non-blocking Shared-Nothing implementation.

Bottom line: IMNSHO, while IOCP may be good for web-browser-like type of load (BTW, historically IOCP was developed specifically for IIS), and while it IS possible to write workable stuff based on IOCP, I _strongly_ prefer classical Shared-Nothing non-blocking select()-like-based approach to IOCP; as a side benefit - it works exactly the same way on Linux too (replacing WaitForMultipleObjects() with epoll() is very straightforward).

Phew (sorry if it is a bit sketchy)...

But my experience is that IOCP (and comparable setups) vastly outscales select() or other asynchronous/non-blocking approaches.

Have to say that my experience is different - see, in particular, that real-world experiment above, plus also there is a plausible theoretical explanation too: Shared-Nothing stuff works better because it causes significantly fewer context switches (context switch, including cache invalidation costs, can cost as much as 1M CPU clock cycles(!)), AND because of better affinity of threads to RAM (which is important for NUMA - and for last 20 years typical "workhorse" servers are 2-socket NUMA boxes). And we didn't even start to discuss stuff such as RSS/RPS/... - where affinity to threads/cores is even more important (but dealing with RSS/RPS/etc. admittedly goes into the realm of black magic). In addition - while IOCP tends to scale worse-than-linear, non-blocking stuff tends to scale better-than-linear(!) - which is a complicated phenomenon which won't fit into the post here. And from a very different perspective - well, Linux is known to scale at least as good as Windows (and without IOCP), which means that IOCP is not really THAT important as MS is trying to tell us :-).

Last but not least on scaling: of course, trying to push _all_ the sockets in one single thread won't scale - but having like 30 sockets/core for 10K players will get us to 300 threads, and 300 threads per 10 cores tends to scale surprisingly well (though probably, the "sweet spot" is even lower - at 50 threads per 10 cores, but this is already game-dependent etc.).

Advertisement

Thanks for the answer No-Bugs Hare!

An alternative approach is to dedicate your sockets (like "all the 10 sockets for this MOBA match") to appropriate threads (often, though not strictly necessary - the very same threads which run your game loops) - and to use select()-like function (these days - poll()/epoll()/kqueue()/WaitForMultipleObjects() are superior to select(), but the difference is not interesting at this level) to get the packet as it arrives from any of these sockets (or if nothing arrives - you wait until your next network tick starts). In this case, you do NOT to bother about mutexes and can simply feed the packets from select()-like function into your game loop (it is ok, because everything stays within the same thread). As a rule of thumb - this Shared-Nothing architecture is _superior_ to IOCP-based one at least for games (when applying IOCP to games - it means pretty high contention on the mutexes, causing lots of unnecessary context switching).

I wanted to clarify here; are you proposing that game instances sit in a dedicated thread, along with associated sockets? So this thread has an entire self-contained loop requiring no state syncs with other threads? Something like this (pseudo code):


while ( 1 )
	packet = select( sockets )
	process( packet )
	send_to_game( packet )

Or were you suggesting something slightly different?

I'm assuming the process could be pretty similar for UDP, but select replaced with a little more custom abstraction.

For a simulation game - it would be more like this (note similarities to the classical game loop):


while ( 1 )
	packet = select( sockets, time_until_next_network_tick() )
        if( packet )
  	  game_process_packet( packet )//similar to game loop's "process_inputs()"
            //MAY just store the packet and process within game_update()
        else//timeout, need to simulate next tick
          game_update()//similar to game loop's "update()"
          game_post_updates_to_clients()//similar to game loop's "render()"

A potentially-better-performing alternative (assuming that you do NOT really need to process packets until the next tick, and for UDP - make sure to test it for losing extra packets due to stack running out of space while waiting(!) - and play with SO_RCVBUF if necessary) could be


while ( 1 )
        sleep( time_until_next_network_tick() )
        while( 1 ) //read all the packets from the UDP stack buffer
   	  packet = select( sockets, 0 )//timeout is 0, no blocking here!
          if(packet)
    	    game_process_packet( packet )//similar to game loop's "process_inputs()"
          else
            break
        //no more packets - proceed with the tick
        game_update()//similar to game loop's "update()"
        game_post_updates_to_clients()//similar to game loop's "render()"

Just for completness on later readers; I have studied multiplatform network IO before )I started to implement it into my game engine and decided to run an abstraction system using the two different but also some kind of similar systems as described in this stackoverflow post

https://stackoverflow.com/questions/2794535/linux-and-i-o-completion-ports

So some kind of IOCP equivalent exists in some way but different

IOCP is vastly superior to non-blocking I/O, not so much because it is inherently bettter (it is not) but because a) you use it with overlapped I/O, which is the "correct" and "natural" thing under Windows, and b) you can do networking and file I/O and some general tasks in one and the same thing, and c) a great deal of work has been done to make them perform well whereas nonblocking sockets are merely a compatibility hack implemented in the winsock library (... which, by design, needs polling).

IOCP is much like epoll or kqueue, only just the other way around. You pass some descriptors to watch to the kernel once, and the kernel keeps it in its books (as opposed to select/poll where you must pass the descriptor set every time), and then you can wait on something to happen as often as you like. But rather than being notified of readiness (epoll/kqueue), you are notified of completion.

IOCP might, for some obscure reason, do some polling (I wouldn't know, and I won't say it doesn't since I don't know) but in principle it doesn't need to do that. You fire off an overlapped operation, and the kernel completes it (with a thread from its I/O pool that is blocked until an interrupt happens).
Then, it signals the IOCP which wakes and boosts your last waiting thread's priority by one level for two quantums, so you have a fast response (no such thing as "high priority" needed, this is done automatically). Wake order is LIFO, not arbitrary, which is also beneficial.

Another nice property of overlapped I/O is that it's practically impossible to have your receive buffer fill up, simply because as long as you have some in-flight operations left over, it's being read without you doing anything, whenever something comes in. Whereas with nonblocking I/O, nothing happens while your thread isn't explicitly receiving. Which might in the odd case cause a packet to get dropped because of a full buffer. That'll admittedly be a rare thing to happen, but it's more likely anyway.

This topic is closed to new replies.

Advertisement