
MMOG Server-Side. Front-End Servers and Client-Side Random Balancing

Started January 04, 2016 08:01 AM
34 comments, last by Sergey Ignatchenko 8 years, 9 months ago


You absolutely need to shard the simulation based on the virtual world position.

If your world is large enough - sure.


Which, in turn, means that it's more efficient to have view/front-end servers dedicated to particular geographical streams.

Ah, so you don't argue with Front-End Servers, phew :D. From my perspective, Front-End Servers are "almost universally a good thing", and client-side Load Balancing is good on a case-by-case basis.

I wrote a little bit about the possibility to "map" some of the Front-End Servers to specific Game World Servers under the "Affinity" section, but will need to elaborate more, THANK YOU! That being said, I don't really like such "affinities", because they make deployments much more messy (as pools of Front-End Servers become small, having a reserve server for each pool becomes unfeasible, which leads to a requirement for server failure detection on the server side, which in turn means that it needs to be made SPOF-free, and so on, and so forth). It is doable, but it is quite a big mess on the admin side.

On the other hand, as long as published/synced states are rather small (and they should be, as they need to go over the network), I would seriously consider the possibility of having a random distribution of the clients over all the Front-End Servers, regardless of virtual-world location. While it will indeed need more Game-Server-to-Front-End-Servers traffic (which in turn means a performance hit, though not that big a one, as most of the work done by Front-End Servers is serving clients), it will simplify management of Front-End Servers greatly (to the point of "we just need to keep one more server than we really need to serve clients, that's it; no scripts to make mistakes with, no additional hardware, nothing"). And "less management" means "fewer chances to screw up". I am not trying to say that this will universally work, but one shouldn't underestimate deployment-time complexities when running a hundreds-of-servers farm, so if there is a possibility to simplify deployment by an order of magnitude at the cost of, say, a 5% performance hit, I would certainly jump at the opportunity. And whether the hit will be 5% or more - that depends on the game.
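
To show just how simple the client side of this gets, here is a minimal sketch (the server names and the list-delivery mechanism are made up for illustration). The client knows the full list of Front-End Servers - shipped with the client, or fetched once at login - and simply rolls the dice, with zero coordination on the server side:

#include <random>
#include <string>
#include <vector>

// Pick one Front-End Server uniformly at random (assumes a non-empty list).
std::string pick_front_end(const std::vector<std::string>& servers) {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, servers.size() - 1);
    return servers[pick(rng)];
}

// Usage: auto server = pick_front_end({"fe1.example.com", "fe2.example.com"});

If the chosen server happens to be down, the client just re-rolls (see the failure-handling sketch further down the thread).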

If your world is large enough - sure


We're talking MMOGs here. A world that can support 10,000 players is orders of magnitude larger than a Battlefield or Quake level.

I don't really like such "affinities", because they make deployments much more messy


Have you actually worked on and deployed a MMOG? It's not about "liking," it's about pure possibility.

as pools of Front-End Servers become small, having a reserve server for each pool becomes unfeasible


We solved that by having "pools" be defined by the software they run, not the configuration they use.
There's not a "reserve front-end server for the Faerie Isle" as opposed to a "reserve front-end server for the Canyon of Doom."
There's one (or more) "reserve front-end servers" and when something goes down, you insert the host into the correct spot with a minimum amount of re-configuration.
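
(A purely hypothetical sketch of what "pools defined by software, not configuration" can look like - not hplus0603's actual setup, just an illustration: every host runs identical front-end software, and the only per-zone state is one registry entry, so promoting a reserve host is a one-entry change rather than a re-deployment.)

#include <map>
#include <string>

// Illustrative zone->host registry; all names are invented.
std::map<std::string, std::string> zone_to_host = {
    {"faerie_isle",    "fe-host-01"},
    {"canyon_of_doom", "fe-host-02"},
};

// After fe-host-02 dies, ops (or a script) re-points its zone to a reserve.
void promote_reserve(const std::string& zone, const std::string& reserve_host) {
    zone_to_host[zone] = reserve_host;
}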

If it turns out that front-end serving is low-CPU and you actually want multiple "zones" mapped to the same physical host, you can use containers or some other multi-tenant or virtualization option to make that happen.
Although beware virtualization in twitch-sensitive games: some hypervisors and some OS-es in some situations cause unacceptable scheduler delays, where your virtual server may be "gone" for hundreds of milliseconds or even a full second. That won't work for an MMOFPS ;-)

Anyway, it turns out that the role of "filtering the data stream" is much simpler than the role of "physically simulate a world," so it's often convenient to simply make front-end-server == simulation-server for the vast majority of your world. Only really big events, where only some people participate, and most spectate, such as MOBA tournaments or whatever, actually need the fan-out part in practice.
enum Bool { True, False, FileNotFound };

Have you actually worked on and deployed a MMOG? It's not about "liking," it's about pure possibility.

Ok, as you're asking (BTW, thanks for the opportunity to brag about it :D). Besides being a co-architect of a G20 stock exchange 20 years ago, and the Chief Architect of a game which (while being non-simulation) still processes around a billion user messages a day, during the last 10 years I've performed quite a bit of consulting (due diligence, etc.). This consulting/due-diligence included quite a few games, including MMOs in your (quite narrow) definition of the term. Which puts me in a rather unique position to generalize across the whole spectrum of very different over-the-Internet games. And the two-and-a-half most important lessons I've learned in the process are the following:

1. All the games which work over the Internet and have over 100 players at the same time have a lot in common. Whether it is an MMOFPS or a stock exchange, they still have a "game world" which processes user inputs, simulates whatever-logic-you-want-to-throw-in, and pushes updates to the clients. Of course, there are lots of differences (UDP vs TCP, very different ways to compress the data, client-side prediction, name-your-poison), but there are still striking similarities (just two examples: the state-sync concept is universal and absolutely necessary for all such games despite implementation differences, and the single-threaded event-processing loop is universal across the board; and so on, and so forth). As a result, I tend to call all such games MMOs (I hate to argue about terminology, but this definition also seems to be supported by Wikipedia, which lists the MUD-style Gemstone IV as one of the very first MMORPGs).
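
Just to illustrate that last point, the single-threaded event-processing loop in a minimal sketch (all type names are mine, not from any specific engine):

#include <chrono>
#include <deque>
#include <thread>

struct Event { int player_id; /* input payload */ };

struct GameWorld {
    void apply(const Event&) { /* apply one player input */ }
    void simulate_one_tick() { /* physics, NPC logic, ... */ }
    void publish_state()     { /* the "state sync" part: push updates out */ }
};

// One thread owns the whole world: no locks on game state, no data races.
// In real life 'inputs' would be a thread-safe queue fed by the network layer.
void event_loop(std::deque<Event>& inputs, GameWorld& world) {
    const auto tick = std::chrono::milliseconds(50);  // 20 network ticks/second
    auto next = std::chrono::steady_clock::now();
    for (;;) {
        while (!inputs.empty()) { world.apply(inputs.front()); inputs.pop_front(); }
        world.simulate_one_tick();
        world.publish_state();
        next += tick;
        std::this_thread::sleep_until(next);
    }
}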

1 1/2. In spite of popular opinion, simulation games are not that different from the rest when speaking about protocols and overall architectures. From a very high-level point of view, they still have a server which receives user inputs and calculates state, derives a publishable state out of it, and pushes this publishable state towards the clients (filtering it for different clients when/if necessary). Only implementation details (such as TCP vs UDP, complexity of calculations, etc.) are different, and even these are not black-and-white (just as one example, MMORTS are known to use either UDP or TCP, and while UDP tends to work better, they can be playable with TCP too). Instead of being fundamentally different from the rest of the gaming world, simulation games are just sitting at one end of a spectrum (with the opposite end of the spectrum occupied by farm-like social games and stuff such as Lords&Knights). While the difference between the two ends of the spectrum is drastic, there are lots of things in between (including, but not limited to, casino-likes, stock exchanges, arenas, and MMORTS) which make it a kind of "continuous spectrum", with neighbours having lots in common, but the ends being indeed drastically different.
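
And to make "derives publishable state ... filtering it for different clients" concrete, a minimal sketch (interest management boiled down to a naive distance check; all names are illustrative):

#include <cmath>
#include <vector>

struct PlayerState { int id; float x, y; };

// Publishable state for one viewer: only what this client needs (and is
// allowed) to see; real games use smarter interest management than a radius.
std::vector<PlayerState> publishable_for(const PlayerState& viewer,
                                         const std::vector<PlayerState>& all,
                                         float radius) {
    std::vector<PlayerState> visible;
    for (const auto& p : all)
        if (std::hypot(p.x - viewer.x, p.y - viewer.y) <= radius)
            visible.push_back(p);
    return visible;
}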

2. For each problem there is a whole spectrum of solutions, ranging from "it will never work" to "the optimal one". And close to the "optimal one" there is usually a solution which will work, but has practical drawbacks. Just as one example: blocking RPCs in game loops "will never work", but when it comes to the choice between "messages" and "non-blocking RPCs", the choice is based on game specifics and even on personal experience. It is in these cases of choosing between two solutions which will both work that I'm speaking about "likings" (and yes, this choice is quite subjective, not to mention that it depends on game specifics a lot).
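
To spell out why blocking RPCs in game loops "will never work" (the DB API below is hypothetical, declarations only, purely for illustration):

#include <functional>

using Money = long long;
Money db_get_balance(int player_id);                                    // blocking call
void  db_get_balance_async(int player_id, std::function<void(Money)>);  // queues a reply event

// "Will never work": the whole game loop - and therefore every player in
// this game world - freezes until the DB replies.
void on_purchase_bad(int player_id) {
    Money balance = db_get_balance(player_id);  // game loop stalls right here
    (void)balance;
}

// Works: fire the request and handle the reply as just another event,
// processed by the same single-threaded loop a few ticks later.
void on_purchase_ok(int player_id) {
    db_get_balance_async(player_id, [](Money balance) {
        (void)balance;  // runs inside the event loop when the reply arrives
    });
}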

Now to your solution (the one "with a minimum amount of re-configuration"). Yes, it will work (and that's what I meant when I said "It is doable, but..."). But no, for a wide spectrum of games it is not the only one which will work. Moreover, from what I've seen, any kind of server-side failure detection tends to cause trouble (this is a generalization over a few dozen systems and many dozens of system-wide failures I have seen and was told about). That's why I (whenever possible) strongly prefer to avoid making decisions about server failures on the server side at all. And client-side load balancing does exactly that - it means exactly zero configuration on the server side to make the system handle the failure of one of the Front-End Servers (and no special trickery such as virtualization is necessary either); in other words, failure handling under client-side random balancing is KISS in its almost pure form. That's exactly why I "like" it better (YMMV, batteries not included). It MIGHT happen that it doesn't work for your game - but it MIGHT as well work, so writing it off without taking it into consideration is not exactly wise.
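
In fact, the whole of the client-side failure handling amounts to something like this (a sketch; try_connect() is a hypothetical game-specific helper, and pick_front_end() is the random pick from the sketch earlier in the thread):

#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

std::string pick_front_end(const std::vector<std::string>& servers);  // as above
bool try_connect(const std::string& server, int timeout_ms);          // hypothetical

// If a server doesn't answer, cross it off and re-roll the dice; the server
// side does exactly nothing special to make this work.
std::string connect_to_some_front_end(std::vector<std::string> servers) {
    while (!servers.empty()) {
        std::string s = pick_front_end(servers);
        if (try_connect(s, 3000))
            return s;
        servers.erase(std::find(servers.begin(), servers.end(), s));
    }
    throw std::runtime_error("all Front-End Servers are down");
}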


Anyway, it turns out that the role of "filtering the data stream" is much simpler than the role of "physically simulate a world," so it's often convenient to simply make front-end-server == simulation-server for the vast majority of your world.

This is another good example of what I call "liking". Your experience pushes you to combine Front-End Servers with Game World Servers. Mine pushes me to play it the other way around (unless game specifics show otherwise). However, we seem to agree (correct me if I'm wrong) that both these approaches will work, that the difference is not that drastic, and that neither of these approaches will be a Fatally Wrong Decision which breaks the game (especially as, with the right architecture, it can be changed down the road if necessary). IMHO, this is a good point of mutual understanding, as (unlike the "it will never work" stuff) it is all about personal experiences and personal judgements (which are inevitably different for different people).

Peace?

Peace?


You made lots of good points in your discussion above, thank you for that.

The point I have made twice, and you have not addressed, remains:

For games where players interact based on some game geography, your simulation (and thus view) servers need to be aware of that geography.
This, in turn, ends up dictating that view servers that attempt to serve the entire world will introduce N-squared data problems (all sim servers talk to all view servers) as well as capacity limitations (the maximum size of your world is limited by the amount of RAM you can access in one tick on one view server) where using affinity does not.

Thus, there are several reasons why using random load balancing for view servers will break, and I haven't heard a single plausible argument for why it would be worth it yet. The one argument I heard was that "if you have stand-bys per affinity group, that's a lot of stand-bys" which I don't think is a real argument, because no sane operations person would actually need to do that.
enum Bool { True, False, FileNotFound };

view servers that attempt to serve the entire world will introduce N-squared data problems (all sim servers talk to all view servers) as well as capacity limitations (the maximum size of your world is limited by the amount of RAM you can access in one tick on one view server) where using affinity does not.


Ok, if you insist, but it is going to be even longer than my previous post.

------------------------------

First, let's speak about amount of RAM we can access in one tick.

Let's do some maths. It will require some assumptions, and YMMV, but there are games out there which will more or less fit within these numbers. First, let's assume a not-so-uncommon 20 network ticks/second. Second, let's assume that we're using standard 2-socket boxes as our Front-End/view servers. Currently, the "sweet spot" for these boxes (price/performance-wise) lies around 2x6-core boxes, 2x(12-15M) L3 cache, and 8G RAM (they can be rented at around EUR150/month before any discounts, which is damn cheap). Let's take one of them (with 2xE5645) as an example.

Now, let's calculate usable RAM bandwidth. Of course, I could take the official 32GBytes/second per socket and get to a nicely-looking 32*2/20 = 3GBytes/tick, but we both know that saturating memory bandwidth in any realistic processing is utopia :-) . To get a more realistic estimate, we need to take RAM latencies into account. Each random (non-prefetched) read from main RAM costs around 100 CPU clocks (even if it goes to 200 under really heavy RAM load, it won't change the end result); let's assume that our code reads a whole 64-byte cache line (which is around one object's state, BTW) on each "main memory" access. Then, for 12 threads running on those 6 cores of one socket, we'll be able to access 12 * 2e9 clocks/sec / 100 clocks/main-RAM-read * 64 bytes/main-RAM-read = 1.5e10 bytes/second, or 750MBytes/network-tick. For 2 sockets, it goes up to a whopping 1.5GBytes/network-tick (which is surprisingly not that different from the official numbers, indicating that the E5645 is well-balanced for our purposes). And when we're speaking about publishable (!) states (which should have their frequently-modifiable part within 50-100 bytes/player anyway), 1.5GBytes is a Damn Lot of RAM (and that's even before starting to think about locality and prefetch, and about the benefits provided by caching). I won't go as far as claiming that RAM-wise it is possible to support 1.5GBytes/100bytes = 15M players (mostly because such a claim is very much meaningless), but still, the number is damn huge, and view/Front-End Servers will become CPU-processing-bound much earlier than RAM-access-bound. As a result, I don't really see RAM as a limiting factor for at least a very wide range of games out there.
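
The same back-of-the-envelope estimate, spelled out as code (every input below is an assumption from the text above, not a measurement):

constexpr double clocks_per_sec  = 2e9;  // per core, E5645-class
constexpr double threads         = 12;   // 6 cores x 2 HW threads, one socket
constexpr double clocks_per_read = 100;  // non-prefetched read from main RAM
constexpr double bytes_per_read  = 64;   // one cache line
constexpr double ticks_per_sec   = 20;

constexpr double bytes_per_sec =
    threads * clocks_per_sec / clocks_per_read * bytes_per_read;   // ~1.5e10
constexpr double bytes_per_tick = bytes_per_sec / ticks_per_sec;   // ~750MBytes
// Doubling for the second socket gives the ~1.5GBytes/network-tick above.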

------------------------

Now, to those N-squared problems when M sim/Game-World Servers are speaking to N view/Front-End Servers. While it is indeed NxM links in total, from the point of view of any single Game World Server or Front-End Server, it is only N or M links respectively. Therefore, from the load-on-each-server perspective, what we're speaking about is actually linear (rather than quadratic) scaling. The actual question here is "which portion of resources will be eaten by handling those links on each of the servers". Let's make a few very wide guesses. Let's assume a world with 30 Game World Servers simulating 10000 players (this is the point where numbers vary greatly, but well, I need to start somewhere). And then let's assume another 30 Front-End Servers handling it. Let's further assume that the (uncompressed, no dead-reckoning, etc.) stream for updating each (moving) player is 50 bytes/tick. Then, updates to all those 10K players will get us around 50 bytes/player/tick * 10000 players * 20 ticks/second = 10MBytes/sec. This is the amount of updates received by each of the Front-End Servers. The amount of updates sent by any of the Game World Servers will be 10000/30 players * 50 bytes/player/tick * 20 ticks/second * 30 Front-End Servers = the very same 10MBytes/sec. These numbers are very far from causing any kind of trouble for modern servers. The amount of traffic in the whole server-to-server network will be 300MBytes/second, but this amount won't apply to any specific server.
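
The same example traffic numbers, spelled out (all inputs are the assumptions above):

constexpr long long players       = 10000;
constexpr long long bytes_per_upd = 50;   // per moving player per tick
constexpr long long ticks         = 20;   // network ticks per second
constexpr long long game_servers  = 30;
constexpr long long front_ends    = 30;

// Each Front-End Server receives updates for the whole world:
constexpr long long fe_in_per_sec = players * bytes_per_upd * ticks;  // 10 MBytes/sec
// Each Game World Server sends its share of players to every Front-End Server:
constexpr long long gw_out_per_sec =
    (players / game_servers) * bytes_per_upd * ticks * front_ends;    // same 10 MBytes/sec
// Whole server-to-server network: 30 senders x 10 MBytes/sec = 300 MBytes/sec total.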

----------------

Basically, my answer to both your concerns above is along the lines of "yes, this is a potential problem, but most likely it won't hurt for a long while".

That being said, if we go further and further with numbers, your point about these two issues will indeed become more and more important. At some point (though I am arguing that it will be later rather than sooner, see above), this will become a classical scalability issue. Scalability issues as such are beyond the scope of Chapter VI (I didn't want to scare readers beyond what is absolutely necessary), and this thing is one of the dozens of scalability issues which I'm planning to discuss in Vol. 2.

Strictly speaking, to illustrate my point that there are at least SOME games out there which will work fine without affinity, the above reasoning should do. But as your concerns might indeed create the impression that this whole thing doesn't scale (which would be pretty bad), I will go an extra mile and explain it further.

IF (and I still insist there is a pretty big "IF" here) one of these things ever becomes a problem, then we will need to introduce affinity. However, in a wide class of cases we'll be able to play it in a different-from-what-you're-doing manner (and without that dreaded server-side failure detection). As a Plan B for such a possibility, I'm planning to play it the following way (this description was originally planned for Vol. 2):

- keep N view-servers x M sim-servers (with client-side random balancing) for as long as feasible
- at the point WHEN/IF it becomes unfeasible - split both view-servers and sim-servers into two roughly-even groups, with typical (client-side) "sharding" *between* the groups, and client-side random balancing *within* each of the groups(!) - see the sketch right after this list
- as described above, by that time I'm very much expecting the size of the view-server groups to be rather large (at the very least 5-10 servers). This means that we can easily keep a per-group spare without incurring too much cost, and without introducing server-side failure detection (which is certainly one of my pet peeves; I really _hate_ these things, as they fail on me much more frequently than they're expected to).
- as soon as two groups become insufficient - split them into three groups, and so on
- yes, re-sharding can become a problem, but in practice there are ways to deal with it
- this approach completely eliminates O(N^2) (as soon as the size of each group is limited to MAXN, it instantly becomes O(N*MAXN) = O(N)), and also allows us to put a hard limit on RAM capacity if that ever becomes a problem.
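
The promised sketch of the Plan-B selection logic on the client (all names are mine): classical sharding *between* the groups, plus the very same random balancing *within* the chosen group:

#include <cstddef>
#include <random>
#include <string>
#include <vector>

using Group = std::vector<std::string>;  // Front-End Servers of one group

std::string pick_front_end_sharded(const std::vector<Group>& groups,
                                   std::size_t world_region) {
    const Group& g = groups[world_region % groups.size()];  // sharding step
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, g.size() - 1);
    return g[pick(rng)];  // random step, exactly as before, within the group
}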

---------------------

Phew, I hope that I've managed to explain it well enough.

To summarize:

Yes, what you're saying are valid concerns. However, I'm saying two things. The first is that "in practice, I don't see these problems happening for a long while for a whole damn lot of the games out there" (and I've never seen this be a problem myself; my problems with these architectures were of a different nature, mostly with "game worlds" being ultra-small rather than ultra-large). The second is that "even if it does become a problem, there is a relatively simple way to handle it without changing the overall approach, keeping random client-side balancing".

That's long, but I think your assumptions aren't quite right.

First, each simulation server needs to provide the full state of the world to each view server. This is more than 50 bytes per player per tick. And it multiplies by the number of view servers, on each of the simulation servers (hence, NxM.) I mentioned RAM and CPU, and there are also other constraints. 60 servers for 10,000 simultaneous online is ... not unreasonable, all in all, but you can do significantly better. (I think our current messaging cluster for my current place of work uses about 15 servers for > 100,000 simultaneous connected but I haven't counted in a while.)

Second, the scarce resource may also be low-latency network switching capacity. Are all the servers plugged into a single switch? For small worlds, that is fine. For bigger worlds, that starts becoming a problem -- while you can buy 480-port switches, the cabling becomes a problem! (Or you go virtual chassis, and then you get into capacity/distance questions.)

Or do you go with a top-of-rack topology? Then the uplink capacity of the TOR becomes the bottleneck. If you co-locate sim servers with view servers, you get less of that problem, and if one view server only needs to receive the world state of one sim server, you reduce the load by a factor of N!

Random front-end selection is fantastic when the connection is largely stateless, and/or there is ONE source of truth, rather than an aggregate of many sources of truth. MMOs (both massive, and multiplayer, and online) are about as far from that ideal load condition as you can get, though.

So, again, I don't see any compelling argument FOR doing random load balancing, but I see (and have lived through) significant arguments AGAINST doing so, for simulated virtual worlds. This is why the two M-s in MMO make it fundamentally different from most other network applications in the world. Most web sites, scientific computing, enterprise systems, and yes, even electronic markets, have different constraints.
enum Bool { True, False, FileNotFound };

First, each simulation server needs to provide the full state of the world to each view server. This is more than 50 bytes per player per tick.

To be very precise here: after the initial synchronization, all that the simulation server needs to provide is *updates* to the full state of the world. This means that static stuff such as PC inventory, levels, relations, and whatever-else doesn't need to be transferred on each tick. In fact, the only things which really change on each tick are coordinates/velocities.

Given this, and given that the 28-byte IP+UDP header overhead per update doesn't apply to server-to-server communications (where many updates are batched into each packet), 50 bytes is more than enough for quite a few games out there (as always, YMMV). What do we need to send on each tick? Object ID (4 bytes), (x,y) as floats (that's 8 bytes total), probably (vx,vy) as floats too (another 8 bytes); angle - that's another 4-byte float (though actually, in most cases it can be squeezed down to 1 byte; let's count it as 4 for now). The z coordinate is usually simpler than that, but let's even assume it (together with vz) is another 8 bytes, which brings us to a total of 32 bytes/player/tick. It means that we're almost 20 bytes below our declared 50 bytes, leaving enough space for "some other stuff" (such as a "crouching" flag or an animation-frame number). And we didn't even start to take into account that non-moving players are exempt from this calculation (and my guess is that there should be quite a few currently-static players in your game, right?), and that certain parts, like angle, actually change much more rarely than that, which allows for further trivial reductions if we feel like it. I've seen a 3D sim system which was in the range of 30 bytes/player/tick myself (though I don't remember whether it involved any compression).
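
For concreteness, one plausible (purely hypothetical) layout of such a per-player update, matching the 32 bytes counted above (assuming the usual 4-byte floats):

#include <cstdint>

#pragma pack(push, 1)
struct PlayerUpdate {             // one per moving player per network tick
    std::uint32_t object_id;      //  4 bytes
    float x, y;                   //  8 bytes
    float vx, vy;                 //  8 bytes
    float angle;                  //  4 bytes (often squeezable down to 1 byte)
    float z, vz;                  //  8 bytes
};
#pragma pack(pop)

static_assert(sizeof(PlayerUpdate) == 32, "expected 32 bytes/player/tick");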

Moreover, even if for some game it is not 50 bytes but an atrocious 500 bytes/player/tick (which, even if it happens, would most likely be a Really Bad case of being Really sub-optimal) - it won't change the overall picture that much (ok, it will be 100MBytes/sec per server, still not enough to kill the whole thing).


60 servers for 10,000 simultaneous online is ... not unreasonable, all in all, but you can do significantly better. (I think our current messaging cluster for my current place of work uses about 15 servers for > 100,000 simultaneous connected but I haven't counted in a while.)

As I've said, it depends on the specifics of your game a lot. So, I'm not surprised that your numbers are 4x lower than what I wrote (especially for your type of relatively-static game, compared to really-fast-paced MMOFPS stuff), but I've seen more than my estimate of 60 (and am pretty sure that there are games out there which need 5x more than that, for all the good reasons). And this is even before we start accounting for implementation differences and (often atrocious) sub-optimalities (for exactly the same game, I've seen a 50-server farm handling 8x more players than a 400-server farm - that's a 50x+ difference in efficiency - and I've heard about even worse cases of poor implementations).


Second, the scarce resource may also be low-latency network switching capacity.

From my experience, network switches are the very last thing to cause any kind of trouble, whether reliability-, latency-, or performance-wise. I've never ever seen any problems in this regard (save for the occasional duplex mismatch or broken cable, but that's beside the point now). Moreover, I've never even *heard* of anybody having any problems with switches. I could dig into official Cisco manuals to get hard numbers, but I don't feel there is a need for it. 300MBytes/sec for the whole switch (and that's in large packets, not in 60-byte game updates going outside) is so small that it is very well beyond any realistic problems. Heck, it is even much less than what one single 10Gbit/s link can handle.


So, again, I don't see any compelling argument FOR doing random load balancing, but I see (and have lived through) significant arguments AGAINST doing so, for simulated virtual worlds.

Your experience tells you this, and my experience tells me quite the opposite. However, as soon as we're into arguments about what constitutes "compelling" and "significant", we're IMHO in the realm which I've named "liking" above (and it Really Really depends on game specifics, on personal experiences, on specific implementations, and so on).

Again, what about agreeing to live in peace? :-)

For what it's worth, my experience, both as a programmer for and a player of MMOs, tends to favor hplus's view and to disagree with No Bugs' view.

It is odd to see these posts talk about MMOs and then have comments about a few hundred or a few thousand users. MMO has a meaning. In an era when online games had a few hundred concurrent players, and a small number of them reached into the few thousands, a new class of games reaching into the tens of thousands of concurrent players emerged. They were not just multiplayer games, but massively multiplayer. The common dividing line between multiplayer and massively multiplayer is the sharp jump at about 5,000 to 10,000 concurrent users. That is the point where you need a radical engineering change, moving from direct servers with simple front-ends to a high-performance server system. These days MMOs have grown into hundreds of thousands of concurrent users, but the bar for switching from regular multiplayer to massively multiplayer remains around the 5,000-10,000 concurrent user mark.

It is relatively easy to build a small server that handles a single-digit or double-digit number of concurrent players. Getting into the triple digits, a few hundred concurrent players, can take a bit of architecting to get right. But if you want to cross into the tens of thousands of concurrent players, you are looking at a complete transformation of the architecture.

For a multiplayer online game you can load balance as No Bugs Hare described. You're looking at a few thousand operations per second, and it scales easily enough. But when you add a few zeros to the number of players, and you start dealing with the fact that each player interacts with their neighbors - when you cross over from 'multiplayer online' to 'massively multiplayer online' - you can no longer scale that way.

Let's assume a world with 30 Game-World servers simulating 10000 players


I'm not surprised that your numbers are 4x lower than I've wrote


No, the capacity is 20x larger than what you wrote. Unless you mean that the 30 servers are simulating 10,000 players *each*, in which case I can guarantee that you don't understand the requirements of physical simulation.

But the other math you did (update bandwidth requirements) doesn't bear that out -- if you had 10,000 players per game server, so that 300,000 players have to send updates to 30 view servers each tick at 30 Hz, and those updates are 50 bytes per player, that is way more data than the 100 MB/s you suggested above would be the limit at 500 bytes per player.

given that 28-byte-IP+UDP-header-overhead-per-update doesn't apply to server-to-server communications


This seems to me like a nonsensical statement. Packet overhead is packet overhead and doesn't change (unless you're on dial-up with Van Jacobson compression.) Nobody in their right mind would send only a single update in a packet anywhere (be it to client, to server, or to view distributor) and I don't understand why you even bring it up.

It sounds to me like you have some experience with other kinds of scalable systems (which is fine,) and you've sniffed around the edges of some multiplayer online games, but you haven't actually built, deployed, and operated a massively multiplayer online game where players physically interact with each other, for real. That's fine as far as that goes. Your defense of random/stateless distribution for MMO games, beyond what actual facts show, doesn't seem motivated to me, and I wonder why you keep digging in like that. What's the actual benefit? What's in it for you?
enum Bool { True, False, FileNotFound };


But if you want to cross to the tens of thousands of concurrent players you are looking at a complete transformation of architecture.

What I'm trying to say is that, whatever the number of players, there is one architecture which will work pretty well across the board (and yes, it is give or take what you guys are already using). Yes, for 100 players you might get away with a different architecture (maybe even with something as really silly as "1-thread-per-incoming-packet with a mutex over the game state"), but I don't really see any reason to use it.

The question of Front-End Servers is relatively minor in the Big Scheme of things (though personally I like them A LOT), and so is Load Balancing. While I am NOT saying that they're a "silver bullet" which applies everywhere, I do contend that they will work for quite a few games out there, even with hundreds of thousands of simultaneous players (heck, for a non-sim game I've done it myself a while ago). Yes, at those numbers there may be complications for sim games (like those "groups" I've mentioned above), but if you're starting with a reasonably good architecture (one which is more or less what most of you guys are using anyway) - these complications won't make you rewrite the whole thing. However, the devil is in the details, and it is important to establish the applicability limits of different technologies; this includes Front-End Servers and Client-Side Load Balancing. That's what I'm trying to do here, so if you have any specific concerns (like those hplus0603 has brought up - and I think I've answered) - please go ahead, I do appreciate an opportunity to learn from smart and experienced people :D.

