
MMOs and modern scaling techniques

Started by June 10, 2014 01:26 PM
65 comments, last by wodinoneeye 10 years, 4 months ago

Redis will have that issue; Riak, Cassandra, Dynamo won't.


I have two questions:

1) Have you actually tried it? I've measured those document stores pretty extensively, and I concluded they're not up to that task. Probably because they're not designed for that task. You might as well say that Oracle and DB/2 can scale here, too, because they also support distributed transactions.

2) If you get "horizontal scalability" but at the cost of 1,000x less individual efficiency, then do you think that a business based on that will actually survive? The reason it works for web is that 99.9% of the time, cross-user interaction at low latency isn't important. For games, 99.9% of the time, it is important.

Or, to put it another way: Do you think that Google Talk/Hangouts, or Microsoft Messenger, or Skype, use "web technologies" for the real-time chat parts? Do you think the brokering of an AIM message goes through Cassandra?
enum Bool { True, False, FileNotFound };

Redis will have that issue; Riak, Cassandra, Dynamo won't.


I have two questions:

1) Have you actually tried it? I've measured those document stores pretty extensively, and I concluded they're not up to that task. Probably because they're not designed for that task. You might as well say that Oracle and DB/2 can scale here, too, because they also support distributed transactions.

2) If you get "horizontal scalability" but at the cost of 1,000x less individual efficiency, then do you think that a business based on that will actually survive? The reason it works for web is that 99.9% of the time, cross-user interaction at low latency isn't important. For games, 99.9% of the time, it is important.

Or, to put it another way: Do you think that Google Talk/Hangouts, or Microsoft Messenger, or Skype, use "web technologies" for the real-time chat parts? Do you think the brokering of an AIM message goes through Cassandra?

#1. Yes. Up to what task, specifically? Speaking of Riak, since that's what I have the most experience with: assuming you have adequate provisioning, it's fast enough to read/write between server and ring within a single frame, or a few, depending on how your cluster is set up and whether you're using TCP, protocol buffers, or a custom InfiniBand driver. We're not talking concrete numbers here, and that significantly impacts the nature of the discussion we're having, so please use some if you're going to attempt to refute every claim. This discussion requires context: state what you mean by "loose physics", because it's possible given the right constraints.

Oracle <-> Riak/Cassandra/Dynamo is a false equivalence. Distributed transactions in Oracle are about forwarding across data layers; distribution across a Riak cluster is about providing fault tolerance and improving read access through denormalization. So no, I may very well not say that, because it is false.

#2. You're providing an extremely exaggerated number, bordering on reductio ad absurdum, in the context of comparing it to Redis. It's on the order of 2-3x in the average case, an order of magnitude in worse cases, without application-specific optimizations. Again, depending on the latency tolerance of a particular application, that may very well be acceptable in exchange for the ability to scale to hundreds of thousands or millions of users efficiently (at least as far as storage/retrieval goes). If your business is teetering on the brink because managing 10 servers versus 3-4 is the breaking point (or 100 versus 40, or whatever scale you choose for that ratio), then your problems as a business are elsewhere: in marketing, sales, market fit, etc. Operating costs should not determine your survivability at those ratios.

As to the last point, I'm not even trying to argue that; the question is "how much?" If your latency tolerance is above X, then these techniques can work. If it's not, then they won't, and I've never claimed otherwise. So you have to determine what that number is for your application before you can discuss techniques.
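Purely as an illustration of what "determine that number" means in practice (a minimal sketch; the 50 ms tick and the fetch call are hypothetical placeholders, not a real Riak or Redis client), the check is something like:

```cpp
// Sketch: does one datastore round-trip fit inside a simulation tick?
// fetch_player_state() stands in for a real blocking client call.
#include <chrono>
#include <iostream>

void fetch_player_state() {
    // A real implementation would do a network round-trip here.
}

int main() {
    using clock = std::chrono::steady_clock;
    const auto tick_budget = std::chrono::milliseconds(50);  // e.g. a 20 Hz tick

    auto start = clock::now();
    fetch_player_state();
    auto elapsed = clock::now() - start;

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    std::cout << "round-trip: " << us << " us of a "
              << tick_budget.count() << " ms budget\n";
    // If one round-trip eats a big slice of the tick, per-entity reads during
    // simulation are out; you batch, cache, or keep the state in-process.
}
```

If the measured round-trip is a small fraction of the tick, the approach is on the table; if it isn't, it's not, which is the whole point.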

#3. No, of course I don't. I fail to see the value in this assertion; you're making an absurd claim about what I believe is feasible by comparing voice/video systems that have hard, lock-step, synchronized real-time constraints against the soft real-time constraints we're discussing, which are common to a lot of games (including traditional MMOs).

I'm not claiming that the things I've said are the best techniques in the world for every occasion, but you seem to be arguing that because they don't fit very specific criteria for a particular, undefined type of game, they're nearly useless. Which is throwing the baby out with the bathwater.

Frankly, I no longer think this is a constructive conversation amongst peers, and will avoid further participation due to professionalism.


Yes. Up to what task specifically?


Running a distributed physical simulation for an MMO, as stated in the initial question in this thread: "a shared persistent world running in real-time."

Oracle <-> Riak/Cassandra/Dynamo is a false equivalence, Distributed transactions in Oracle are about forwarding across data layers


While Oracle and DB2 can do that, distributed transactions across partitioned tables are explicitly about providing horizontal scale-out.

You're providing an extremely exaggerated number bordering on reductio ad absurdum


No, I am not. I have profiled this extensively. Redis supports about 1/1000th as many native operations per second over the network as a dedicated physics simulation server does. So the question is not one of "3-4 versus 10," the question is one of "1 versus 1000." And Redis still doesn't scale out correctly for the use case we discuss in this thread.

Why do I talk about Redis, and not, as you suggest, about Riak? Because I did a lot of benchmarking of Riak versus Redis versus a number of other databases, specifically to support a soft-latency online environment, and Riak came second-to-last whereas Redis did about as well as possible given the circumstances.

you're making an absurd claim as to what I believe is feasible by comparing voice/video systems that have hard lock-step synchronized real-time constraints vs the soft real-time constraints we're discussing


Text chat has no hard real-time constraints. A shared persistent world in real-time has more of those constraints. However, I also feel as if you're not reading the words that I actually write, and are tilting at some windmill that I didn't put there. That may be on me, though.

you seem to be arguing that because they don't fit very specific criteria for a particular, undefined type of game, they're nearly useless. Which is throwing the baby out with the bathwater.

Frankly, I no longer think this is a constructive conversation amongst peers, and will avoid further participation due to professionalism.


I think I've been pretty clear about what I know is needed for persistent world simulations, as defined in the beginning of this thread, and what my experience actually building these systems has been. My guess is that you have not actually built any distributed simulation system where users push buttons and characters move on the screen, but that's just an impression I get from your description; it'd be great to know for sure.

I will let the seeming ad hominem attack and storming out of the room stand for itself.

enum Bool { True, False, FileNotFound };

I'm trying to understand what problem is being discussed here.

Is it basically a situation in a game like WoW where there is some big event and half the Alliance players on an instance all decide to meet in a single shopkeeper's room in Stormwind? So you have to manage an exponentially (or at least non-linearly?) increasing number of interactions?

eg:

1 player = 0 interactions

2 players = 2 interactions

3 players = 6 interactions

4 players = 12 interactions

5 players = 20 interactions

6 players = 30 interactions

....... non-linear scaling ....

My only comment here isn't from a technical point of view but from a playability point of view. After a certain number of players, does it really matter if things aren't being perfectly computed? For example if 20,000 players are in a single bar, you wouldn't even be able to see anything at all if it was rendered "correctly". It might be a better experience for players if you just show a random say 30 or even 200 players maximum and it is what it is.

So you have to manage an exponentially


Because you asked: the growth you illustrate is known as "polynomial" (n-squared in this case) and, yes, that is the main source of the problems here.
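To restate your counts as a formula: with n actors each potentially interacting with every other actor, the number of ordered pairs is I(n) = n(n-1) ≈ n². That gives the 30 interactions for 6 players in your list, 39,800 for 200 players, and roughly 4 × 10^8 for 20,000 players in one place.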

It might be a better experience for players if you just show a random say 30 or even 200 players maximum and it is what it is.


I tried that in 2004, and it didn't lead to a good gameplay experience.

Problems we ran into include:
- Which 30 or 200 players do you include, versus excluding the rest?
- How much hysteresis do you include in the visible set, to avoid overloading the clients with constant entity visible/invisible messages? (A rough sketch of this follows below. Note also that any degree of user customization will make entity-visible events very expensive the first time, and it's the first time that matters, not the cached best case.)
- Do you need to see the same things as your team/group members?
- Are there special extra-important actors?
- What if there are actors that do affect you or your team members physically, and you're not "seeing" them?

The actual, end-user gameplay result is terrible if you get these wrong. And even if you get them right, when what you "see" is the nearest 50 players in a small circle around you plus your team mates, and the rest of the plaza/square/zone looks empty, it's not a good experience.
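To illustrate the hysteresis bullet above (a minimal sketch, not what we actually shipped; the radii and types are made up for the example): an entity enters the visible set when it comes inside an inner radius and only leaves once it passes a larger outer radius, so borderline players don't flicker in and out every tick.

```cpp
// Sketch: a per-client visible set with enter/exit hysteresis.
// Radii, entity layout, and the update loop are illustrative only.
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

struct Entity { uint32_t id; float x, y; };

class VisibleSet {
public:
    VisibleSet(float enterRadius, float exitRadius)
        : enterR2_(enterRadius * enterRadius), exitR2_(exitRadius * exitRadius) {}

    // Call once per tick with the observer's position and all candidate entities.
    void update(float ox, float oy, const std::vector<Entity>& candidates) {
        for (const Entity& e : candidates) {
            float dx = e.x - ox, dy = e.y - oy;
            float d2 = dx * dx + dy * dy;
            bool inSet = visible_.count(e.id) != 0;
            if (!inSet && d2 <= enterR2_) {
                visible_.insert(e.id);   // would send "entity appeared" to the client
            } else if (inSet && d2 >= exitR2_) {
                visible_.erase(e.id);    // would send "entity disappeared" to the client
            }
            // Between the two radii, nothing changes: that's the hysteresis.
        }
    }

    std::size_t size() const { return visible_.size(); }

private:
    float enterR2_, exitR2_;             // enterRadius < exitRadius
    std::unordered_set<uint32_t> visible_;
};

int main() {
    VisibleSet vs(30.0f, 40.0f);         // enter at 30 m, don't leave until 40 m
    std::vector<Entity> others{{1, 25, 0}, {2, 35, 0}, {3, 100, 0}};
    vs.update(0, 0, others);             // only entity 1 is close enough to appear
    std::cout << "visible: " << vs.size() << "\n";
}
```

Group members and other must-see actors would bypass the distance test entirely, which is exactly where the special cases in the list above start to bite.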

Separately, there's the question of how the servers solve this problem -- the servers have to "see" all actors, and the n-squared potential for interactions is much more urgent in a gameplay scenario than in a traditional web server scenario, so traditional "web scale" solutions are not a great match for these types of situations. (They work great for Farmville or LoL though, which were explicitly non-goals of this discussion in the original question.)
enum Bool { True, False, FileNotFound };


After a certain number of players, does it really matter if things aren't being perfectly computed? For example if 20,000 players are in a single bar, you wouldn't even be able to see anything at all if it was rendered "correctly". It might be a better experience for players if you just show a random say 30 or even 200 players maximum and it is what it is.

Be careful not to confuse client-side rendering concerns with Kylotan's server-side distributed focus.

In your situation, it's still important for the client to render the world as it exists on the server, but the client can impose some rules that simplify the rendering phase without impairing the player's ability to find their friend, trade with them, converse via say or whisper, or even inspect their gear and whatnot.

If we consider WoW in light of Kylotan's questions, several systems quickly pop up that are likely implemented as separate processes on the server side.

The first service is chat. The game client opens a separate connection to a chat service when you enter the game world. This has two immediate benefits for the server-side architecture: the world servers don't have to deal with any asynchronous operations between processes for chat, and chat is not a concern of those servers at all. The chat servers themselves now bear the brunt of logging and every other operation that revolves around chat, including resolving who receives communication sent over /yell and /say based on proximity, and distributing communication in public and private channels.
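As a toy illustration of the proximity part (assuming, purely for the sketch, that the chat service is periodically fed player positions by the world servers; all names and numbers are invented, not WoW's actual protocol):

```cpp
// Sketch: picking recipients of a /say or /yell inside a standalone chat service.
// Assumes the chat service is fed player positions; everything here is illustrative.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct PlayerPos { float x, y, z; };

std::vector<uint64_t> recipientsInRange(
        const std::unordered_map<uint64_t, PlayerPos>& positions,
        uint64_t speaker,
        float range)                         // short radius for /say, longer for /yell
{
    std::vector<uint64_t> out;
    auto it = positions.find(speaker);
    if (it == positions.end()) return out;
    const PlayerPos s = it->second;

    for (const auto& [id, p] : positions) {
        if (id == speaker) continue;
        float dx = p.x - s.x, dy = p.y - s.y, dz = p.z - s.z;
        if (dx * dx + dy * dy + dz * dz <= range * range)
            out.push_back(id);               // this player hears the message
    }
    return out;
}

int main() {
    std::unordered_map<uint64_t, PlayerPos> positions{
        {1, {0, 0, 0}}, {2, {10, 0, 0}}, {3, {500, 0, 0}}};
    auto hearers = recipientsInRange(positions, 1, 25.0f);   // a /say-like radius
    std::cout << hearers.size() << " player(s) in range\n";  // expect 1
}
```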

Another distributed service is likely something regarding player inventory. With the introduction of cross-realm services, your game object interacts with game objects from other shards in a single hosted world simulation, but your inventory is still managed by your home realm shard. Assuming we both play on the same realm and are standing in Goldshire, it's possible that that area of the world is hosted by a realm which isn't our home realm, thanks to cross-realms. Now I go to trade you a pet I recently captured in a pet battle. The realm that Goldshire is hosted on receives the trade request, forwards it to the inventory service for our home realm, which deducts the item from my inventory and adds it to yours.
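Purely for illustration, with every type and function invented for the sketch (this is the shape of the flow, not Blizzard's actual API), the trade path might look roughly like this:

```cpp
// Sketch of the flow described above: the hosting realm handles the trade,
// but each character's inventory lives on that character's home realm.
#include <cstdint>
#include <iostream>

struct TradeRequest {
    uint64_t fromCharacter;  // the player giving the pet
    uint64_t toCharacter;    // the player receiving it
    uint64_t itemId;
};

// Stand-ins for RPCs to a character's home-realm inventory service.
bool homeRealmRemoveItem(uint64_t character, uint64_t itemId) { return true; }
bool homeRealmAddItem(uint64_t character, uint64_t itemId)    { return true; }

// Runs on whichever realm currently hosts the zone (Goldshire, in the example).
bool handleTrade(const TradeRequest& t) {
    // 1. Ask the giver's home realm to deduct the item.
    if (!homeRealmRemoveItem(t.fromCharacter, t.itemId))
        return false;
    // 2. Ask the receiver's home realm to add it.
    if (!homeRealmAddItem(t.toCharacter, t.itemId)) {
        // Compensate if the second step fails, so the item isn't lost.
        homeRealmAddItem(t.fromCharacter, t.itemId);
        return false;
    }
    return true;
}

int main() {
    TradeRequest t{1001, 2002, 12345};  // made-up character and item IDs
    std::cout << (handleTrade(t) ? "trade completed\n" : "trade failed\n");
}
```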

These are two overly simplified examples, but hopefully it puts Kylotan's questions into a better perspective for you.


https://github.com/TrinityCore/TrinityCore/blob/master/src/server

Any questions? I mean, it does its job well, handling thousands of players at a time; rarely do we need more players in a single realm.

This one is relatively simple: Blizzard uses different servers, for example, for battle.net communication and for instances. I've heard a lot of stories from players jumping into a Molten Core just to get the "instance servers are full" notification.

Yes. What is it? And what has it to do with the topic?

Regarding the speed of light, it should be noted that most long-range cables are not copper but fiber-optic, which is even slower, and they're not going to get much faster. The most modern low-latency transatlantic cables have an alleged latency of 60ms (I measure between 75 and 78ms for the transatlantic part with Level3, but they might not treat me as their most important customer; it is probably possible to get QoSed better!), which puts them, assuming roughly a 5500km distance, at about 29% of the speed of light in vacuum.

It is not easy (maybe not even possible) to make the light inside those cables travel much faster, for two reasons (actually both are kind of the same reason). Those cables transmit light by total internal reflection, which implies that:

(a) the optical medium has hard constraints on its index of refraction (otherwise no reflection will happen), and the index of refraction directly affects the speed of light and

(b) the light ray bounces zig-zag inside the cable, making the distance travelled considerably longer than the cable itself.
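Just to put a number on (a): light in the fiber travels at roughly c/n, and taking n ≈ 1.47 as a typical value for silica fiber, that's about 204,000 km/s, or ~0.68c. So even a perfectly straight 5,500 km run needs on the order of 27 ms one way before the zig-zag in (b), routing, and serialization add anything on top.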

As for protocols, they do add overhead, but I think there isn't that much to strip off. Take the InfiniBand example from earlier, ignoring the fact that InfiniBand is a kind of DMA-over-wire (so maybe suitable for a cluster on a private network, but surely not for communicating over the internet).

Say that we are indeed able to completely strip off 2 ms inside your computer by using InfiniBand (or whatever). That's still very little compared to 60ms for going through Ye Bigge Wyre over Ye Altanteeque. Or compared to going from Oakland to New York, or anything the like.

Routers already operate in the considerably-below-millisecond range anyway. Right where I'm sitting at this moment, tracert tells me that it takes between 24 and 25ms for a packet to make it through ATM, and a little over 1ms total to make 4 hops to DECIX (about 18km away) and on to a datacenter 224km away. That's pretty awesome latency on behalf of the backbone (especially considering that they're routing billions of packets while I'm typing this). The signal alone needs roughly 750µs to cover that distance even at the vacuum speed of light, which puts "a little over 1ms altogether" into the "woah, fucking heroic" range for the involved routers. Not much to strip off here, really.

[...] into the "woah, fucking heroic" range for the involved routers.

+1 simply for this comment

This topic is closed to new replies.
