I'm not experienced with MMOs at all; however, I am experienced with scalability, games and networking, and I have a couple of comments.
In recent discussions with web and app developers one thing has become quite clear to me - the way they tend to approach scalability these days is somewhat different to how game developers do it. They are generally using a purer form of horizontal scaling - fire up a bunch of processes, each mostly isolated, communicating occasionally via message passing or via a database. This plays nicely with new technologies such as Amazon EC2, and is capable of handling 'web-scale' amounts of traffic - e.g. clients numbering in the tens or hundreds of thousands - without problem. And because the processes only communicate asynchronously, you might start up 8 separate processes on an 8-core server to make best use of the hardware.
I wouldn't put so much trust in "how web developers approach scalability".
Not too long ago, we had the C10K problem. Browsing the net, the recommendation was "just use fork, it scales very well". It turns out fork has an initialization overhead (which, like you said, you work around by preallocating). Then the advice became "use one socket per thread, super scalable! The Linux kernel does the magic for you", and select and poll were the recommended tools. Then someone dug through the Linux kernel source and found that the kernel walks linearly through the list of sockets to figure out which process/thread the data needs to be delivered to. Some of the algorithms involved had O(N^2) complexity.
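To make that concrete, here's a minimal sketch of my own (not code from this discussion) of the classic select() loop. Even before the kernel's internal scan, userspace walks every client socket twice per iteration, whether or not anything happened on it; handle_client() is just a hypothetical stub.

```c
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical stub: just drain whatever the client sent. */
static void handle_client(int fd)
{
    char buf[512];
    read(fd, buf, sizeof buf);
}

static void serve_select(int *client_fds, int n_clients)
{
    for (;;)
    {
        fd_set readfds;
        FD_ZERO(&readfds);
        int maxfd = 0;

        /* O(N) walk just to rebuild the fd set every single iteration
         * (and fd_set tops out at FD_SETSIZE anyway)... */
        for (int i = 0; i < n_clients; ++i)
        {
            FD_SET(client_fds[i], &readfds);
            if (client_fds[i] > maxfd)
                maxfd = client_fds[i];
        }

        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
            break;

        /* ...and another O(N) walk to find the ready sockets, even if
         * only one client out of 10,000 actually sent something. */
        for (int i = 0; i < n_clients; ++i)
            if (FD_ISSET(client_fds[i], &readfds))
                handle_client(client_fds[i]);
    }
}
```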
Someone fixed this, and then we got epoll.
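The epoll version, again just a rough sketch of mine: each socket is registered with the kernel once, and epoll_wait() hands back only the descriptors that are actually ready, so the per-iteration cost follows the number of active connections rather than the total number of connections.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stub, same as before: drain whatever the client sent. */
static void handle_client(int fd)
{
    char buf[512];
    read(fd, buf, sizeof buf);
}

static void serve_epoll(int listen_fd)
{
    int epfd = epoll_create1(0);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event ready[64];
    for (;;)
    {
        /* Blocks until something happens; returns only the ready fds. */
        int n = epoll_wait(epfd, ready, 64, -1);
        for (int i = 0; i < n; ++i)
        {
            if (ready[i].data.fd == listen_fd)
            {
                int client = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev); /* registered once */
            }
            else
            {
                handle_client(ready[i].data.fd);
            }
        }
    }
}
```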
But the problem still remains that we have one socket per TCP connection, and that sucks hard. A C10M-problem article points out these issues and names the driver stack as the biggest bottleneck.
The first time I dug deeply into scalable networking for a client project, I ran into this bizarre architectural flaw, and nobody talked about it, as if there were no problem at all. Then that C10M article appeared, and I was relieved that finally someone was pointing out the same problems I saw.
So, no. I don't trust the majority of web developers to do highly scalable web development. Most of the time they just get lucky that their servers never see enough stress to be (D)DoSed. But my gut feeling is that if they were better at that job, they could handle the same server load with a far smaller server-farm budget.
Sure, at a very high level, with distributed servers like Amazon EC2, these paradigms work. But beware that a user waiting 5 seconds for the search results for their long-lost friend on Facebook is acceptable (*). A game with 5 seconds of lag for casting a spell is not.
Half an hour of delay until my APK gets propagated across Google Play Store servers is acceptable and reasonable. Half an hour of delay until my newly created character gets propagated so I can start playing is not.
(*) Many giants (e.g. Google, Amazon) are actively working on solutions, as Amazon (or was it Apple?) found that a few milliseconds' improvement in page load time correlated with higher sales.
Web content has a much higher consumption rate than production rate. Games have the annoying property that they have both frequent read and write access to everything (you can mitigate this by isolating state, but there's a limit).
Frequent write access hinders task division, which is necessary for scaling across cores/machines/people/whatever.
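To illustrate what I mean by isolating (purely my own sketch with made-up names, not anything from a real engine): split the world into zones, let each zone's thread be the only writer of that zone's entities, and turn every cross-zone effect into a message drained at a well-defined point in the tick. The frequent writes stay local and uncontended, and the limit shows up as soon as interactions have to cross a zone boundary.

```c
#include <pthread.h>
#include <string.h>

#define MAX_ENTITIES 1024
#define MAX_MSGS     256

/* A cross-zone effect (e.g. "deal damage to entity Y over there") becomes
 * data instead of a direct write into another zone's memory. */
typedef struct { int target_entity; int damage; } ZoneMsg;

typedef struct
{
    /* Owned exclusively by this zone's thread: read and written without locks. */
    int hp[MAX_ENTITIES];
    int entity_count;

    /* The only shared state: a small inbox that other zones append to. */
    pthread_mutex_t inbox_lock;
    ZoneMsg inbox[MAX_MSGS];
    int inbox_count;
} Zone;

/* Called by *other* zones instead of touching our entities directly. */
static void zone_post(Zone *z, ZoneMsg msg)
{
    pthread_mutex_lock(&z->inbox_lock);
    if (z->inbox_count < MAX_MSGS)
        z->inbox[z->inbox_count++] = msg;
    pthread_mutex_unlock(&z->inbox_lock);
}

/* Runs on the zone's own thread/core, once per tick. */
static void zone_update(Zone *z)
{
    /* Drain cross-zone messages at one well-defined point... */
    pthread_mutex_lock(&z->inbox_lock);
    ZoneMsg pending[MAX_MSGS];
    int n = z->inbox_count;
    memcpy(pending, z->inbox, n * sizeof(ZoneMsg));
    z->inbox_count = 0;
    pthread_mutex_unlock(&z->inbox_lock);

    for (int i = 0; i < n; ++i)
        z->hp[pending[i].target_entity] -= pending[i].damage;

    /* ...then the bulk of the tick writes local entities with zero contention. */
    for (int i = 0; i < z->entity_count; ++i)
        if (z->hp[i] < 0)
            z->hp[i] = 0;
}
```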