Advertisement

MMOs and modern scaling techniques

Started by June 10, 2014 01:26 PM
65 comments, last by wodinoneeye 10 years, 4 months ago

Distributed archtiectures are the right tool for the job. What I've seen is that the game industry is just not where most of the innovation with server side architectures is happening, so they are a bit behind.


You are absolutely correct that game developers often don't even know that they don't know how to do business back ends effectively. Game technologies in general, and real-time simulations in particular, are like race cars running highly tuned engines on a highly controlled race track. Business systems are like trucks carrying freight over a large network of roads. Game developers, too often, try to carry freight on race cars.

Race cars are great at what they do, though. That's the bit that is fundamentally different from the kind of business object, web architectures that you are talking about. (And I extend that to not-just HTTP; things like Storm or JMS or Scalding also fits in that mold.) When you say that "the optimization of a single server doesn't matter," it tells me that you've never tried to run a real physics simulation of thousands of entities, that can all interact, in real time, at high frame rates. There are entire classes of game play and game feel that are available now, that were not available ten years ago, explicitly because you can cram more processing power and higher memory throughput into a single box. A network is, by comparison, a high-latency, low-bandwidth channel. If you're not interested in those kinds of games, then you wouldn't need to know about this, but the challenge we discuss here is games where this matters just as much as the ability to provide consistent long-term data services to lots of users.

I see, equally often, business-level engineers, and web engineers, who are very good at what they do, failing to realize the demands of game-type realtime interactive simulation. (Or, similarly, multimedia like audio -- check out the web audio spec for a real train wreck some time.)

A really successful, large scale company, has both truckers and race car drivers, and get them to talk to each other and understand each other's unique challenges.
enum Bool { True, False, FileNotFound };

This is basically the discussion that led me to post here - two sides both basically saying "I've done it successfully, and you can't really do it the way that the other people do it". Obviously this can't be entirely true. smile.png I suspect there is more to it.

Let me ask some more concrete questions.


No shared state, messages are immutable, and there is no reliability at the message or networking level.

Ok - but there is so much in game development that I can't imagine trying to code in this way. Say a player wants to buy an item from a store. The shared state model works like this:


BuyItem(player, store, itemID):
    Check player has enough funds, abort if not
    Check itemID exists in store inventory, abort if not
    Deduct funds from player
    Add funds to store
    Add instance of itemID to player inventory
    Commit player funds and inventory to DB
    Notify player client and any observing clients of purchase

This is literally a 7-line function if you have decent routines already set up. Let's say you have to do it in a message-passing way, where the store and player are potentially in different processes. What I see - assuming you have coroutines or some other syntactic sugar to make this look reasonably sequential rather than callback hell - is something like this:


BuyItem(player, store, itemID):
    Check player has enough funds, abort if not
    Ask store if it has itemID in store inventory. Wait for response.
    If store replied that it did not have the item in inventory:
        abort.
    Check player STILL has enough funds, abort if not
    Deduct funds from player
    Tell store to add funds. Wait for response.
    If store replied that it did not have the item in inventory:
        add funds to player
        abort
    Add instance of itemID to player inventory
    Commit player funds to DB
    Notify player client and any observing clients of purchase

This is at least 30% longer (not counting any extra code for the store) and has to include various extra error-checks, which are going to make things error-prone. I suspect it gets even more complex when you try and trade items in both directions because you need both sides to be able to place items in escrow before the exchange (whereas here, it was just the money).

So... is there an easier or safer way I could have written this? I wouldn't even attempt this in C++ - without coroutines it would be hard to maintain the state through the routine. I suppose some system that allows me to perform a rollback of an actor would simplify the error-handling but there are still more potential error cases than if you had access to both sides of the trade and could perform it atomically.

You talk about "using an actor with a FSM", but I can't imagine having to write an FSM for each of the states in the above interaction. Again, comparing that to a 7-line function, it's hard to justify in programmer time, even if it undoubtedly scales further. I appreciate something like Akka simplifies both the message-passing and the state machine aspects, so there is that - but it's still a fair bit of extra complexity, right? (Writing a message handler for each state, swapping the message handler each time, stashing messages for other states while you do so, etc.)

Maybe you can generalise a bit - eg. make all your buying/selling/giving/stealing into one single 'trade' operation? Then at least you're not writing unique code in each case.

As for "writing the code so all messages are idempotent" - is that truly practical? I mean, beyond the trivial but worthless case of attaching a unique ID to every message and checking that the message hasn't been already executed, of course. For example, take the trading code above - if one actor wants to send 10 gold pieces to another, how do you handle that in an idempotent way? You can't send "add 10 gold" because that will give you 20 if the message arrives twice. You can't send "set gold to 50" because you didn't know the actor had 40 gold in the first place.

Perhaps that is not the sort of operation you want to make idempotent, and instead have the persistent store treat it as a transaction. Fair enough, and the latency wouldn't matter if you only do this for things that don't occur hundreds of times per second and if your language makes it practical. (But maybe there aren't all that many such routines? The most common one is movement, and that is easily handled in an idempotent way, certainly.)

Forgive my ignorance if there is a simple and well-known answer to this problem; it's been a while since I examined distributed systems on an academic level.

Advertisement
There's a lot of confusion and mistruth surrounding the way successful MMOs tend to be implemented. This can be attributed to any number of causes, but I think the bottom line is twofold:

1. Successful MMO developers know a lot more about distributed scale than the "hrrr drrr web-scale" crowd tends to realize.
2. Successful MMO developers rarely divulge all the secrets to their success. This feeds into Point 1.


These are hard problems, no doubt. It takes truly excellent engineering to solve them. People who claim to have found solutions but can't point to shipped and operational code are almost certainly in for a rude surprise when they actually try and put millions of players on their systems.

Scale is a very nonlinear thing. A lot of people intuit scale as being roughly linear... a hundred people is twice as many as fifty people, right? The fact is, you can get truly bizarre behavior with high-scalability systems. You might go from thousands of connections and 2% CPU usage and 90% free RAM to all of your hardware deadlocked and overheating by just adding a few dozen connections. You might get things that run faster by some metrics when loaded up with players. And so on.

Scalable distributed software is like quantum physics. If you think you can intuit what's going on, you're fucking delusional, period. This stuff is notoriously difficult and messy.


To (sort of, not really) answer the concrete questions that were posed earlier...

Yes, you need to make virtually all messages and transactions idempotent. The exceptions are fairly boring cruft, like pings and keep-alives and whatnot. But be careful if you assume that a ping/keep-alive isn't going to become a potential source of performance pain.

And no, you don't need to write a huge amount of extra code. Much of it can be abstracted and factored into reusable components fairly easily. You only need to write three-phase commit once, and then apply that routine to your transactions going forward; and so on. If you design carefully, you can eliminate a lot of the redundant cruft, and most of your code reads like a DSL that speaks in terms of messages and transactions.

That said, it does take writing it the hard way once or twice to learn where you can refactor effectively; I don't know of any shortcuts to that.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

This is basically the discussion that led me to post here - two sides both basically saying "I've done it successfully, and you can't really do it the way that the other people do it". Obviously this can't be entirely true. smile.png I suspect there is more to it.

Let me ask some more concrete questions.


No shared state, messages are immutable, and there is no reliability at the message or networking level.

Ok - but there is so much in game development that I can't imagine trying to code in this way. Say a player wants to buy an item from a store. The shared state model works like this:


BuyItem(player, store, itemID):
    Check player has enough funds, abort if not
    Check itemID exists in store inventory, abort if not
    Deduct funds from player
    Add funds to store
    Add instance of itemID to player inventory
    Commit player funds and inventory to DB
    Notify player client and any observing clients of purchase

This is literally a 7-line function if you have decent routines already set up. Let's say you have to do it in a message-passing way, where the store and player are potentially in different processes. What I see - assuming you have coroutines or some other syntactic sugar to make this look reasonably sequential rather than callback hell - is something like this:


BuyItem(player, store, itemID):
    Check player has enough funds, abort if not
    Ask store if it has itemID in store inventory. Wait for response.
    If store replied that it did not have the item in inventory:
        abort.
    Check player STILL has enough funds, abort if not
    Deduct funds from player
    Tell store to add funds. Wait for response.
    If store replied that it did not have the item in inventory:
        add funds to player
        abort
    Add instance of itemID to player inventory
    Commit player funds to DB
    Notify player client and any observing clients of purchase

This is at least 30% longer (not counting any extra code for the store) and has to include various extra error-checks, which are going to make things error-prone. I suspect it gets even more complex when you try and trade items in both directions because you need both sides to be able to place items in escrow before the exchange (whereas here, it was just the money).

So... is there an easier or safer way I could have written this? I wouldn't even attempt this in C++ - without coroutines it would be hard to maintain the state through the routine. I suppose some system that allows me to perform a rollback of an actor would simplify the error-handling but there are still more potential error cases than if you had access to both sides of the trade and could perform it atomically.

You talk about "using an actor with a FSM", but I can't imagine having to write an FSM for each of the states in the above interaction. Again, comparing that to a 7-line function, it's hard to justify in programmer time, even if it undoubtedly scales further. I appreciate something like Akka simplifies both the message-passing and the state machine aspects, so there is that - but it's still a fair bit of extra complexity, right? (Writing a message handler for each state, swapping the message handler each time, stashing messages for other states while you do so, etc.)

Maybe you can generalise a bit - eg. make all your buying/selling/giving/stealing into one single 'trade' operation? Then at least you're not writing unique code in each case.

As for "writing the code so all messages are idempotent" - is that truly practical? I mean, beyond the trivial but worthless case of attaching a unique ID to every message and checking that the message hasn't been already executed, of course. For example, take the trading code above - if one actor wants to send 10 gold pieces to another, how do you handle that in an idempotent way? You can't send "add 10 gold" because that will give you 20 if the message arrives twice. You can't send "set gold to 50" because you didn't know the actor had 40 gold in the first place.

Perhaps that is not the sort of operation you want to make idempotent, and instead have the persistent store treat it as a transaction. Fair enough, and the latency wouldn't matter if you only do this for things that don't occur hundreds of times per second and if your language makes it practical. (But maybe there aren't all that many such routines? The most common one is movement, and that is easily handled in an idempotent way, certainly.)

Forgive my ignorance if there is a simple and well-known answer to this problem; it's been a while since I examined distributed systems on an academic level.

Handling something like a transaction is really not that different in a distributed system. All network applications deal with unreliable messaging, reliability and sequencing has to be added in somewhere. Modern approaches put it at the layer that defined the need in the first place, as opposed to putting it into a subsystem and relying on it for higher level needs, which is just a leaky abstraction and an accident waiting to happen.

If a client wants to send 10 gold to someone and sends a request to do that, the client has no way of knowing if the request was processed correctly without an ack. But the ack can be lost, so the situation where you might have to resend the same request is present in all networked applications.

Blocking vs non blocking is mostly an issue that comes up at scale. For things that don't happen 100,000 times per second, you don't need to ensure that stuff doesn't block.

As for FSM's, I use them a lot but mostly because I can do them in ruby, and ruby DSL's I find easy to read and maintain. That Akka state machine stuff I don't like, never used it. Seems a lot more awkward then it needs to be. Things like purchasing items is something that's not in the hot path, I can afford to use more expressive languages for stuff like that.


1. Successful MMO developers know a lot more about distributed scale than the "hrrr drrr web-scale" crowd tends to realize.
2. Successful MMO developers rarely divulge all the secrets to their success. This feeds into Point 1.

And yet, pretty much every published example of MMO scaling seems to focus on the old-school methods. You'd think that, given how much has been said on the matter, that there would be at least one instance of people talking about using different methods, but I've not seen one. I was hoping someone on this thread would be able to point me in the right direction. Instead, I'm in much the same position as I was before I posted - people insist that newer methods are being used, but provide no citations. :)


Yes, you need to make virtually all messages and transactions idempotent.

I'd love to see an example of how to do this, given that many operations are most naturally expressed as changes relative to a previous state (which may not be known). I assume there is literature on this but I can't find it.


You only need to write three-phase commit once, and then apply that routine to your transactions going forward; and so on.

I suspected this might be the case but I am sceptical about the overhead in both latency and complexity. But then I don't have any firm evidence for either. :)


If a client wants to send 10 gold to someone and sends a request to do that, the client has no way of knowing if the request was processed correctly without an ack. But the ack can be lost, so the situation where you might have to resend the same request is present in all networked applications.

I think we're crossing wires a bit here. Reliable messaging is a trivial problem to solve (TCP, or a layer over UDP), and thus it is easy to know that either (a) the request was processed correctly, or will be at some time in the very near future, or (b) the other process has terminated, and thus all bets are off. It's not clear why you need application-level re-transmission. But even that's assuming a multiple-server approach - in a single game server approach, this issue never arises at all - there is a variable in memory that contains the current quantity of gold and you just increment that variable with a 100% success rate. Multiple objects? No problem - just modify each of them before your routine is done.

What you're saying, is that you willingly forgo those simple guarantees in order to pursue a different approach, one which scales to higher throughput better. That's fine, but these are new problems, unique to that way of working, not intrinsic to the 'business logic' at all. With 2 objects co-located in one process you get atomicity, consistency, and isolation for free, and delegate durability to your DB as a high-latency background task.

Advertisement


If a client wants to send 10 gold to someone and sends a request to do that, the client has no way of knowing if the request was processed correctly without an ack. But the ack can be lost, so the situation where you might have to resend the same request is present in all networked applications.

I think we're crossing wires a bit here. Reliable messaging is a trivial problem to solve (TCP, or a layer over UDP), and thus it is easy to know that either (a) the request was processed correctly, or will be at some time in the very near future, or (b) the other process has terminated, and thus all bets are off. It's not clear why you need application-level re-transmission. But even that's assuming a multiple-server approach - in a single game server approach, this issue never arises at all - there is a variable in memory that contains the current quantity of gold and you just increment that variable with a 100% success rate. Multiple objects? No problem - just modify each of them before your routine is done.

What you're saying, is that you willingly forgo those simple guarantees in order to pursue a different approach, one which scales to higher throughput better. That's fine, but these are new problems, unique to that way of working, not intrinsic to the 'business logic' at all. With 2 objects co-located in one process you get atomicity, consistency, and isolation for free, and delegate durability to your DB as a high-latency background task.

So this is an interesting topic actually. The trend is to move reliability back up to the layer that defined the need in the first place, instead of relying on a subsystem to provide it.

Just because the network layer guarantees the packets arrive doesn't mean they get delivered to the business logic correctly, or processed correctly. If you think 'reliable' udp or tpc makes your system reliable, you are lying to yourself.

http://www.infoq.com/articles/no-reliable-messaging

http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.txt

http://doc.akka.io/docs/akka/2.3.3/general/message-delivery-reliability.html

First: Try writing a robust real-time physics engine that can support even a small number like 200 players with vehicles on top of the web architecture that @snacktime and @VFe describe. I don't think that's the right tool for the job.

Second:

You'd think that, given how much has been said on the matter, that there would be at least one instance of people talking about using different methods, but I've not seen one.


Personally, I've actually talked a lot about this in this very forum for the last ten-fifteen years. For reference, the first one I worked on was There.com, which was a full-on single-instance physically-based virtual world. It supported full client- and server-side physics rewind; a procedural-and-customized plane the size of Earth; fully customizable/composable avatars with user-generated content commerce; voice chat in world; vehicles ridable by multiple players, and a lot of other things; timeframe about 2001. The second one (where I still work) is IMVU.com where we eschew physics in the current "room based" experience because it's so messy. IMVU.com is written almost entirely on top of web architecture for all the transactional stuff, and on top of a custom low-latency ephemeral message queue (written in Erlang) for the real-time stuff. Most of that's sporadically documented in the engineering blog: http://engineering.imvu.com/
enum Bool { True, False, FileNotFound };

If you have GDC Vault access, look up a talk by Pat Wyatt from GDC 2013 I think it was... maybe 2012.

If you don't have GDC Vault access, what's wrong with you?! :-P

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

cool thread, though i feel to much incompetent to say something here,

can someone say to me what pings (ping times) has to do it with this?

Is the ping times the source of most troubles or problem is in something different? Is there some perspective that ping times will drop noticably in the

www in the future? can someone say how the range of them is in todays world? How responsible is usually an average connection between a player and a game serwer i mean time to player --?--> serwer , then processing and time to read other side data?

This topic is closed to new replies.

Advertisement