
MMOs and modern scaling techniques

Started by June 10, 2014 01:26 PM
65 comments, last by wodinoneeye 10 years, 4 months ago

Redis will have that issue; Riak, Cassandra, Dynamo won't.


I have two questions:

1) Have you actually tried it? I've measured those document stores pretty extensively, and I concluded they're not up to that task. Probably because they're not designed for that task. You might as well say that Oracle and DB/2 can scale here, too, because they also support distributed transactions.

2) If you get "horizontal scalability" but at the cost of 1,000x less individual efficiency, then do you think that a business based on that will actually survive? The reason it works for web is that 99.9% of the time, cross-user interaction at low latency isn't important. For games, 99.9% of the time, it is important.

Or, to put it another way: Do you think that Google Talk/Hangouts, or Microsoft Messenger, or Skype, use "web technologies" for the real-time chat parts? Do you think the brokering of an AIM message goes through Cassandra?
enum Bool { True, False, FileNotFound };

Redis will have that issue; Riak, Cassandra, Dynamo won't.


I have two questions:

1) Have you actually tried it? I've measured those document stores pretty extensively, and I concluded they're not up to that task. Probably because they're not designed for that task. You might as well say that Oracle and DB/2 can scale here, too, because they also support distributed transactions.

2) If you get "horizontal scalability" but at the cost of 1,000x less individual efficiency, then do you think that a business based on that will actually survive? The reason it works for web is that 99.9% of the time, cross-user interaction at low latency isn't important. For games, 99.9% of the time, it is important.

Or, to put it another way: Do you think that Google Talk/Hangouts, or Microsoft Messenger, or Skype, use "web technologies" for the real-time chat parts? Do you think the brokering of an AIM message goes through Cassandra?

#1. Yes. Up to what task, specifically? Speaking of Riak, since that's what I have the most experience with: assuming you have adequate provisioning, it's fast enough to read/write between server and ring within a single frame, or a few, depending on how your cluster is set up and whether you're using TCP, protocol buffers, or a custom InfiniBand driver. We're not talking concrete numbers here, and that significantly impacts the nature of the discussion we're having, so please use some if you're going to attempt to refute every claim. This discussion requires context: state what you mean by "loose physics", because it's possible given the right constraints.

Oracle <-> Riak/Cassandra/Dynamo is a false equivalence. Distributed transactions in Oracle are about forwarding across data layers; distribution across a Riak cluster is about providing fault tolerance and improving read access through denormalization. So no, I may very well not say that, because it is false.

#2. You're providing an extremely exaggerated number, bordering on reductio ad absurdum, in the context of comparing it to Redis. It's on the order of 2-3x in the average case, an order of magnitude in worse cases, without application-specific optimizations. Again, depending on the latency tolerance of a particular application, that may very well be acceptable in exchange for the ability to scale to hundreds of thousands or millions of users efficiently (at least as far as storage/retrieval goes). If your business is teetering on the brink because managing 10 servers versus 3-4 is the breaking point (or 100 versus 40, or whatever scale you choose for that ratio), then your problems as a business are elsewhere: in marketing, sales, market fit, etc. Operating costs should not determine your survivability at those ratios.

As to the last point, I'm not even trying to argue that; the question is "how much?" If your latency tolerance is above X, then these techniques can work. If it's not, then they won't, and I've never claimed otherwise. So you have to determine what that number is for your application before you can discuss techniques.
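Purely as an illustration of what "determine that number" means in practice (a minimal sketch; the 50 ms tick and the fetch call are hypothetical placeholders, not a real Riak or Redis client), the check is something like:

```cpp
// Sketch: does one datastore round-trip fit inside a simulation tick?
// fetch_player_state() stands in for a real blocking client call.
#include <chrono>
#include <iostream>

void fetch_player_state() {
    // A real implementation would do a network round-trip here.
}

int main() {
    using clock = std::chrono::steady_clock;
    const auto tick_budget = std::chrono::milliseconds(50);  // e.g. a 20 Hz tick

    auto start = clock::now();
    fetch_player_state();
    auto elapsed = clock::now() - start;

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    std::cout << "round-trip: " << us << " us of a "
              << tick_budget.count() << " ms budget\n";
    // If one round-trip eats a big slice of the tick, per-entity reads during
    // simulation are out; you batch, cache, or keep the state in-process.
}
```

If the measured round-trip is a small fraction of the tick, the approach is on the table; if it isn't, it's not, which is the whole point.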

#3. No, of course I don't. I fail to see the value in this assertion; you're making an absurd claim about what I believe is feasible by comparing voice/video systems that have hard, lock-step, synchronized real-time constraints against the soft real-time constraints we're discussing, which are common to a lot of games (including traditional MMOs).

I'm not claiming that the things I've said are the best techniques in the world for every occasion, but you seem to be arguing that because they don't fit very specific criteria for a particular, undefined type of game, they're nearly useless. Which is throwing the baby out with the bathwater.

Frankly, I no longer think this is a constructive conversation amongst peers, and will avoid further participation due to professionalism.


Yes. Up to what task specifically?


Running a distributed physical simulation for an MMO, as stated in the initial question in this thread: "a shared persistent world running in real-time."

Oracle <-> Riak/Cassandra/Dynamo is a false equivalence, Distributed transactions in Oracle are about forwarding across data layers


While Oracle and DB2 can do that, distributed transactions across partitioned tables are explicitly about providing horizontal scale-out.

You're providing an extremely exaggerated number bordering on reductio ad absurdum


No, I am not. I have profiled this extensively. Redis supports about 1/1000th as many native operations per second over the network as a dedicated physics simulation server does. So the question is not one of "3-4 versus 10," the question is one of "1 versus 1000." And Redis still doesn't scale out correctly for the use case we discuss in this thread.

Why do I talk about Redis, and not, as you suggest, about Riak? Because I did a lot of benchmarking of Riak versus Redis versus a number of other databases, specifically to support a soft-latency online environment, and Riak came second-to-last whereas Redis did about as well as possible given the circumstances.

you're making an absurd claim as to what I believe is feasible by comparing voice/video systems that have hard lock-step synchronized real-time constraints vs the soft real-time constraints we're discussing


Text chat has no hard real-time constraints. A shared persistent world in real-time has more of those constraints. However, I also feel as if you're not reading the words that I actually write, and are tilting at some windmill that I didn't put there. That may be on me, though.

you seem to be arguing that because they don't fit very specific criteria for a particular, undefined type of game, they're nearly useless. Which is throwing the baby out with the bathwater.

Frankly, I no longer think this is a constructive conversation amongst peers, and will avoid further participation due to professionalism.


I think I've been pretty clear about what I know is needed for persistent world simulations, as defined in the beginning of this thread, and what my experience actually building these systems has been. My guess is that you have not actually built any distributed simulation system where users push buttons and characters move on the screen, but that's just an impression I get from your description; it'd be great to know for sure.

I will let the seeming ad hominem attack and storming out of the room stand for itself.

enum Bool { True, False, FileNotFound };

I'm trying to understand what problem is being discussed here.

Is it basically a situation in a game like WoW where there is some big event and half the Alliance players on an instance all decide to meet in a single shopkeeper's room in Stormwind? So you have to manage an exponentially (or at least non-linearly?) increasing number of interactions?

eg:

1 player = 0 interactions

2 players = 2 interactions

3 players = 6 interactions

4 players = 12 interactions

5 players = 20 interactions

6 players = 30 interactions

....... non-linear scaling ....

My only comment here isn't from a technical point of view but from a playability point of view. After a certain number of players, does it really matter if things aren't being perfectly computed? For example if 20,000 players are in a single bar, you wouldn't even be able to see anything at all if it was rendered "correctly". It might be a better experience for players if you just show a random say 30 or even 200 players maximum and it is what it is.

So you have to manage an exponentially


Because you asked: the growth you illustrate is known as "polynomial" (n-squared in this case) and, yes, that is the main source of the problems here.
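To restate your counts as a formula: with n actors each potentially interacting with every other actor, the number of ordered pairs is I(n) = n(n-1) ≈ n². That gives the 30 interactions for 6 players in your list, 39,800 for 200 players, and roughly 4 × 10^8 for 20,000 players in one place.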

It might be a better experience for players if you just show a random say 30 or even 200 players maximum and it is what it is.


I tried that in 2004, and it didn't lead to a good gameplay experience.

Problems we ran into include:
- Which 30 or 200 players do you include, versus excluding the rest?
- How much hysteresis do you include in the visible set, to avoid overloading the clients with constant entity visible/invisible messages? (A rough sketch of this follows below. Note also that any degree of user customization will make entity-visible events very expensive the first time, and it's the first time that matters, not the cached best case.)
- Do you need to see the same things as your team/group members?
- Are there special extra-important actors?
- What if there are actors that do affect you or your team members physically, and you're not "seeing" them?

The actual, end-user gameplay result is terrible if you get these wrong. And even if you get them right, when what you "see" is the nearest 50 players in a small circle around you plus your team mates, and the rest of the plaza/square/zone looks empty, it's not a good experience.
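To illustrate the hysteresis bullet above (a minimal sketch, not what we actually shipped; the radii and types are made up for the example): an entity enters the visible set when it comes inside an inner radius and only leaves once it passes a larger outer radius, so borderline players don't flicker in and out every tick.

```cpp
// Sketch: a per-client visible set with enter/exit hysteresis.
// Radii, entity layout, and the update loop are illustrative only.
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

struct Entity { uint32_t id; float x, y; };

class VisibleSet {
public:
    VisibleSet(float enterRadius, float exitRadius)
        : enterR2_(enterRadius * enterRadius), exitR2_(exitRadius * exitRadius) {}

    // Call once per tick with the observer's position and all candidate entities.
    void update(float ox, float oy, const std::vector<Entity>& candidates) {
        for (const Entity& e : candidates) {
            float dx = e.x - ox, dy = e.y - oy;
            float d2 = dx * dx + dy * dy;
            bool inSet = visible_.count(e.id) != 0;
            if (!inSet && d2 <= enterR2_) {
                visible_.insert(e.id);   // would send "entity appeared" to the client
            } else if (inSet && d2 >= exitR2_) {
                visible_.erase(e.id);    // would send "entity disappeared" to the client
            }
            // Between the two radii, nothing changes: that's the hysteresis.
        }
    }

    std::size_t size() const { return visible_.size(); }

private:
    float enterR2_, exitR2_;             // enterRadius < exitRadius
    std::unordered_set<uint32_t> visible_;
};

int main() {
    VisibleSet vs(30.0f, 40.0f);         // enter at 30 m, don't leave until 40 m
    std::vector<Entity> others{{1, 25, 0}, {2, 35, 0}, {3, 100, 0}};
    vs.update(0, 0, others);             // only entity 1 is close enough to appear
    std::cout << "visible: " << vs.size() << "\n";
}
```

Group members and other must-see actors would bypass the distance test entirely, which is exactly where the special cases in the list above start to bite.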

Separately, there's the question of how the servers solve this problem -- the servers have to "see" all actors, and the n-squared potential for interactions is much more urgent in a gameplay scenario than in a traditional web server scenario, so traditional "web scale" solutions are not a great match for these types of situations. (They work great for Farmville or LoL though, which were explicitly non-goals of this discussion in the original question.)
enum Bool { True, False, FileNotFound };


After a certain number of players, does it really matter if things aren't being perfectly computed? For example if 20,000 players are in a single bar, you wouldn't even be able to see anything at all if it was rendered "correctly". It might be a better experience for players if you just show a random say 30 or even 200 players maximum and it is what it is.

Be careful not to confuse client-side rendering concerns with Kylotan's server-side distributed focus.

In your situation, it's still important for the client to render the world as it exists on the server, but the client can impose some rules that simplify the rendering phase without impairing the player's ability to find their friend, trade with them, converse via say or whisper, or even inspect their gear and whatnot.

If we consider WoW in light of Kylotan's questions, several systems quickly pop up that are likely implemented as separate processes on the server side.

The first service is chat. The game client opens a separate connection to a chat service when you enter the game world. This has two immediate benefits for the server-side architecture: the world servers don't have to deal with any asynchronous operations between processes for chat, and chat is not a concern of those servers at all. The chat servers themselves now bear the brunt of logging and every other operation that revolves around chat, including resolving who receives communication sent over /yell and /say based on proximity, and distributing communication in public and private channels.
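As a toy illustration of the proximity part (assuming, purely for the sketch, that the chat service is periodically fed player positions by the world servers; all names and numbers are invented, not WoW's actual protocol):

```cpp
// Sketch: picking recipients of a /say or /yell inside a standalone chat service.
// Assumes the chat service is fed player positions; everything here is illustrative.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct PlayerPos { float x, y, z; };

std::vector<uint64_t> recipientsInRange(
        const std::unordered_map<uint64_t, PlayerPos>& positions,
        uint64_t speaker,
        float range)                         // short radius for /say, longer for /yell
{
    std::vector<uint64_t> out;
    auto it = positions.find(speaker);
    if (it == positions.end()) return out;
    const PlayerPos s = it->second;

    for (const auto& [id, p] : positions) {
        if (id == speaker) continue;
        float dx = p.x - s.x, dy = p.y - s.y, dz = p.z - s.z;
        if (dx * dx + dy * dy + dz * dz <= range * range)
            out.push_back(id);               // this player hears the message
    }
    return out;
}

int main() {
    std::unordered_map<uint64_t, PlayerPos> positions{
        {1, {0, 0, 0}}, {2, {10, 0, 0}}, {3, {500, 0, 0}}};
    auto hearers = recipientsInRange(positions, 1, 25.0f);   // a /say-like radius
    std::cout << hearers.size() << " player(s) in range\n";  // expect 1
}
```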

Another distributed service is likely something regarding player inventory. With the introduction of cross-realm services, your game object interacts with game objects from other shards in a single hosted world simulation, but your inventory is still managed by your home realm shard. Assuming we both play on the same realm and are standing in Goldshire, it's possible that that area of the world is hosted by a realm which isn't our home realm, thanks to cross-realms. Now I go to trade you a pet I recently captured in a pet battle. The realm that Goldshire is hosted on receives the trade request, forwards it to the inventory service for our home realm, which deducts the item from my inventory and adds it to yours.
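Purely for illustration, with every type and function invented for the sketch (this is the shape of the flow, not Blizzard's actual API), the trade path might look roughly like this:

```cpp
// Sketch of the flow described above: the hosting realm handles the trade,
// but each character's inventory lives on that character's home realm.
#include <cstdint>
#include <iostream>

struct TradeRequest {
    uint64_t fromCharacter;  // the player giving the pet
    uint64_t toCharacter;    // the player receiving it
    uint64_t itemId;
};

// Stand-ins for RPCs to a character's home-realm inventory service.
bool homeRealmRemoveItem(uint64_t character, uint64_t itemId) { return true; }
bool homeRealmAddItem(uint64_t character, uint64_t itemId)    { return true; }

// Runs on whichever realm currently hosts the zone (Goldshire, in the example).
bool handleTrade(const TradeRequest& t) {
    // 1. Ask the giver's home realm to deduct the item.
    if (!homeRealmRemoveItem(t.fromCharacter, t.itemId))
        return false;
    // 2. Ask the receiver's home realm to add it.
    if (!homeRealmAddItem(t.toCharacter, t.itemId)) {
        // Compensate if the second step fails, so the item isn't lost.
        homeRealmAddItem(t.fromCharacter, t.itemId);
        return false;
    }
    return true;
}

int main() {
    TradeRequest t{1001, 2002, 12345};  // made-up character and item IDs
    std::cout << (handleTrade(t) ? "trade completed\n" : "trade failed\n");
}
```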

These are two overly simplified examples, but hopefully it puts Kylotan's questions into a better perspective for you.


https://github.com/TrinityCore/TrinityCore/blob/master/src/server

Any questions? I mean, it does its job well, handling thousands of players at a time; rarely do we need more players in a single realm.

This one is relatively simple: Blizzard uses different servers, for example, for battle.net communication and for instances. I've heard a lot of stories from players jumping into a Molten Core just to get the "instance servers are full" notification.

Yes. What is it? And what has it to do with the topic?

Regarding the speed of light, it should be noted that most long-range cables are not copper but fiber-optic, which is even slower, and they're not going to get much faster. The most modern low-latency transatlantic cables have an alleged latency of 60ms (I measure between 75 and 78ms for the transatlantic part with Level3, but they might not treat me as their most important customer; it is probably possible to get QoSed better!), which puts them, assuming roughly a 5500km distance, at about 29% of the speed of light in vacuum.

It is not easy (maybe not even possible) to make the light inside those cables travel much faster, for two reasons (actually both are kind of the same reason). Those cables transmit light by total internal reflection, which implies that:

(a) the optical medium has hard constraints on its index of refraction (otherwise no reflection will happen), and the index of refraction directly affects the speed of light and

(b) the light ray bounces zig-zag inside the cable, making the distance travelled considerably longer than the cable itself.
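Just to put a number on (a): light in the fiber travels at roughly c/n, and taking n ≈ 1.47 as a typical value for silica fiber, that's about 204,000 km/s, or ~0.68c. So even a perfectly straight 5,500 km run needs on the order of 27 ms one way before the zig-zag in (b), routing, and serialization add anything on top.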

As for protocols, they do add overhead, but I think there isn't that much to strip off. Take the InfiniBand example from earlier, ignoring the fact that InfiniBand is a kind of DMA-over-wire (so maybe suitable for a cluster on a private network, but surely not for communicating over the internet).

Say that we are indeed able to completely strip off 2 ms inside your computer by using InfiniBand (or whatever). That's still very little compared to 60ms for going through Ye Bigge Wyre over Ye Altanteeque. Or compared to going from Oakland to New York, or anything the like.

Routers already operate in the considerably-below-millisecond range anyway. Right where I'm sitting at this moment, tracert tells me that it takes between 24 and 25ms for a packet to make it through ATM, and a little over 1ms total to make 4 hops to DECIX (about 18km away) and on to a datacenter 224km away. That's pretty awesome latency on behalf of the backbone (especially considering that they're routing billions of packets while I'm typing this). The signal alone needs roughly 750µs to cover that distance even at the vacuum speed of light, which puts "a little over 1ms altogether" into the "woah, fucking heroic" range for the involved routers. Not much to strip off here, really.

[...] into the "woah, fucking heroic" range for the involved routers.

+1 simply for this comment

This topic is closed to new replies.
