Distributed server architecture for load balancing

_winterdyne_ · 2005-10-27T12:47:30

Okay, seeing as my previous architecture thread didn't seem to draw a lot of flak, I guess most of it was at least semi-rational. I hope. However, I'm trying to puzzle a way around congestion based lag, and I'd like to ramble about it where it can get critique. Pick it apart as best you can please. Apologies if this seems long and drawn out, it is. But I hope you'll find it a relatively interesting read. I mentioned a relevance layer previously, and since what I'm talking about relies on the concept, I'll describe it here, along with a brief description of the architecture. Architecture: A game instance is run on one or more machines (boxes) and consists of a three kinds of processes - a game database server (pretty standard), a 'master' (world) server and a number of generic 'slave' (zone) servers, which operate in a heirarchy. There need only be one server process running on any given box, but there can only be one 'master' within a cluster (sometimes referred to as a microcluster). Edit: Connected clients are handled through a connection object which migrates between server processes as clients move around. The overall structure of the system looks sort of like this: Relevance Graph: The game universe in the system is not broken up into rectangular 2d zones, but instead is organised more like a heirarchical (sp?) scenegraph, with network-aware areas of variable size. This is referred to as the 'relevance graph'. Nodes in this graph can be thought of as 'points of relevance', and coincide with physical features, such as rooms, areas of terrain, etc. Actors and other objects in the game universe a 'relevance limpets' and are *always* attached to a point of relevance. The relevance graph is used to determine event relevance within the game system, both at the network and simulation layers. Anything that happens in a point of relevance is 'most likely' going to be important to everything contained in it. Neighbouring points may only receive events of a certain type (a soundproof, sealed glass box might only receive visually-oriented messages). The graph allows a designer to set what spreads how at any point in the world. Events also have a given radius of effect which is checked with the relevance graph as well. Neighbourhood is not used here to imply physcial adjacency, just that events of certain types occuring within (or passing through) one point of relevance may affect another. Absence of neighbourhood implies that points should have NO bearing on each other. Shortcuts are used around this system in some places, where a specific entity is the target of an event. The master server maintains a list of where all entities are, so a message can be efficiently delivered without propogating along the heirarchy as normal. A server process is given a node of the relevance graph to deal with, and it deals with that and all points leafward (generally 'contained by') of that point, unless another server takes control. Slave servers are also informed of a master, which will manage the addition of slaves. Slaves are informed of neighbour or child relations with other slaves, so direct communication amongst themselves is possible. All of the points of relevance that a server process has control over are referred to as its 'domain'. A domain, functionally, is close to a traditional mmorpg zone, whereas the point of relevance is more conceptually similar. Update process: Updating the relevance graph occurs in three phases, all of which occur in one server 'tick' in the server update threads. The threads on each of the servers in a microcluster are brought into sync during this process. Firstly, a logical update runs, which fires the events from their generators and adds them to any relevant recipients. High priority UDP is used to transfer these across server boundaries if required. This process must complete before the second phase can begin. This phase is started by the master server and triggers the process in slaves through a cascade effect. The second phase of the update is the handling of receivers that have had an event passed to them. This may include the addition of messages to those receivers' outbound queues, including appropriate network messages to an avatar's associated client. This process occurs simultaneously on all server processes in the microcluster, and is synchronised (started, flagged finished) by the master server. Multi processor servers may utilise worker threads to process more than one part of the heirarchy simultaneously. The final phase is network transmission (incoming messages are handled by a separate thread) to clients - this is done on each server in batches, a POR at a time. Firstly out of date or obsolete empirical state updates are discarded from the outbound queues, then those queues are processed and sent. This process has an overall timeout value, and unsent messages are preserved for the next cycle (and are sent in order/priority). Timeout occurences indicate congestion in the area and mark the process as congested. Timing and data stats are kept for data transmission and time to transmit at each POR and can be used to determine where the congestion is actually occurring. UNDER time exits (early finish) for this process is also recorded by the server process. Stats are transferred up to the master server regularly for congestion detection and handling. Illustrative example (edit): Consider the diagram below: Each box represents a node in the relevance graph. Here we have a simple game world where the two areas, Dark Forest and Dwarf Mountain are segregated. Both are 'adventuring areas' and it's assumed the player base will usually be evenly split between these two areas. As such, Dwarf Mountain has been assigned to a slave server. The heirarchies in each domain exhibit internal neighbourhood - Dark_forest_main, the POR that models the bulk of the Dark Forest is designated as a neighbour of the Dark_forest_clearing. Any message generated in either of these can have an effect (subject to range, etc) on the other. Cross-server neighbourhood is illustrated between the cellar, and the tunnel. This is a good example of where the population can be kept low (limited numbers will fit in a cellar or tunnel) to limit the amount of traffic between the two POR's. Parent-child relationships imply neighbourhood (what happens in the clearing is heard and seen in the cottage and vice versa) but only by one step - what happens in the clearing is NOT seen or heard in the cellar. Considerations: I expect the majority of traffic to be chat and object-description exchange. Such traffic should mostly be contained within a single server, with descriptions being drawn from cache, not database. Cross-server traffic is likely to be '/tells' or events occuring 'at the edges' of a server boundary. Good (physical) world design should keep this low. I don't want to enforce a 'zone limit' for occupancy unless I can possibly help it, apart from areas where such a limit makes sense (typically room-like areas). This is an option open to a game designer, and although potentially useful is not a requisite of the library I'm building. I do want to allow more than 50 players to gather in a quiet location and let them meet without them lagging up an entire server, especially somebody's combat. In games where real-time (or almost real-time) combat is used this would be highly annoying. There are a *lot* of mobiles wandering around the world. The system is NOT designed to cope with 'blanket coverage' of players, but more with uneven distributions. Mobiles are aggregated and simplified when not visible to a player, and indeed can usually be aggregated even when visible. The system's designed to be configured at startup (before players connect) so the master server can inform all the various slaves what they need to manage and allow time for load up and synchronisation of the abstract simulation layer (simplified geography and demographics used for statistical simulation of mobiles, resources etc). This takes some time. Congestion here I define as the situation where the network exchange TO CLIENTS in the server process update times out significantly and persistently, apparently due to traffic to an isolatable (leaf) point of relevance. Note that traffic is measured in complete UDP message sizes. Packet storming is a server security issue and is dealt with at the UDP level itself. The strategy I have in mind is as follows: Defining significant as a point where more than 10% of network traffic is left over still to process persistently; Defining persistently as for a period of approximately 3 second, or proportionately less if the volume of outstanding network traffic (congestion) increases. On detecting a congestion situation in a server process, we locate areas of congestion within that process' - first looking at leaf nodes averages, then progress up the relevance heirarchy until a disproportionate congestive average is found at a certain layer. We eventually determine an area of the heirarchy that is responsible for the congestion, and know how much traffic it has pending, how much it typically generates, and also the statistics of all processes, and as a result each box running such processes in the microcluster. What I can't decide is what to do with the guilty chunk of heirarchy once it's identified, and I'm trying to think of ways of reintegrating the heirarchy once it's no longer necessary to be segregated from its greater body - and indeed how to judge that situation. When such a segment of the relevance graph is definitely causing congestion it seems obvious to transfer ownership of that chunk to a quieter process. This has the obvious flaw of causing fragmentation of the relevance graph, which is a Bad Thing. I need to come up with a means of determining whether the cost of transferring a chunk of the graph is worthwhile, and some form of 'defragging' the graph occasionally. Does anyone know of any systems that do this, or has anyone got any ideas for things I might need to track to make this work efficiently? Sorry for the (exceedingly) long post, but hey, I'm scratching my head... and I need coffee. [Edited by - _winterdyne_ on September 19, 2005 5:49:04 AM]

Networking and Multiplayer Programming

Started by _winterdyne_ September 17, 2005 07:22 AM

50 comments, last by _winterdyne_ 19 years, 3 months ago

hplus0603

11,940

October 13, 2005 11:21 AM

I could tell you how it all works, but first you'd have to sign a bunch of legal papers :-)

Quote:
node relations can be queried centrally

The only thing we really need central querying for in the entire system is the relation "given this object ID, what is the home storage server for that object". Everything else is distributed, and scales by adding more discrete hardware, in one way or another.

We don't use Beowulf, but instead built our own application-layer clustering infrastructure.

Simulating objects will never make a remote query within the time of a single step -- doing that would kill performance. In fact, we could probably tolerate having a distributed data center (different servers in different centers), although that's not something we're officially supporting nor currently working to support.

We run one server process per machine. Running multiple processes has no advantage, because the area served by our processes can be irregular in shape (and even discontiguous, although that's usually not a great idea for other reasons). If we need to shift load, we change the area that each machine is responsible for, rather than moving the processes. Each simulating object knows how to move itself to the "most optimal" server for that object, so when we change around mappings, the appropriate objects will automatically migrate. Usually the players won't notice when their objects migrate (because of the "seamless streaming world" implementation, which already involves real-time migration).

enum Bool { True, False, FileNotFound };

_winterdyne_

Author

530

October 13, 2005 12:05 PM

Quote:
Original post by hplus0603
I could tell you how it all works, but first you'd have to sign a bunch of legal papers :-)

Isn't that always the way? :-)

Quote:

We run one server process per machine. Running multiple processes has no advantage, because the area served by our processes can be irregular in shape (and even discontiguous, although that's usually not a great idea for other reasons). If we need to shift load, we change the area that each machine is responsible for, rather than moving the processes. Each simulating object knows how to move itself to the "most optimal" server for that object, so when we change around mappings, the appropriate objects will automatically migrate. Usually the players won't notice when their objects migrate (because of the "seamless streaming world" implementation, which already involves real-time migration).

So, given a change in area on a particular machine/process that change has to be migrated to all processes in the grid? You've stated you use a modified quadtree, I take it this is used to determine which is the most optimal server, given an objects extents and the known areas covered by each process in the grid. Elegant, given a fixed origin coordinate system. I also assume you are generally dealing with a 2D world (as far as zones are concerned).

Couple of questions, I was reading up on DungeonSiege's continuous world design and they ran across floating point precision errors at large distances. In short they overcame this by using an alterable point of reference. Are you using sliding scales for determining quadtree nodes (a 10km tree vs a 1m tree)?

Also, given an irregular shape, how do you determine continuity? Colinear edges on area perimeters? It's one of the reasons my fixed POR's have AABBs rather than arbitrary - I considered the design difficulties of placing continuous arbitrary hulls nightmarish, not to mention the fact I always hated Tetris, especially in 3d, whereas most people can easily figure out how to put together axis aligned box.

Winterdyne Solutions Ltd is recruiting - this thread for details!

_winterdyne_

Author

530

October 14, 2005 05:40 AM

The actual operation of node transformations on events is independent of the number of players. So one, or a thousand players, if only a few events occur at zone boundaries, the system works pretty much the same. Problems occur only if many events are crossing zone boundaries within a short time frame.

Arranging the layout of such areas that interesting, congregation-attracting content is near the centre (out of event range of the boundaries) alleviates this problem.

If we know transformations between neighbouring coordinate systems for zones, when an event crosses a boundary it can be implicitly converted to the new coordinate system when placed in the events list / queues for the new zone.

With event comparisons being done within each zone/process, rather than in a centralised place, there is no need for a unified coordinate system, since events and boundaries can be transformed to the appropriate coordinate system as they migrate. There is little overhead to doing this in addition to the network transmission of those events between machines.

Winterdyne Solutions Ltd is recruiting - this thread for details!

_winterdyne_

Author

530

October 14, 2005 08:43 AM

Yes, in the situation where the congregation is not in a divisible area (loads of players at the bank), or you have one large congregation in a zone as opposed to several smaller ones, there isn't really a lot you can do, whether or not you have a unified coordinate system. Such areas can be designed to minimize the cross-boundary traffic.

I disagree that the conversion cost of events would be substantial, especially compared to the updates that are running regardless of event migration - even the migration of a complex object's collision hulls only implicitly needs a position and quaternion for orientation ([Edit] and subsequent loading of the appropriate mesh, unless a similar mesh is already in play). Subsequent transformations of hull geometry are the responsibility of the physical simulation layer which would occur anyway.

It's more likely that problems occur from the data associated with an event, especially chat events, which have larger data sizes than most other events, and do not loan themselves well to lossless compression methods (like RLE, for example). I reckon network lag will hit before CPU strain.

Edit: Spelling.

[Edited by - _winterdyne_ on October 14, 2005 9:43:28 AM]

Winterdyne Solutions Ltd is recruiting - this thread for details!

hplus0603

11,940

October 14, 2005 02:29 PM

Quote:
I also assume you are generally dealing with a 2D world (as far as zones are concerned).

No, it's a full 3D earth-sized planet. One of the modifications in the quadtree is to make it map to a full sphere, although it does wedges centered on the center of the planet. We could easily use an octree instead; the specifics don't matter that much.

We use double precision for all physics, so we don't have to worry about the sliding scale. That way, we can cover an area from +/-8,000,000 meters without worry. Which means we'd have to switch to another coordinate system if you traveled to the moon. We have the technology to support that, but haven't had the need to implement it.

Regarding events crossing zone boundaries, you can make the observation that all events are either: 1) generated by predictable server-side algorithms or 2) generated by unpredictable users. Then you can build the entire system to make sure that both sides see the same thing, without necessarily having to send anything across at all. If you are really gung-ho on the gory details, then please apply for our open jobs :-)

enum Bool { True, False, FileNotFound };

_winterdyne_

Author

530

October 15, 2005 05:42 AM

Quote:
I was thinking of you having continuous floods of events (ie- position updates) crossing seamless boundries and traversing your hierarchy (and having to be translated numerous times as it spread to many adjacent areas). But if you can eliminate cases of seeing objects at long Line Of Sight (etc) from a players viewpoint (thus having few cross boundry cases) then the problem pretty much goes away.

My PORs can specify what event types they can 'pass along'. For example, designing an anteroom where the entrance and exit do not share a common line of sight can be used to curtail boundary traffic. In most cases event migration is handled by parent POR - this prevents looping in migration paths.

Again, this is two parts layout, and on part architecture - the POR mechanism is not designed for very high activity adjacencies - where possible any hierarchy splitting must be done at low activity adjacencies.

Quote:

The loading delay problem calls for a preload threshold to pull data into memory ahead of need. This buffer area then becomes its own headache with predicting and prioritizing when data needs to be preloaded across machine boundries (or complex hulls etc.. precalculated ) before the actual transition between zones takes place.

My libraries maintain two discrete levels of simulation - coarse, and fine.

Coarse simulation abstracts away fine route finding (falling back to predesigned coarse routes (less nodes), large amounts of physics, actual interactions, and is used to simulate world behaviour when a client is not aware of it. A server's entire domain is modelled at this level. (NPC) Entity migration can occur at this level for special circumstances, but the NPC data itself is not required.

Fine simulation is the interactive level, where PORs that contain client entities, and PORs that are potentially relevant to clients, as specified by the relevance graph (and manually definable Potentially Relevant Sets for each POR) are modelled. This is the level where real collision detection and entity movement (fine detail route maps) etc. are handled. Asset management runs on recent use stats, with items used infrequently or a long time ago being unloaded first.

Part of the load process is the conversion of abstract data stored in the coarse simulation (say NPC position-scale along a coarse node map link) to finer detail simulation data (fine node map position) and finally to actual physical position of the entity.

Quote:

Default place holder data is handy when the transition happens before the data transfer is completed...

Actually I use placeholders for a lot more besides - including asset substitution on the Client, if or load of asset is pending on the client, we can define that any sword might use a default sword item.

Using placeholders for any critical event on the server is not a great idea. It's more sensible to use common data - for example every creature has a similar collision hull, altered only by scale, position, orientation, whatever - it's then easier to ensure the server remains authoritative.

Winterdyne Solutions Ltd is recruiting - this thread for details!

_winterdyne_

Author

530

October 16, 2005 10:02 AM

Yes, I have a similar mechanism for quest generation - I think this is the way a lot of projects will (sensibly) go for small team, or large gameworld projects- manually scripting such quests and events is labour intensive, and that's something the small teams can't afford.

In the case of trying to implement systems like this, a very structured set of abstractions have to be used - in effect we end up writing not just one MMOG, but several, as events have to be mirrored or handled at several layers of abstraction.

It remains to be seen whether a product implementing this kind of system succeeds in presenting a believable world - there's a discussion in the game design forum (I believe) about random events in RPGs. Using randomly generated plotlines can always lead to a contrived feel, and avoiding this is the holy grail as far as procedural content generation is concerned.

It also remains to be seen whether the efforts we put into designing these systems are rewarded by something playable at the end.

Winterdyne Solutions Ltd is recruiting - this thread for details!

hplus0603

11,940

October 16, 2005 11:58 AM

If your terrain is procedurally generated from controlling source data, why send it from the server at all? We use the same kind of approach, and just pre-install the controlling source data; we then generate terrain just-in-time on the client.

enum Bool { True, False, FileNotFound };

_winterdyne_

Author

530

October 16, 2005 12:53 PM

The above was me... accidentally deleted a cookie. :-)

Winterdyne Solutions Ltd is recruiting - this thread for details!

_winterdyne_

Author

530

October 17, 2005 04:22 AM

Mmm... cellular automatae... you've been reading the same papers I have I think.

Actually, a lot of your abstraction level stuff is spookily similar to my own, especially the multiple 'levels' of activity... Given that you're operating on a fast LAN and that your terrain is deformable, I understand your requirement for streaming terrain now. You're going to need *really* fast disc access for your boxes in this case, with everything being loaded in and out.

How do you handle LOD-level boundaries between servers? Do you synchronise terrain data between servers, or let multiple servers talk directly to your clients?

Persistence is a fun topic here, too - do you dump your terrain data into the DB or store it locally? That said, you don't even mention if your simulation IS persistent (between runs).

Winterdyne Solutions Ltd is recruiting - this thread for details!

Distributed server architecture for load balancing

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Distributed server architecture for load balancing

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines