Distributed server architecture for load balancing

_winterdyne_ · 2005-10-27T12:47:30

Okay, seeing as my previous architecture thread didn't seem to draw a lot of flak, I guess most of it was at least semi-rational. I hope. However, I'm trying to puzzle a way around congestion based lag, and I'd like to ramble about it where it can get critique. Pick it apart as best you can please. Apologies if this seems long and drawn out, it is. But I hope you'll find it a relatively interesting read. I mentioned a relevance layer previously, and since what I'm talking about relies on the concept, I'll describe it here, along with a brief description of the architecture. Architecture: A game instance is run on one or more machines (boxes) and consists of a three kinds of processes - a game database server (pretty standard), a 'master' (world) server and a number of generic 'slave' (zone) servers, which operate in a heirarchy. There need only be one server process running on any given box, but there can only be one 'master' within a cluster (sometimes referred to as a microcluster). Edit: Connected clients are handled through a connection object which migrates between server processes as clients move around. The overall structure of the system looks sort of like this: Relevance Graph: The game universe in the system is not broken up into rectangular 2d zones, but instead is organised more like a heirarchical (sp?) scenegraph, with network-aware areas of variable size. This is referred to as the 'relevance graph'. Nodes in this graph can be thought of as 'points of relevance', and coincide with physical features, such as rooms, areas of terrain, etc. Actors and other objects in the game universe a 'relevance limpets' and are *always* attached to a point of relevance. The relevance graph is used to determine event relevance within the game system, both at the network and simulation layers. Anything that happens in a point of relevance is 'most likely' going to be important to everything contained in it. Neighbouring points may only receive events of a certain type (a soundproof, sealed glass box might only receive visually-oriented messages). The graph allows a designer to set what spreads how at any point in the world. Events also have a given radius of effect which is checked with the relevance graph as well. Neighbourhood is not used here to imply physcial adjacency, just that events of certain types occuring within (or passing through) one point of relevance may affect another. Absence of neighbourhood implies that points should have NO bearing on each other. Shortcuts are used around this system in some places, where a specific entity is the target of an event. The master server maintains a list of where all entities are, so a message can be efficiently delivered without propogating along the heirarchy as normal. A server process is given a node of the relevance graph to deal with, and it deals with that and all points leafward (generally 'contained by') of that point, unless another server takes control. Slave servers are also informed of a master, which will manage the addition of slaves. Slaves are informed of neighbour or child relations with other slaves, so direct communication amongst themselves is possible. All of the points of relevance that a server process has control over are referred to as its 'domain'. A domain, functionally, is close to a traditional mmorpg zone, whereas the point of relevance is more conceptually similar. Update process: Updating the relevance graph occurs in three phases, all of which occur in one server 'tick' in the server update threads. The threads on each of the servers in a microcluster are brought into sync during this process. Firstly, a logical update runs, which fires the events from their generators and adds them to any relevant recipients. High priority UDP is used to transfer these across server boundaries if required. This process must complete before the second phase can begin. This phase is started by the master server and triggers the process in slaves through a cascade effect. The second phase of the update is the handling of receivers that have had an event passed to them. This may include the addition of messages to those receivers' outbound queues, including appropriate network messages to an avatar's associated client. This process occurs simultaneously on all server processes in the microcluster, and is synchronised (started, flagged finished) by the master server. Multi processor servers may utilise worker threads to process more than one part of the heirarchy simultaneously. The final phase is network transmission (incoming messages are handled by a separate thread) to clients - this is done on each server in batches, a POR at a time. Firstly out of date or obsolete empirical state updates are discarded from the outbound queues, then those queues are processed and sent. This process has an overall timeout value, and unsent messages are preserved for the next cycle (and are sent in order/priority). Timeout occurences indicate congestion in the area and mark the process as congested. Timing and data stats are kept for data transmission and time to transmit at each POR and can be used to determine where the congestion is actually occurring. UNDER time exits (early finish) for this process is also recorded by the server process. Stats are transferred up to the master server regularly for congestion detection and handling. Illustrative example (edit): Consider the diagram below: Each box represents a node in the relevance graph. Here we have a simple game world where the two areas, Dark Forest and Dwarf Mountain are segregated. Both are 'adventuring areas' and it's assumed the player base will usually be evenly split between these two areas. As such, Dwarf Mountain has been assigned to a slave server. The heirarchies in each domain exhibit internal neighbourhood - Dark_forest_main, the POR that models the bulk of the Dark Forest is designated as a neighbour of the Dark_forest_clearing. Any message generated in either of these can have an effect (subject to range, etc) on the other. Cross-server neighbourhood is illustrated between the cellar, and the tunnel. This is a good example of where the population can be kept low (limited numbers will fit in a cellar or tunnel) to limit the amount of traffic between the two POR's. Parent-child relationships imply neighbourhood (what happens in the clearing is heard and seen in the cottage and vice versa) but only by one step - what happens in the clearing is NOT seen or heard in the cellar. Considerations: I expect the majority of traffic to be chat and object-description exchange. Such traffic should mostly be contained within a single server, with descriptions being drawn from cache, not database. Cross-server traffic is likely to be '/tells' or events occuring 'at the edges' of a server boundary. Good (physical) world design should keep this low. I don't want to enforce a 'zone limit' for occupancy unless I can possibly help it, apart from areas where such a limit makes sense (typically room-like areas). This is an option open to a game designer, and although potentially useful is not a requisite of the library I'm building. I do want to allow more than 50 players to gather in a quiet location and let them meet without them lagging up an entire server, especially somebody's combat. In games where real-time (or almost real-time) combat is used this would be highly annoying. There are a *lot* of mobiles wandering around the world. The system is NOT designed to cope with 'blanket coverage' of players, but more with uneven distributions. Mobiles are aggregated and simplified when not visible to a player, and indeed can usually be aggregated even when visible. The system's designed to be configured at startup (before players connect) so the master server can inform all the various slaves what they need to manage and allow time for load up and synchronisation of the abstract simulation layer (simplified geography and demographics used for statistical simulation of mobiles, resources etc). This takes some time. Congestion here I define as the situation where the network exchange TO CLIENTS in the server process update times out significantly and persistently, apparently due to traffic to an isolatable (leaf) point of relevance. Note that traffic is measured in complete UDP message sizes. Packet storming is a server security issue and is dealt with at the UDP level itself. The strategy I have in mind is as follows: Defining significant as a point where more than 10% of network traffic is left over still to process persistently; Defining persistently as for a period of approximately 3 second, or proportionately less if the volume of outstanding network traffic (congestion) increases. On detecting a congestion situation in a server process, we locate areas of congestion within that process' - first looking at leaf nodes averages, then progress up the relevance heirarchy until a disproportionate congestive average is found at a certain layer. We eventually determine an area of the heirarchy that is responsible for the congestion, and know how much traffic it has pending, how much it typically generates, and also the statistics of all processes, and as a result each box running such processes in the microcluster. What I can't decide is what to do with the guilty chunk of heirarchy once it's identified, and I'm trying to think of ways of reintegrating the heirarchy once it's no longer necessary to be segregated from its greater body - and indeed how to judge that situation. When such a segment of the relevance graph is definitely causing congestion it seems obvious to transfer ownership of that chunk to a quieter process. This has the obvious flaw of causing fragmentation of the relevance graph, which is a Bad Thing. I need to come up with a means of determining whether the cost of transferring a chunk of the graph is worthwhile, and some form of 'defragging' the graph occasionally. Does anyone know of any systems that do this, or has anyone got any ideas for things I might need to track to make this work efficiently? Sorry for the (exceedingly) long post, but hey, I'm scratching my head... and I need coffee. [Edited by - _winterdyne_ on September 19, 2005 5:49:04 AM]

Networking and Multiplayer Programming

Started by _winterdyne_ September 17, 2005 07:22 AM

50 comments, last by _winterdyne_ 19 years, 3 months ago

_winterdyne_

Author

530

October 19, 2005 04:41 AM

Quote:

The clients currently use a 'change timestamp' to see if they have to request a new copy of an area which may have deformed/changed since the client last had that area in a LOD3 context. The client only keeps a connection to 9-16 LOD3 areas to get sent their update event streams, but may pass thru a succession of hundreds if the player moves in a straight line(these get cached to disk on the client).

This implies visual range is less than 1 area. I use a similar method for terrain chunks, using specialised fixed POR's to do so. I don't consider it necessary to use a connection per area, (I'm actually using UDP so connection is an abstract term) and can use a quadtree to rapidly determine relevant areas for a client, since fixed POR's have a unified orientation - a common coordinate system is derived from the offsets of the POR's involved. The quadtree tests are expandable to non-uniform zones, since there is a policy of containment for parent-child relations. (If a parent is completely enclosed in a quadtree test, we can assert that it and its entire sub-hierarchy is relevant). Then relevance layer culling can be performed to narrow the required set.

Quote:

I dont know if there is much different for anybody's 'abstract' execution without getting down to specific details. The interactions are simplified/generalized and based on logic/equations to approximate what goes on far from the client/players and enact some kind of reasonable/likely patterns of behavior. There still can be a more than a few detail/factors/coeffcients/states to represent the 'minimized object (and require scripting complexity that rivals what current games have for the 'full' detail AI).

I find that in general abstracted systems are difficult to generalise, and can vary wildly depending on the kind of behaviours you're trying to simulate. I had the priveledge of attending an audience with Jeff Minter demonstrating his game space principles which was fascinating in this regard.

Quote:

The disk use isn't quite as bad as you might think, as zone transitions dont happen for each client too frequently (area size being reasonably large - I projected crossing an area at run took 30 seconds) and the preloader gives adaquate time for data to move 'in_mem' ahead of need (normal disk requests serviced in seperate thread...)

Although threading only provides advantages in multi-processor hardware, I operate a similar method (I've segregated network transmission, which is my presumed bottleneck, most of my simulation held in RAM).

Quote:

Currently the design is to have the client talk directly to the server which owns each 'area' (client gets notified of any change of ownership). The master server owns the DB and distributes copies when 'areas' get farmed out to other slave-servers (eventually it will probably do nothing but this distribution and assignment). There is only one active copy of the data allowed on the servers.

Similar to the discussed method of farming off 'domains' above, then.

Events will be transfered across boundries (to a different server if needed) and most events wont travel more than one area away from the area they occured in (again because of the large size of the areas) and like what you described, there will be an event magnitude filter radius to cull out unneeded inter-area event transfers. Each area has its own event accumulation queues (possibly with a prefilter on insert). 'Realized' objects have their own event queues and events directly effecting them are added, otherwise the object may monitor the event queue of the area it is in and any appropriate adjacents'.

Quote:

The regular structure (grid) of the primary terrain makes boundry checks easy and unitizes alot of operations BUT the 'structures', 'tunnels', and 'mobiles'
offer many of the complications of your irregular areas (and Im sure to face the same problems which I hope can be localized).

These aren't as hard as you might think, providing you have a fixed hierarchy. A structure place directly on a terrain area would pass its event to its parent for distribution, and that parent would, if necessary, forward it on to adjacent terrains. It's important to be aware that there is an additional lag in event forwarding - and a certain amount of synchronisation and timeout checking has to be expected. Ensuring that interesting content is easily contained by a particular domain is the caveat - people won't hang (in great numbers) around the edges of your areas if there isn't anything to do there. You might get some griefers hoping to catch someone with load-lag, but given a good prefetching scheme, this shouldn't be too much of an issue.

Quote:

I have yet to test to see if a broadcast of events on the LAN will have any advantage. They will most likely be marshaled into groups of events to hold down the packet count.

I really wouldn't broadcast to all areas unless absolutely necessary. System alerts ('The server will shut down in 10 mins') ok, but whatever you do, localise chat. An MMOG is just a chatroom with toys.

Quote:

My data for each area is packaged as a contiguous block with internal offsets for pointers and its own heap/freechains. This makes it easy to relocate
without alot of serialization overhead. 'Lackey' objects contract into small sets of coefficients and 'significant' objects are only markers as they live in an AI process and dont contract - they interact as if they were clients and maintain a much more complex world representation.

So you transfer ownership without commital to the DB / central store? I assume you assert the completion of such a transfer in some way? In my case the hierarchy helps, since the parent can cache events for transferring objects, and then forward them to the appropriate new domain, once relocation is complete. A lag is incurred for events directly related to the transferring object, but other event targets can proceed 'normally', allowing for access lag on the machines involved in the transfer.

Quote:

Its definitely a persistant system. The areas once built, are allocated a slot in the DB files and then can roll in and out of memory quickly -- simple random access and 'synchonization' files written at shutdown from the world state held in memory(area index web, etc..). Checkpoint copies of all the data files will probably be needed to minimize impact of server abort. (copying of Gigs may require some copy-while-updating methodology).

So you're commiting to DB when areas are LOD'd out, or at shutdown?

Quote:

The AI data and processing for the 'significant' objects is going to dwarf the rest of the system.

The AI is going to (Ive only done preliminary work for this part) do planning
using a hierarchical/inheritted set of behaviors (goal solutions) and a preference system to decide 'best' solution selection and priority.
I want(plan) to have a self adjusting system for the 'preferences' that will analyze logs of previous actions/results (probably offline) and try to determine better solution selections for particular situations.
This part of the project is probably going to take years (in typical AI, the data (scripted logic) usually takes 90% of the effort and remainder the engine).
The simulation engine is really only the visualization/simulation needed to facilitate my interest in AI (someday I actually want to get back to that part of it).

Sounds interesting! I've done preliminary work on the AI for Bloodspear (the AI included in the libraries basically will comprise A* and some requirement / supply / risk assessment heuristics) but similarly, that part of the project is beastly, even though I'm not implementing any form of learning system.

Winterdyne Solutions Ltd is recruiting - this thread for details!

_winterdyne_

Author

530

October 21, 2005 05:10 AM

Quote:

I find that in general abstracted systems are difficult to generalise...

I can't believe I wrote that. Gah! It's these posts getting longer and longer, I swear... I'll try not to contradict myself in the same sentence again. :-)

Quote:

Yes the detail can be alot if you have things like complicated faction politics or strategic considerations -- there may be abstract entities of different complexity ie- a town plays politics so needs alot of attributes, but another area is a farm that only supplies basic resources (thus simple coefficients ..)
The town has active AI but the farm is mostly passive (is manipulated by the active entities). The interesting part of such a mechanism is to translate the actions in 'abstract' mode into all the appropriate changes when the area is 'realized' again (in full detail).

I have a similar tiered structure lined up for Bloodspear, although rather than keeping everything AI controlled, I'm planning on letting players loose with the controls, within certain bounds.

Quote:

Actually having the a file thread can add performance improvements because that thread goes to sleep for long periods while the file operations complete...

True, you end up with slower behaviour in both threads, but at least file access won't stall your app. That said, you can end up with synchronisation issues unless your prefetching algorithm is good.

Quote:

I currently have 4 thread -- server, file, network, and application monitor console.

I haven't implemented a file thread (no need to yet) but otherwise, same here. I've not tried running more than one server process (using the architecture above yet - still coding it) but I suspect it may get confusing with the multiple logs. I'm planning a separate monitor client that will merge event streams and be able to sort them and perform other 'friendly' functions. Sure, it'll cause loads of extra lan traffic (and I'd no way do this with a complex gameworld), but it may well help with debugging event transfers.

Quote:

All my listening objects are on a list in every area they are inside of or adjacent to (within that 3x3 LOD3 set of areas) and adjust their membership in those listen lists when they cross the regular boundries.

Therefore all listening objects are transmitted to directly by the areas that are relevant to them. This presents an issue when you have a large number of mobile listening objects. Crossing each boundary causes a minimum of 3 releases and 3 acquisitions for LOD3 definitions alone. The mobile listener then has to be packed and transferred to the appropriate new area 'owning' server.

I'm starting to think about network capacity with all this - bearing in mind the streaming terrain data as well - you might need more than one network thread and a dual-network setup a local network on a fast switch and an external network (behind fast, smart router) with a high capacity pipe. You're going to need some big bucks to build and host this I think.

Quote:

The AI nodes that run the 'significant' AI objects will be a seperate server so the events will have to cross machines anyway (and clients always get the events externally posted...)
I will have to determine if I can get away with having 'dumber' AI objects not look across machine boundries for events -- thus would execute on the server containing the 'area' they live in (and only get events from adjacent areas only on the same server machine.

I guess this would depend on your world design. If certain objects have a low 'activity radius' for want of a better term, if they're centred in an area, you're effectively culling them from the rest of your world.

Quote:

I will have to do a performance test to see if the extra packet handling causes any extra load -- every server having to decide if any objects resident take the events in the broadcast packets basicly the difference between pre filtering and dispatching (source listener lists) versus general broadcast and post filtering (destination listener lists).

If you're going to broadcast at all, I'd suggest that applying both filering methods is a really good idea. Many people 'ignore shouts', which can really cut down on the amount of traffic to despatch externally. Prefiltering the event with a range cuts down the internal traffic as well. Given your heavy use of streaming events (LOD acquisition / release and the streaming terrain data) I'd absolutely apply every filter I could think of to cut down everything else.

Quote:

I dont expect alot of transfering between machines (for load leveling) of area execution. The leveling will mostly be achieved by initial assignment of areas to the least used server (and even then, theres a preference for groupings to keep adjacent areas on the same server).

With your structure as above, a heavy load in an area will drag down everything that's visible to it - simply by virtue that each area is responsible for streaming its terrain. This could get really choppy - you might want to start cutting down frequency of updates for remote terrains.

Quote:

Im not sure yet how the performance of the AI objects will work out. I dont think a single objects processing(and data) will be broken up because that would cause major inefficiency (there may be a few cases like a seperate process{machine} for A* batches that could be seperated).

Batching off A* processing is quite a good idea. The process in question can be made highly specialised. Might not cope too well with an environment that can potentially change moveable paths though. If a terrain gets a steep-sided crater in it, would the routefinders update their maps (and in-progress route calculations) accordingly? I have this problem with nodemaps being rotated as a result of rotation of a mobile POR, subject to external gravity - a ship listing at 50' has a very different set of moveable routes to the ship when it's level. Fun!

Quote:

So you're commiting to DB when areas are LOD'd out, or at shutdown?

The area index web (a sparse array hash) is kept in memory and the map could be corrupted if all the data isnt saved at once. I will probably add a checkpoint mechanism to save all data and copy all the files (!! GBs for the 'area' storage file though).
As I mentioned before, some kind of partial precopy (of out of memory data -- the majority) and quick write of any remaining in-mem data for the checkpoint may be needed to minimize any lockdown duration/delay.

Quote:

Sounds interesting! I've done preliminary work on the AI for Bloodspear (the AI included in the libraries basically will comprise A* and some requirement / supply / risk assessment heuristics) but similarly, that part of the project is beastly, even though I'm not implementing any form of learning system.

Quote:

The AI system plans tasks to solve an object's goals...
This kind of heavyweight behavioral AI can require huge amounts of CPU and I expect there will be a need to keep track of statistics for each individual object to predict how much resource it will need to distribute (load level) the set of all such objects across the available AI servers.

Yikes. I have to say I salute your bravery in attempting to implement this at all, never mind in an online game. A reactive, planning AI is still something that's missing from a lot of single player FPS games. I'd love to hear how you get on with this.

Winterdyne Solutions Ltd is recruiting - this thread for details!

_winterdyne_

Author

530

October 27, 2005 12:47 PM

Quote:

The file thread sleeps much of the time (waiting for file completions - wakeup by system). My current version has a queue of read/write requests from the game server process and is only doing one read/write at a time (I may change this to simultaneous requests if on seperate disks). I also have to investigate how well OSs handles queueing requests (winchester controllers were supposed to optimize head seeks on multiple requests, but Im not sure that the OSs do this effectively these days.....)

I meant slower behaviour during an access (obviously). I've not heard of any head-seek optimisations in any mainstream modern OSs to be honest, not that I've dug for the information. Perhaps some of the heavyweight server-designed *nix OSs do this - might be worth looking into.

Quote:

There are many areas (world grid is upto 4k x 4k areas) but most transitions happen with the destination area being loaded into the same server (this isnt a MORPG is for a much lower expected client count simulation). The handoff to a different area server passes a minimal token (proxy) that references the AInode(Active NPC) or Client(Player) ---- most simplex NPCs (proecssed on server iteself) dont move around alot and rarely would cross server machines.
Also with the size of the areas, transitions happen only about every 30 seconds.

I see, so a server holds many areas, and you have a number of servers. I take it your master server to governs distribution of grid members (server machines) to all server processes (AI and area simulation) in the grid. Not that dissimilar (sp?) to my hierarchy where a given server can have slave server processes under it.

Quote:

Again this isnt a commercial game, rather a low seat count simulation intended for LAN. Im counting on 1000Mbps LAN cards (cheap now -- possibly 2 for the serevr boxes if needed). A typical 'area' is a 64kB block, sent only once. All moving objects will have a stream of position updates (I pack multiple events in the packets and do delta compression). Scenery deformation is patches to te mesh.

I was going to suggest distribution of the deforming event. It's required for deformations that occur over an area boundary and thus need to be distributed to other area servers.

Quote:

Im going for high detail (Im sick of these games with 'pretty' mesh terrain, but are deserts otherwise) so there will be alot of local objects with simple behavior which dont react to things too far away from themselves (thus wont need to react to events outside of their own area).

Yeah, I'm thinking on including certain client-side simulations to make things a little more lively- non-critical critters like birds that play no part in the server simulation but should react to the passing of a 'real' entity by flapping away a distance can be offloaded to the clients.

Quote:

The A* server would have to get streamed the terrain mods and object position update events the same way the clients/AI nodes, and be maintain the same area surrounding the object it serves (sounds like a good use for the second core on a CPU which could share the same area data in memory). My simulation is more 2D oriented for pathfinding , but if you have a need for 3D navigation (and for octtree rebuilding) then the A* will probably require a seperate CPU.

Given the deformable terrain- you're not going to allow 'tunnelling' then? Could it be possible that entities could get trapped in deep pits? Tunelling in terms of both a scenegraph and relevance graph would require dynamic calculation of appropriate 'portals' to assist in culling. An interesting problem.

Quote:

It was my original reason for the 3D and terrain simulation in the first place -- a proper problem space where the behavioral AI can operate AND visualization of whats happening.
Complex AI like this will probably not be possible for MMORPGs (its requires the power of a CPU for each player on the server end), but I am exploreing mechanisms that could still be used as half measures to improve the current games (more reactive objects, semi-intelligent NPCs with more general behavior, better tools to allow player creation of behaviors, template mechanisms to simplify script writing)

I'd love to hear how you get on with this - reactive MMOs are really the next step for the genre in my opinion.

Winterdyne Solutions Ltd is recruiting - this thread for details!

Distributed server architecture for load balancing

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Distributed server architecture for load balancing

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines