DISCLAIMER: I originally posted this on gamedev.stackexchange.com however I was told by one of the mods that my questions would be better suited for GDNet; and hence this post.
Introduction
I'm currently starting my very first multiplayer game and for the networking model, I'm leaning more towards a peer-to-peer model as opposed to a client-server model due to the benefit of not having to purchase a dedicated server.
However I want my game to be able to handle a large number of users concurrently (around 50-100 users) in a reasonable manner.
My Current Proof of Concept
So far, I've come up with a proof of concept that can achieve moderate speed and moderate load balancing across the network; although I'm sure it can be improved on.
For it to work, one of the peers has to be designated as a central authority (i.e. the host).
The host then receives updates from users through a star topology and then broadcasts the changes to other users through a tree topology (see image below)
As shown in the image above, half of the users are leaf nodes and therefore won't be responsible for relaying any information.
The other half of the users however will each be responsible for relaying their information to 2 other users.
The advantage of this is that packets can be sent out in parallel which in turn allows moderately high speed
Furthermore because one half of the users are responsible for relaying information to 2 other users, and the other half is not responsible for relaying any information, this achieves moderate load balancing (better than having one user send out all the packets).
Also note that since each peer is just relaying information from the host, and not modifying it themselves, we can have the host digitally sign their messages (with a timestamp) to verify their authenticity (the host would have to first distribute their public key to every single user on the network).
Problem
However one thing that this proof of concept doesn't consider is "churn" (when users join and leave the network).
For example, if a user randomly disconnects without notifying the host, then all of it's children (and it's children's children) will lose connection as well.
Possible Solution
Upon searching the web, I came across one paper which uses a "fat tree" topology. In their version, they use "heartbeats" to detect unexpected departures.
From the paper itself:
To detect unannounced departures, Nemo relies on heartbeats exchanged among the cluster’s peers. Unreported members are given a fixed time interval before being considered dead, at which point a repair algorithm is initiated. If the failed peer happens to be a leader, the tree itself must be fixed, the members of the victim’s cluster must elect the replacement leader from among themselves.
One problem I can see with this "heartbeat" method is that it generates a lot of overhead on the network; especially when you're dealing with 50-100 users.
Another Possible Solution
One possible solution that I've thought of is to have nodes "complain" (using TCP) to the host when they don't receive a certain number of packets within a certain time interval (i.e. a threshold).
When the host receives a complaint, it temporarily removes the node's parent from the network and promotes the child to it's parent's place. If the parent has not disconnected, it will eventually complain as well for not receiving any packets; in which case we can replace the parent back in it's former position. However if the parent does not complain, then we can assume that the parent disconnected and remove them from the network.
An advantage of this method is that it gets rid of freeloaders as well. However a disadvantage is that it does not have any load balancing since all the "complaints" are being sent to the host.
Conclusion
So my question is what are some popular methods employed in peer-to-peer networks for dealing with churn? What are some of their advantages/disadvantages? I'd appreciate it if people can link me to some resources as well.
TL;DR
Trying to implement a peer-to-peer massive multiplayer game (50-100 players) which has the properties of:
- minimal latency
- good load balancing
- good resilience to churn (more emphasis on this)
Additional Note
A common method used in today's "peer-to-peer" networked games is to assign one of the peers as the server and then have all the other peers act as clients. Although this allows an easy way to deal with churn, it does not have any load balancing at all (since it's basically a client-server model). And this doesn't scale well when there are 50-100 users (which is something that I'm trying to achieve).