Why is that necessary?
It's not strictly necessary. The two services were tightly coupled (the TCP server knows about the Records service, and the Records service maintains a list of TCP server nodes), so a change on one side required a change on the other. We thought it was a good idea to decouple them and keep it mostly a one-way connection, from the TCP server to the Records service. Partly for maintenance, and partly because it'd be less of a headache if we ever need to scale.
When the TCP server crashes, who is responsible for telling the record server that the user isn't connected?
If the server app crashes, we have an autorestart policy. When the server comes back up, the connected status of its clients gets updated: every session that was attached to this server is treated as invalid.
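For illustration, that startup invalidation could look roughly like this, assuming a hypothetical HTTP endpoint on the Records service (the endpoint, URL, and node ID below are made up, not our actual API):

```go
package main

import (
	"fmt"
	"net/http"
)

// clearNodeSessions asks the Records service to mark every session
// attributed to this TCP node as disconnected. Called once at boot,
// before accepting new connections, so sessions from the previous
// run can't linger as "connected".
func clearNodeSessions(recordsURL, nodeID string) error {
	req, err := http.NewRequest(http.MethodDelete,
		fmt.Sprintf("%s/nodes/%s/sessions", recordsURL, nodeID), nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("records service returned %s", resp.Status)
	}
	return nil
}

func main() {
	if err := clearNodeSessions("http://records.internal:8080", "tcp-node-1"); err != nil {
		panic(err)
	}
}
```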
If the hardware crashes, we have no disaster plan for that so far. The other services won't find out until something attempts to reach the client. Any recommendations?
How? Through a periodic cleaning operation?
There isn't any periodic cleaning operation, at least not at this time. It's probably something we should add in the near future.
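If we do add it, the usual shape is a lease: each TCP node heartbeats the Records service on a short interval, and a periodic sweep invalidates the sessions of any node that has gone silent too long. That would also cover the hardware-crash case above. A minimal in-memory sketch, with the registry, intervals, and node ID all illustrative:

```go
package main

import (
	"log"
	"sync"
	"time"
)

const staleAfter = 30 * time.Second

type registry struct {
	mu         sync.Mutex
	heartbeats map[string]time.Time // nodeID -> last heartbeat
}

// Heartbeat is called by each TCP node on a short interval.
func (r *registry) Heartbeat(nodeID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.heartbeats[nodeID] = time.Now()
}

// sweep marks every node silent for longer than staleAfter as dead;
// this is where its sessions would be flagged as disconnected.
func (r *registry) sweep() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for node, seen := range r.heartbeats {
		if time.Since(seen) > staleAfter {
			log.Printf("node %s stale; invalidating its sessions", node)
			delete(r.heartbeats, node) // real code would update session records too
		}
	}
}

func main() {
	r := &registry{heartbeats: map[string]time.Time{}}
	r.Heartbeat("tcp-node-1")
	for range time.Tick(10 * time.Second) {
		r.sweep()
	}
}
```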
Our TCP server is deliberately dumb: it doesn't ping clients once in a while, and it doesn't try to make sense of the protocol beyond authentication. We tried to add something like that, but the biggest problem was reliably detecting that a client has disconnected. We've had cases where the server still believed a client was connected; writing data down the stream didn't trigger any error until minutes later.
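For reference, the common workaround for that is to combine OS-level TCP keepalive with an application-level read deadline, so a silent peer is noticed in seconds rather than waiting minutes for TCP retransmission timeouts. A rough sketch in Go, assuming clients are expected to send some traffic (data or a heartbeat) regularly; the intervals are illustrative:

```go
package main

import (
	"log"
	"net"
	"time"
)

func handle(conn *net.TCPConn) {
	defer conn.Close()

	// OS-level keepalive probes catch silently dead peers eventually.
	conn.SetKeepAlive(true)
	conn.SetKeepAlivePeriod(15 * time.Second)

	buf := make([]byte, 4096)
	for {
		// If the client sends nothing within the deadline, Read fails
		// and we treat the client as gone.
		conn.SetReadDeadline(time.Now().Add(30 * time.Second))
		n, err := conn.Read(buf)
		if err != nil {
			log.Printf("client %s considered disconnected: %v",
				conn.RemoteAddr(), err)
			return // here the Records service would be notified
		}
		_ = buf[:n] // hand off to the normal protocol handling
	}
}

func main() {
	ln, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn.(*net.TCPConn))
	}
}
```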
You probably also want to use publish/subscribe rather than polling to propagate online information -- when I log on, I subscribe to the online status of all my friends, rather than having to poll for each friend at some interval. Again, both for user responsiveness (polling has to be slow) and for implementation efficiency (polling uses many orders of magnitude more resources).
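As a minimal illustration of that pattern, here's an in-process sketch of a presence bus; a real deployment would use a broker such as Redis pub/sub or NATS, and all the names here are made up:

```go
package main

import (
	"fmt"
	"sync"
)

type PresenceEvent struct {
	UserID string
	Online bool
}

type PresenceBus struct {
	mu   sync.Mutex
	subs map[string][]chan PresenceEvent // watched userID -> subscribers
}

func NewPresenceBus() *PresenceBus {
	return &PresenceBus{subs: map[string][]chan PresenceEvent{}}
}

// Subscribe registers interest in one friend's status; called once
// per friend at login instead of polling on an interval.
func (b *PresenceBus) Subscribe(friendID string) <-chan PresenceEvent {
	ch := make(chan PresenceEvent, 1)
	b.mu.Lock()
	b.subs[friendID] = append(b.subs[friendID], ch)
	b.mu.Unlock()
	return ch
}

// Publish pushes a status change to everyone watching that user,
// so updates arrive as they happen instead of at the next poll.
func (b *PresenceBus) Publish(ev PresenceEvent) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs[ev.UserID] {
		select {
		case ch <- ev:
		default: // drop rather than block on a slow subscriber
		}
	}
}

func main() {
	bus := NewPresenceBus()
	updates := bus.Subscribe("alice") // I log in and watch my friend alice
	bus.Publish(PresenceEvent{UserID: "alice", Online: true})
	fmt.Println(<-updates) // {alice true}
}
```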
Thank you for this. We'll certainly keep it in mind. Once we reach hundreds of users, we'll need to redesign the communication layer.