I'm making a multiplayer RTS using RakNet for networking.
I set up all clients with the same random seed and sync all commands, so each client simulates its world, with all units, projectiles, etc., in exactly the same way.
(This is the Age of Empires 2 method.)
The problem is that it's not very failproof to maintain as my code grows. I keep track of a checksum and try to make sure random numbers are generated identically on all clients (so they produce the same random values when it comes to projectile accuracy and everything else in the game world).
I now have a desync problem again (the checksum is not the same on all clients) in some situations, and it takes time to find and correct.
Am I doing this wrong? Should I deal with it in another way? How can I more easily find holes in my setup (such as missing things that will lead to different client worlds)? It's too much work to completely switch architecture, so I'm just asking about my current setup.
You'll want to be able to 'record' every simulation step of your game and have some way of playing the recorded simulations. After pinpointing where a desync occurs you can debug that simulation step by simply stepping through your code. Additionally, you should write a unit test to check for that particular case.
If that's too much (would really recommend it though), simply having the tools to record incoming and outgoing messages and being able to visually inspect them on a timeline is very nice to have (for me that only took two days to implement).
Like Mussi said, you likely want to build a replay mode of some sort first. Record the input, replay it through the same simulation, and see whether it desyncs from the original run.
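To make the record-and-replay idea concrete, here is a minimal sketch. World, Command, and the FNV-1a checksum are hypothetical stand-ins for your own types, not anything from RakNet or the original poster's code. The idea is to log one checksum per simulation step on each run, then look for the first step where two logs disagree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical recorded command: "at step N, move unit U by dx".
struct Command { std::uint32_t step; std::int32_t unitId; std::int32_t dx; };

// Hypothetical toy world: one integer coordinate per unit.
struct World {
    std::vector<std::int32_t> unitX{0, 0, 0};

    void apply(const Command& c) {
        assert(c.unitId >= 0 && static_cast<std::size_t>(c.unitId) < unitX.size());
        unitX[static_cast<std::size_t>(c.unitId)] += c.dx;
    }
    void tick() { for (auto& x : unitX) x += 1; }  // deterministic per-step update

    // FNV-1a over the state bytes; any stable hash of the full state would do.
    std::uint32_t checksum() const {
        std::uint32_t h = 2166136261u;
        for (std::int32_t x : unitX) {
            auto v = static_cast<std::uint32_t>(x);
            for (int i = 0; i < 4; ++i) { h ^= (v >> (8 * i)) & 0xFFu; h *= 16777619u; }
        }
        return h;
    }
};

// Replays `commands` (sorted by step) and returns the checksum after each step.
std::vector<std::uint32_t> simulate(const std::vector<Command>& commands, std::uint32_t steps) {
    World w;
    std::vector<std::uint32_t> sums;
    std::size_t next = 0;
    for (std::uint32_t s = 0; s < steps; ++s) {
        while (next < commands.size() && commands[next].step == s) w.apply(commands[next++]);
        w.tick();
        sums.push_back(w.checksum());
    }
    return sums;
}

// Returns the first step where two checksum logs differ, or -1 if they match.
long firstDivergence(const std::vector<std::uint32_t>& a, const std::vector<std::uint32_t>& b) {
    std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) if (a[i] != b[i]) return static_cast<long>(i);
    return a.size() == b.size() ? -1 : static_cast<long>(n);
}
```

Once firstDivergence gives you a step number, you can replay up to that step under a debugger and step through the one update that went wrong.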
An additional technique I've used is to use #defines to pass the .cpp filename and line number (things like __FILE__, __LINE__, and __FUNCTION__) into each RNG call, and to write the result plus the place it was called from out to a file.
Edit: Also, use a diff tool on these outputs to find the deviations.
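A rough sketch of what that logging could look like. LoggedRng and SIM_RAND are names I made up for illustration, and I use the standard __func__ instead of the compiler-specific __FUNCTION__. Each draw writes a call counter, the call site, and the value to whatever stream you hand in.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <random>
#include <sstream>  // std::ostringstream is handy as an in-memory log sink in tests

// Hypothetical simulation RNG wrapper that logs every draw with its call site.
class LoggedRng {
public:
    explicit LoggedRng(std::uint32_t seed, std::ostream* log = nullptr)
        : engine_(seed), log_(log) {}

    std::uint32_t next(const char* file, int line, const char* func) {
        assert(file != nullptr && func != nullptr);
        std::uint32_t value = static_cast<std::uint32_t>(engine_());
        ++calls_;
        // One line per draw: "<call#> <file>:<line> <function> <value>".
        if (log_) *log_ << calls_ << ' ' << file << ':' << line << ' ' << func << ' ' << value << '\n';
        return value;
    }

private:
    std::mt19937 engine_;
    std::ostream* log_;
    std::uint64_t calls_ = 0;
};

// Captures the call site automatically at every use.
#define SIM_RAND(rng) ((rng).next(__FILE__, __LINE__, __func__))
```

In the game you would pass a std::ofstream per client and diff the files afterwards. The first line that differs tells you which call site drew a number on one client but not on the other. Note that __FILE__ can expand to different paths on different build machines, so diff logs produced by the same build.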
First: Make sure you use different random number generators for game simulation versus presentation (graphics, sound, etc.). Note that some libraries you call into may use the system rand() function, so don't rely on such libraries for synchronization, and don't use system rand() yourself. Instead, download and use a high-quality random number generator (such as the Mersenne Twister) and seed it consistently. Then create an instance of it, call it "gSimulationRandom" or something similar, and make sure that you use that instance for all simulation random numbers. Also, it's important that all players and entities are updated in the same order on all clients!
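As a sketch of the dedicated simulation generator, assuming C++11's std::mt19937 as the Mersenne Twister (the SimulationRandom class and its method names are mine, for illustration only). The engine's output sequence for a given seed is fixed by the C++ standard, but the standard distribution classes such as std::uniform_int_distribution may produce different values on different standard libraries, so this sketch maps to a range by hand.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical wrapper: the only source of randomness the simulation may touch.
class SimulationRandom {
public:
    explicit SimulationRandom(std::uint32_t seed) : engine_(seed) {}

    // Raw 32-bit draw from the Mersenne Twister.
    std::uint32_t nextU32() { return static_cast<std::uint32_t>(engine_()); }

    // Integer in [0, bound). Multiply-shift mapping: integer-only and identical
    // on every platform, at the cost of a very slight bias for large bounds.
    std::uint32_t nextBelow(std::uint32_t bound) {
        assert(bound > 0);
        return static_cast<std::uint32_t>(
            (static_cast<std::uint64_t>(nextU32()) * bound) >> 32);
    }

private:
    std::mt19937 engine_;
};
```

You would create one instance (the "gSimulationRandom" above) from the seed agreed at game start, and give graphics and sound their own separate generator so that presentation code can never advance the simulation's sequence.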
Second: What they said: You absolutely want to keep a recording of the entire game, and build tools that let you play back the recording to recover the simulation state at time T. Then you can start debugging what might happen during the simulation step when things go wrong.
Third: Separate simulation and presentation through an iron-clad interface. Push commands into the simulation, and read entity states out of the simulation, but do not share any data structures between the two. Yes, this means you have to copy some data around, but the benefit is that you know that there is no cross-talk. You might also want to build a version of the game that has no graphics at all (do not link in the graphics code) and runs only the simulation based on input commands. This will let you write unit/acceptance tests for your simulation code, and may help with debugging simulations.
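One possible shape for that interface, as a sketch only (Simulation, MoveCommand, and EntityState are hypothetical names). Commands go in by value, state comes out as a copy, and nothing mutable is shared with the presentation side.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical command pushed into the simulation.
struct MoveCommand { std::int32_t entityId; std::int32_t dx; std::int32_t dy; };

// Hypothetical read-only view of an entity, copied out for rendering.
struct EntityState { std::int32_t id; std::int32_t x; std::int32_t y; };

class Simulation {
public:
    explicit Simulation(std::int32_t entityCount) {
        assert(entityCount >= 0);
        for (std::int32_t i = 0; i < entityCount; ++i) entities_.push_back({i, 0, 0});
    }

    // Input side: commands are queued and applied on the next step.
    void pushCommand(const MoveCommand& c) { pending_.push_back(c); }

    // Advances one step, applying queued commands in arrival order.
    void step() {
        for (const MoveCommand& c : pending_)
            for (EntityState& e : entities_)
                if (e.id == c.entityId) { e.x += c.dx; e.y += c.dy; }
        pending_.clear();
    }

    // Output side: returns a copy, so the renderer cannot modify simulation state.
    std::vector<EntityState> snapshot() const { return entities_; }

private:
    std::vector<EntityState> entities_;
    std::vector<MoveCommand> pending_;
};
```

Because this class has no dependency on graphics code, it can be linked into a headless test binary and driven purely from recorded commands.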
Fourth: If you use floating-point, make sure to control for rounding mode, internal precision (turn off 80-bit precision on Intel CPUs!) and other such things to make sure the floating point math comes out the same across all CPUs.
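A small sketch of the portable part of this, using only standard <cfenv> and <cfloat> facilities (the two function names are made up). It pins the rounding mode at startup and lets you check how the compiler evaluates intermediate results. Actually turning off x87 extended precision is compiler-specific and not shown here; on GCC for 32-bit x86, for example, building with -msse2 -mfpmath=sse is one common way to avoid it.

```cpp
#include <cassert>
#include <cfenv>
#include <cfloat>

// Call once at startup, and again after any library call that might change
// the floating-point environment: forces round-to-nearest.
void configureSimulationFloatingPoint() {
    int rc = std::fesetround(FE_TONEAREST);
    assert(rc == 0 && "could not set rounding mode");
    (void)rc;  // silence the unused-variable warning when asserts are disabled
}

// FLT_EVAL_METHOD == 0 means float and double expressions are evaluated in
// their declared precision. A value of 2 means intermediates use long double,
// which is the classic x87 80-bit case that breaks cross-machine determinism.
bool evaluatesInDeclaredPrecision() {
    return FLT_EVAL_METHOD == 0;
}
```

Logging evaluatesInDeclaredPrecision() and std::fegetround() at the start of each game is a cheap way to spot a client whose build or environment differs from the others.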
Back in the day I resorted to writing my own protocol to ensure no lost packets. It could work standalone, or as data embedded in a standard transfer protocol. You could unplug the PC from the network, and the game would re-sync automatically when you reconnected. When it comes to networked computing, nothing beats having a leak-proof pipe to send packets through.
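For illustration, the receiving half of such a "leak-proof pipe" could look roughly like this. It is a sketch with a made-up OrderedReceiver class, not the poster's actual protocol: every packet carries a sequence number, out-of-order packets are buffered, duplicates are dropped, and payloads are only released in order. Retransmission on the sender side and sequence-number wraparound are left out. RakNet already offers this kind of delivery through its RELIABLE_ORDERED packet reliability, so you may not need to write your own.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical receiver that delivers payloads strictly in sequence order.
class OrderedReceiver {
public:
    // Accepts one packet; returns every payload that is now deliverable in order.
    std::vector<std::string> receive(std::uint32_t seq, const std::string& payload) {
        std::vector<std::string> ready;
        if (seq < nextExpected_) return ready;  // duplicate of something already delivered
        buffered_.emplace(seq, payload);        // no-op if this seq is already buffered
        auto it = buffered_.begin();
        while (it != buffered_.end() && it->first == nextExpected_) {
            ready.push_back(it->second);
            it = buffered_.erase(it);
            ++nextExpected_;
        }
        return ready;
    }

    // Next sequence number wanted; can be sent back to the sender as a cumulative ack.
    std::uint32_t nextExpected() const { return nextExpected_; }

private:
    std::uint32_t nextExpected_ = 0;
    std::map<std::uint32_t, std::string> buffered_;
};
```

After a disconnect, the sender can resend everything from nextExpected() onward, which is what lets the game catch up once the cable is plugged back in.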