Debugging Multiplayer Games

Introduction

Hi, I’m Glenn Fiedler and welcome to the latest article in Networking for Game Programmers.

I’m a professional game developer and I’ve been developing multiplayer games for a few years now, but back when I first moved from singleplayer to multiplayer games, the first thing I noticed was, oh shit – suddenly it’s really hard to debug stuff.

Now I’ve worked on three different teams developing multiplayer games, and talked with programmers working on many others. Whenever the discussion turns to debugging it seems that there are a lot of common techniques being used by multiplayer teams, but 1) there really doesn’t seem to be anywhere that these techniques are documented, and 2) it’s rare for any one team to be using the full range of techniques available.

So I thought it would be useful to write an article about multiplayer debugging. Hopefully I can give you some ideas and some useful techniques you can use in your own projects.

1. Synchronous Debugging

Debugging multiplayer games. Not so hard right? I’ll just run the game in the debugger and hit break… wait, hold on… why is the other computer timing out? Crap.

This pretty much sums up my first few attempts at debugging multiplayer games.

The problem is that you have two or more asynchronous processes running on different machines. If you break into your game process on one machine you’re going to time out your connection with the other players. I don’t know about you, but I find debugging hard enough without having a time limit of 30 seconds to work out what’s going wrong.

You could try manually breaking into multiple instances of the game in the debugger simultaneously, but it’s fiddly as hell with just two players and it’s pretty hard to see how you could scale it up to any more without going completely nuts. A much better option is to add a synchronous debugging mode that mimics the debugger behavior you expect in singleplayer.

The trick is to send out a heartbeat packet every frame, then on each machine keep an accumulator for the amount of time that has passed since you received a heartbeat from each player. Then if any of the accumulators exceeds some threshold – like 1/10th of a second – you block until you receive a heartbeat from that player.

Here’s how it looks in code:
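What follows is a minimal sketch of the idea rather than production code. The names `HeartbeatTracker` and `WaitForHeartbeats`, and the two callbacks standing in for your own network layer, are assumptions made for illustration. A real implementation would also sleep briefly inside the wait loop instead of spinning.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one accumulator per remote player.
class HeartbeatTracker
{
public:
    explicit HeartbeatTracker( int numPlayers, float waitThreshold = 0.1f )
        : timeSinceLastHeartbeat( static_cast<std::size_t>( numPlayers ), 0.0f ),
          threshold( waitThreshold ) {}

    // Call once per frame with that frame's delta time.
    void Advance( float deltaTime )
    {
        for ( std::size_t i = 0; i < timeSinceLastHeartbeat.size(); ++i )
            timeSinceLastHeartbeat[i] += deltaTime;
    }

    // Call whenever a heartbeat packet arrives from a player.
    void OnHeartbeat( int player )
    {
        timeSinceLastHeartbeat[static_cast<std::size_t>( player )] = 0.0f;
    }

    // True if any player has been silent for longer than the threshold.
    bool MustWait() const
    {
        for ( float t : timeSinceLastHeartbeat )
            if ( t > threshold )
                return true;
        return false;
    }

private:
    std::vector<float> timeSinceLastHeartbeat;
    float threshold;
};

// Blocks (by polling) until no player is over the threshold.
// sendHeartbeat and pollHeartbeats are assumed hooks into your network code.
// Returns the number of loop iterations spent waiting.
template <typename Send, typename Poll>
int WaitForHeartbeats( HeartbeatTracker & tracker, Send sendHeartbeat, Poll pollHeartbeats )
{
    int iterations = 0;
    while ( tracker.MustWait() )
    {
        sendHeartbeat();            // keep sending while we wait
        pollHeartbeats( tracker );  // expected to call tracker.OnHeartbeat( i )
        ++iterations;
    }
    return iterations;
}
```

Each frame you would send a heartbeat, call `Advance` with the frame time, feed any received heartbeats into `OnHeartbeat`, and then call `WaitForHeartbeats` before updating the game.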

The key implementation detail is that heartbeat packets are still sent out while waiting for a heartbeat. This is incredibly important. If you don’t do this, a player waiting on another player causes yet another player to wait, and so on, resulting in a cascading deadlock!

When you apply this technique to your own project, you should notice that breaking in on one machine makes the other machines pause 1/10th of a second later. Hit run and the game resumes for all players. It feels almost like singleplayer.

Phew! We can actually debug again. Well, almost…

Like all techniques this one has its shortcomings: if any player encounters a framerate spike greater than 1/10th of a second the extra delay will ripple across all players in the game, and if any player crashes or asserts out they’ll bring down the entire session. So you probably don’t want to have this on by default. Just add a debug flag or command line argument so you can turn it on when you are debugging.

2. Crash Dumps and Info Stacks

As programmers, whenever we run the game we can just drop in and inspect it in the debugger. The problem is that when other folks like designers and artists run the game they usually do it stand-alone, and they may not even have the development tools installed on their machine.

At this point it’s absolutely essential to have a good crash reporter.

If you work at an established studio then you probably already have something like this. At a minimum you need some way to see the executable version and callstack when the game crashes so you know where to start debugging the issue on your machine, and it’s a pretty good idea to dump the crash report to both the TTY and the display so you can see at a glance what happened when looking at a designer’s machine.

Now the details of developing your own crash reporter are platform specific and well worth an article in itself. But, if you are developing on the PC then here is a great place to start. Those on macOS are in luck because it has a pretty good crash reporter built in, and of course Unix-based operating systems emit core dumps which you can debug with GDB. It’s also good practice to develop your own custom assert macro and hook this up to your crash reporter.
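As a rough sketch of what a custom assert macro hooked up to a crash reporter might look like: `GAME_ASSERT`, `AssertHandler` and `CurrentAssertHandler` are hypothetical names, and the default handler here just prints and aborts where a real one would call into your crash reporter.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical hook: the game installs its crash reporter here at startup.
using AssertHandler = void (*)( const char * condition, const char * file, int line );

// Fallback used until a handler is installed: print to stderr and abort.
inline void DefaultAssertHandler( const char * condition, const char * file, int line )
{
    std::fprintf( stderr, "assert failed: %s (%s:%d)\n", condition, file, line );
    std::abort();
}

// Returns a reference so callers can both read and replace the handler.
inline AssertHandler & CurrentAssertHandler()
{
    static AssertHandler handler = DefaultAssertHandler;
    return handler;
}

// Evaluates the condition once and forwards failures to the current handler.
#define GAME_ASSERT( condition )                                          \
    do                                                                    \
    {                                                                     \
        if ( !( condition ) )                                             \
            CurrentAssertHandler()( #condition, __FILE__, __LINE__ );     \
    } while ( 0 )
```

Routing every assert through one function pointer means the same report (version, callstack, and anything else you collect) gets produced whether the game asserts or crashes.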

One really easy way to improve the quality of your crash reporting is to add an info stack. An info stack is just a stack of user-defined strings which you push and pop as you enter and exit functions in your game. The idea is to provide additional context when you crash, without having to log every single operation while the game is running normally.

So now instead of just having a callstack that tells you, yes indeed, I crashed inside some sub-function while processing an image, you also have the info stack telling you that you crashed while processing image X, where X is the filename of that image. When debugging issues which depend on designer or artist supplied data, or alternatively, in our situation, incoming packet data, context is everything and the info stack is worth its weight in gold.
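Here is one way an info stack could be sketched, using a scope guard so the push and pop can’t get out of step. `InfoStack`, `InfoScope` and `ProcessImage` are made-up names for illustration, and a real crash handler would more likely read from fixed-size buffers than from `std::string`, since allocating inside a crash handler is risky.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical info stack: context strings a crash reporter can dump.
class InfoStack
{
public:
    // One stack per thread, so worker threads keep their own context.
    static std::vector<std::string> & Entries()
    {
        static thread_local std::vector<std::string> entries;
        return entries;
    }

    // Joins the entries, outermost first, for inclusion in a crash report.
    static std::string Dump()
    {
        std::string out;
        for ( const std::string & entry : Entries() )
        {
            if ( !out.empty() )
                out += " > ";
            out += entry;
        }
        return out;
    }
};

// Scope guard: pushes on construction, pops when the scope exits.
class InfoScope
{
public:
    explicit InfoScope( std::string info ) { InfoStack::Entries().push_back( std::move( info ) ); }
    ~InfoScope() { InfoStack::Entries().pop_back(); }
    InfoScope( const InfoScope & ) = delete;
    InfoScope & operator=( const InfoScope & ) = delete;
};

// Example function: returns the dump at the point where a crash would be reported.
std::string ProcessImage( const std::string & filename )
{
    InfoScope scope( "processing image " + filename );
    return InfoStack::Dump();
}
```

The same guard works for packet handling: push something like the sender and packet type before calling into the code that processes the packet.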

Now all of this may not seem like much, but once you have detailed crash reports and info stacks something changes: instead of looking with dread at the never-ending stream of “the game crashed when I did X” bug reports, you start to look at every time the game is played in multiplayer as a delicious, sexy way to generate crash reports which you can analyze and fix. It’s almost fun. I really mean it. OK. Maybe it gets really boring after a while, but I swear it’s fun for at least the first few crashes.

Now that you have a good quality crash reporter, common things to do at this point include setting up regular team multiplayer matches in the afternoon where your team can try out all the latest stuff… and then crash it.

Another great idea is setting up client-side bots so you can stress test your gameplay code overnight, ideally flushing out any instability and crashes that you added the day before.

Think about it: if a designer hits a bug or error in your multiplayer code, sure you can fix it, but in one sense you’ve already failed. It’s much better to spot that error early and fix it before anybody sees it. Automated testing combined with high quality, detailed crash reporting is one way to reach this goal; unit tests and functional tests are another.

3. Journal Recording and Playback

Crash dumps are great but sometimes no matter how hard you try, they just don’t provide enough information to work out exactly what went wrong.

Maybe the error is spread across a few frames and the crash is just the end result of something going wrong two frames earlier. Sometimes you look at the crash report and think to yourself “there is simply no way this could happen!”.

Other times the stack is corrupted and the crash report is just garbage data. And every now and then you get one of those evil, hard to reproduce errors: the ones that occur so infrequently you only see them once every few weeks, and worst of all, no matter how hard you try, you cannot reproduce them on your machine.

How exactly can you go about fixing this sort of bug?

The typical approach is to take a guess: add some logs and some asserts, get a new build out then wait another few days until somebody else reproduces the bug. If your guess was right then hopefully you have some more information in the crash dump. If you guessed wrong you repeat the process adding even more asserts and logs. With some brainpower and good luck you’ll eventually converge on the bug, but it can take a really long time going back and forth.

Wouldn’t it be nice if instead you could just record the session of somebody playing the game right up to the point where they assert or crash, then play this recording back in your debugger?

It turns out you can.

This technique is called journaling. The idea is to record all sources of non-deterministic behavior, then during playback substitute the recorded values for the actual values in-game, so that the game session plays back identically in the debugger.

Typically, you record things like frame times, player inputs, random number seeds and return codes from APIs – you can even record the set of sent and received packets. Yes, using this technique you can actually make a recording of a client or server in a 32 player game, then play it back in the debugger, with no networking required. Now that’s debugging!

Take a look at this typical main loop for a multiplayer game:

while ( true )
{
    SendPackets();

    while ( true )
    {
        int packetSize = 0;
        unsigned char packet[1024];
        if ( !ReceivePacket( packet, packetSize ) )
            break;
        assert( packetSize > 0 );
        ProcessPacket( packet, packetSize );
    }

    float frameTime = Timer::getFrameTime();
    GameUpdate( frameTime );
}

Now let’s add some journaling functions. Note that these functions operate differently in recording and playback mode. When recording, the values you pass in are written to the journal and flushed to disk. In playback, the data is read from the journal and replaces the value of the variable passed in.

void journal_int( int & value );
void journal_bool( bool & value );
void journal_float( float & value );
void journal_bytes( unsigned char * data, 
                    int & size );
bool journal_playback();
bool journal_recording();
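To make the two modes concrete, here is a minimal sketch of how these functions might be implemented. It journals into an in-memory buffer to stay self-contained, where a real journal would write to a file and flush it. The `Journal` struct, `JournalMode` enum, `journal()` accessor and `journal_raw` helper are assumptions, not part of the interface above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

enum class JournalMode { Off, Recording, Playback };

// Hypothetical journal state; a real one would stream to disk.
struct Journal
{
    JournalMode mode = JournalMode::Off;
    std::vector<unsigned char> data;
    std::size_t readOffset = 0;
};

inline Journal & journal() { static Journal j; return j; }

bool journal_playback()  { return journal().mode == JournalMode::Playback; }
bool journal_recording() { return journal().mode == JournalMode::Recording; }

// Recording: append the raw bytes of the value to the journal.
// Playback: overwrite the value with the next recorded bytes.
static void journal_raw( void * value, std::size_t size )
{
    Journal & j = journal();
    if ( size == 0 )
        return;
    if ( j.mode == JournalMode::Recording )
    {
        const unsigned char * bytes = static_cast<const unsigned char *>( value );
        j.data.insert( j.data.end(), bytes, bytes + size );
    }
    else if ( j.mode == JournalMode::Playback )
    {
        // Running off the end means the journal is truncated or out of sync.
        assert( j.readOffset + size <= j.data.size() );
        std::memcpy( value, j.data.data() + j.readOffset, size );
        j.readOffset += size;
    }
}

void journal_int( int & value )     { journal_raw( &value, sizeof( value ) ); }
void journal_bool( bool & value )   { journal_raw( &value, sizeof( value ) ); }
void journal_float( float & value ) { journal_raw( &value, sizeof( value ) ); }

// The size is journaled first so playback knows how many bytes to restore.
// The caller's buffer must be large enough for the recorded size.
void journal_bytes( unsigned char * data, int & size )
{
    journal_int( size );
    journal_raw( data, static_cast<std::size_t>( size ) );
}
```

Because every call either appends to or consumes from the same byte stream, the calls made during playback have to match the calls made during recording exactly, in the same order.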

Here is how to use these functions to journal the game loop:

while ( true )
{
    SendPackets();

    while ( true )
    {
        int packetSize = 0;
        unsigned char packet[1024];
        bool result = false;
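        // In playback there is no network: the result comes from the journal.
        if ( !journal_playback() )
            result = ReceivePacket( packet, packetSize );
        journal_bool( result );
        if ( !result )
            break;
        // Record (or restore) the packet bytes and size.
        journal_bytes( packet, packetSize );
        assert( packetSize > 0 );
        ProcessPacket( packet, packetSize );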
    }

    float frameTime = Timer::getFrameTime();
    journal_float( frameTime );
    GameUpdate( frameTime );
}

Notice how the receive packet function is not even called in journal playback mode. Instead, we record its result code and the packet data read in. This way when we play back the journal, we actually play back the exact set of packets received during the recorded session. This is how journaling is able to play back a multiplayer session in the debugger without any networking at all.

So you can see the value of journaling. It lets you replay your game session in the debugger from the start right up to where the bug occurs. Not only can you reproduce, at will, any bug that you can record, but if you can make a small change to fix the bug that does not break deterministic playback, you can actually re-run the journal and verify that the bug is fixed.

Without any doubt, journaling is the most powerful multiplayer debugging technique out there. When it works, that is. And unfortunately, that can be not very often. You see, journaling can be very high maintenance. Why? Well, you must take care to ensure the exact same set of reads and writes occur during journal record and playback, or your playback gets out of sync.

In other words, you need to make sure that you record every single source of non-determinism in your application in order for the journal to play back correctly. If you don’t, you’ll end up recording checkpoint values in the journal and binary searching to identify the place where you incorrectly branched left when you should have gone right. Anybody who’s developed a multiplayer game based on a deterministic networking model knows exactly how frustrating and time consuming this can be.
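Checkpoint values can be sketched like this: hash some game state at a known point each frame, store the hash while recording, and compare against it in playback. `CheckpointLog` and `HashBytes` are hypothetical names, and the log is kept separate from the main journal only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical checkpoint log for finding where playback diverges.
struct CheckpointLog
{
    bool playback = false;
    std::vector<std::uint32_t> values;
    std::size_t next = 0;
    long firstMismatch = -1;   // index of the first diverging checkpoint, or -1

    // Recording: store the value. Playback: compare against the stored one.
    void Checkpoint( std::uint32_t value )
    {
        if ( !playback )
        {
            values.push_back( value );
            return;
        }
        const bool matches = next < values.size() && values[next] == value;
        if ( !matches && firstMismatch < 0 )
            firstMismatch = static_cast<long>( next );
        ++next;
    }
};

// 32-bit FNV-1a hash, used to squash a block of state into one value.
inline std::uint32_t HashBytes( const void * data, std::size_t size )
{
    const unsigned char * bytes = static_cast<const unsigned char *>( data );
    std::uint32_t hash = 2166136261u;
    for ( std::size_t i = 0; i < size; ++i )
    {
        hash ^= bytes[i];
        hash *= 16777619u;
    }
    return hash;
}
```

Once you know the first checkpoint that mismatches, you add more checkpoints between it and the last one that matched, and repeat until the gap is small enough to step through.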

Another weakness with journaling is that if your game is multithreaded it can be very difficult or even impossible to journal correctly. In the case of multithreading you could try keeping all deterministic code on one main simulation thread, farming tasks that introduce non-determinism out to worker threads, and journaling their results when they come back in. Similarly, for asynchronous IO you can journal each asset as it becomes ready during recording, and during playback block until the IO completes. But in both cases note that you are going to get some sort of performance degradation during playback, and quite possibly also during recording. On top of this it’s a pretty big pain in the ass to maintain this and debug it whenever it breaks.

So journaling, it’s an incredibly powerful and fantastic debugging technology when it works but proceed with caution! Make sure you have all main operations in a single thread and you are sure you have a credible strategy for journaling each system before you start down this road.



18 thoughts on “Debugging Multiplayer Games”

  1. My favorite “caught by a journal bug” – game was up and on the same tick that the mission lobby ends and it switches to the next mission, everyone on the server turns on their voice chat, it would crash.

    We only caught that one by virtue of long periods of stress testing + journalling. And even then it only happened once.

    Are you going to cover fuzz input testing? (ie, having a process that simulates banging on the keys for extended periods?) I’ve found that that is quite useful and a low-effort way to improve stability.

  2. timeSinceLastHeartbeat += deltaTime; -> timeSinceLastHeartbeat[i] += deltaTime;
    at the end in the first code snippet

  3. I thought you had gone on the run from the mad stalker :)

    “Notice how the socket receive function is not even called in journal playback mode”
    I realise this is example code, yet for the benefit of other readers: a good idea is to abstract the sending and receiving out to a router interface. These routers are created by a factory method, and when a journal is required a journalling router is attached; when not, a normal router is. I feel it is good design (one of the SOLID principles) and it also makes the same (or nearly the same) code path used for single and multiplayer versions. When there is to be no network communication a “single instance” router is used to route the messages, and similarly in replay mode the messages are read from a stream and passed off.
    Using the same code path is something which Dave Weinstein talked about in a past GDC lecture which I would recommend to anyone thinking of creating a networked game. He does make the point that using journalling makes you attractive to the armed forces; yet why you would want this I have no idea :)

  4. I’d also add a ‘debugging’ flag for each client, based on the result of IsDebuggerPresent() (PC), and pass that flag to all other players. If you have a client that does not respond for a lengthy period of time, but has a debugger attached to him, you can assume he’s hit a breakpoint and you should not disconnect him.

    If he has no debugger and does not send heartbeats for 30 seconds, he’s ready for a good kicking.

  5. interesting article

    just have one thing to say about “Synchronous Debugging”:

    you know, if you used blocking sockets you wouldn’t have to mess with the timeouts when breaking in the first place. and you could simply hit break as you intended. you could optimize later, when the game works.

    1. sure that’s one way to do it, but you’d find the game would hitch a bit more because it has to wait to receive a packet every frame vs. being able to have an amount of time n where it would proceed without blocking

  6. Hi Glen,

    Thanks for this fantastic set of resources. I’m reading through this and it’s all great stuff. I know you’ve recently done work on the new multiplayer God of War game so congratulations on that :-) I just wondered if you can explain your process of sending animation data back to the client, both your approach to state based animations and perhaps a more complex example where animations could be fine grained and blended in from such systems as inverse kinematic controllers.

    Best regards

    Gary Paluk

    1. Generally speaking I split animation into two types:

      1. Physics driven animation, in which I drive animation locally on all machines from the local physics simulation, or interpolated/extrapolated variables. Basically, you make animation a *function* of the motion of the character, so you only have to synchronize that.

      2. Combat driven animation, in the context of God of War, this needed to be synchronized directly. You’ll probably break animation into channels and synchronize those channels to remote views, and perhaps (depending on anim type), synchronize the time t in the animation loosely (+/- some tolerance) so that other machines see the anim at the same time in the animation relative to other things happening.

      cheers

  7. Hi!

    I have a question about ‘heartbeat packets’. What happens if some computer crashes? All the other computers will wait forever :(

    And another doubt, you stop the game while debugging, is GetFrameTime returning a constant time, then?

  8. Hi Glenn,

    Great set of articles on multiplayer programming. Very well written.

    One way to make journalling work in a multi threaded system is to wrap each access to shared state between calls like journal_serialize_enter() and journal_serialize_leave(). When recording, serialize_enter grabs a value from an atomic counter, records it, and locks other threads out until serialize_leave. On playback serialize_enter() pulls its counter out of the journal and waits for the thread with the prior counter to finish.

    Hope all is well – Brad

    1. Thanks Brad! That’s a cool technique for journaling with multiple threads. It’s a bit hard to live without isn’t it?
