Chris Doran, co-author of Geometric Algebra for Physicists and founder/director of Geomerics has written an **excellent** blog post on the mathematics behind quaternion compression. His approach to quaternion compression provides significantly better compression than the delta smallest three approach described in my article.

**Link**: Quaternions, Rotations and Compression


Hey everybody, my GDC 2015 talk is now free to view in the GDC vault!

Unfortunately, the talk videos in the slides seem to have been recorded at 20fps.

You can get the original slides (with HD 60fps videos) here. I might see if I can take the audio track from the GDC vault and splice it in with the original HD videos in the keynote.

There is also an article series covering this same material in more detail here:

http://gafferongames.com/networked-physics/introduction-to-networked-physics/

cheers

Hey everybody, I have great news! My GDC 2013 talk “Virtual Go” is now free to watch in the GDC Vault.

Many thanks to Meggan Scavio for making this talk free for everybody to watch!

If you would like more detail on my project to simulate a go board and stones, there is an article series here, and some source code for this project available here on github.

I hope to return to this project one day so let me know if you are still interested in it. With the latest developments in Virtual Reality these days I think this project is as relevant as ever. Still massively niche of course, but how cool would it be to be able to play go with someone on the other side of the planet on a physically simulated go board inside Virtual Reality?

Some folks out there think they can do better than the compression described in the article. Now I don’t doubt for one second that better compression approaches exist, but exactly how much better is your encoding than the one described in the article? Don’t just tell me your encoding is better. PROVE IT.

It’s time for a good old fashioned walk-off.

Here are the terms of the competition. Old school rules.

If you can beat my best encoding over this dataset by 10% I’ll link to your blog post describing your compression technique from within the article. Your blog post must fully describe your encoding and include source code in the public domain that proves the result. Compression must be lossless.

The data set to be compressed is position and orientation data (smallest three) in integer format post-quantize, with a delta encoding baseline. Your job, should you choose to accept it, is to encode each frame snapshot into packets using the least number of bits.

When reporting your results please include the following: number of packets encoded, average packet size in bytes, average kbps across the entire data set. To calculate this add up the number of bytes for all of your packets (if you have encoded bits not bytes, please round up to the next byte per-packet as you can’t send fractional bytes over UDP). Divide the total bytes by the number of packets to get the average packet size in bytes. To calculate bytes per-second multiply average packet size by 60 (60 snapshots per-second). To convert bytes/sec to kbps multiply bytes per-second by 8 and divide by 1000 (*not* 1024!). My best result so far is below 256kbps.
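The reporting arithmetic above can be sketched in code (function names are my own):

```cpp
// A sketch of the reporting arithmetic above (my own function names).
double average_packet_size( long total_bytes, int num_packets )
{
    return total_bytes / (double) num_packets;
}

double snapshot_kbps( double average_packet_bytes )
{
    // 60 snapshots per-second, 8 bits per byte, divide by 1000 (not 1024!)
    return average_packet_bytes * 60.0 * 8.0 / 1000.0;
}
```

For example, an average packet size of 500 bytes works out to 240kbps, which would come in under the 256kbps mark.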

**IMPORTANT**: You __must__ encode and decode each frame individually into packets (e.g. 901 cubes at a time). You may not encode the entire data set at once. You cannot assume that all packets will be received by the other side. You cannot assume that all packets will be received in-order. You may not rely on information inferred from previously encoded/decoded packets when decoding a packet unless you embed that information in the packet or can infer it even in the presence of packet loss and out of order packets. You may not write any decompression scheme that relies on putting packets in a queue and processing them in-order, as this delays realtime delivery of snapshots and defeats the purpose. Compression and decompression must be plausibly realtime. Static dictionaries trained on the data set are acceptable, but you may not claim a win based on a dictionary trained on the same data set that is being compressed. Final results will be judged against an unreleased data set. If necessary, separate datasets for training and encoding will be provided for the final judging.

Data format is fixed records. 901 cubes per-frame. Frames 0..5 are duplicates of initial state. Start your encoding at frame 6 and encode frame n relative to frame n-6.

Each cube struct is as follows:

```cpp
struct DeltaData
{
    int orientation_largest;
    int orientation_a;
    int orientation_b;
    int orientation_c;
    int position_x;
    int position_y;
    int position_z;
    int interacting;
};
```

**Best of luck!**

Thanks to everybody who attended my talk today. I had a great time presenting for you.

The final slides for my talk are available here (850MB keynote. HD video)

I have also written an article series on this subject: Networked Physics



In the previous article we sent snapshots of the entire simulation state 10 times per-second over the network and interpolated between these snapshots to reconstruct a view of the simulation on the other side.

The problem with such a low snapshot rate is that interpolation between snapshots adds interpolation delay on top of network latency. At 10 snapshots per-second the minimum interpolation delay is 100ms and a more practical minimum considering network jitter is 150ms. If protection against one or two lost packets in a row is desired then this delay blows out to 250ms or 350ms.

This is not an acceptable amount of delay for most games. The only way to reduce it is to increase the snapshot send rate. Since many games update at 60fps, let’s try sending snapshots 60 times per-second instead of 10. Unfortunately this comes at the cost of increased bandwidth, not only because we’re sending the same amount of data more frequently, but also because each packet sent carries packet header overhead.

This may sound obvious, but at 60 packets per-second we send 6X as many UDP/IP packet headers as we do at 10 packets per-second. This creates a bandwidth floor that we cannot reduce below. I use a rule of thumb when calculating bandwidth that packet header overhead is around 32 bytes per-packet. This is not exact but it gives you an idea of the typical magnitude. Multiply this by 60 and you’ll see it’s not a trivial amount of bandwidth.

There’s nothing we can do about the packet header overhead, but we can optimize everything else in the packet. So what we’re going to do in this article is work through every possible bandwidth optimization (that I can think of, at least) until we get the bandwidth under control. For this application, let’s set a target bandwidth of **256 kilobits per-second**.

This may not seem like a lot of bandwidth, and perhaps your network connection can handle much more, but understand that when you are networking a video game or physics simulation your goal is to minimize latency and ensure good network conditions for the player. To achieve this it is necessary not to saturate the link, but instead to work within a very conservative amount of bandwidth that is unlikely to cause trouble.

Let’s look at how much bandwidth we’re using when sending uncompressed snapshots 60 times per-second:

Clearly this is not a conservative usage of bandwidth. Where is all this bandwidth coming from? The snapshot packet is just an array of 901 cubes and nothing else so clearly the cube data is the cause of the high bandwidth. What are we sending per-cube that’s so expensive?

Each cube has the following properties:

- quat orientation: **128 bits**
- vec3 linear_velocity: **96 bits**
- vec3 position: **96 bits**
- bool interacting: **1 bit**

That’s a total of 321 bits per-cube (40.125 bytes per-cube).

This may not seem like a lot, but let’s do the math. The scene has 901 cubes, so 901 * 40.125 = 36152.625 bytes of cube data per-snapshot. At 60 snapshots per-second that’s 36152.625 * 60 = 2169157.5 bytes per-second. Add in the packet header estimate: 2169157.5 + 32 * 60 = 2170957.5. Convert bytes per-second to megabits per-second: 2170957.5 * 8 / ( 1000 * 1000 ) = 17.37mbps. Close enough!

As you can see, it all adds up. Let’s get started by optimizing orientation because it’s the largest field (when optimizing bandwidth it’s most efficient to work in order of greatest to least potential gain where possible). Many people, when compressing a quaternion, think: “I know. I’ll just pack it into 8.8.8.8 with one 8 bit signed integer per-component!”. Sure, that works… but with a bit of math you can get much better accuracy with fewer bits using a trick called the “smallest three”.

Since we know the quaternion represents a rotation, its length must be 1, therefore x^2 + y^2 + z^2 + w^2 = 1. We can use this identity to drop one component and reconstruct it on the other side. For example, if you send x,y,z you can reconstruct w = sqrt( 1 - x^2 - y^2 - z^2 ). You might think you need to send a sign bit for w in case it is negative, but in fact you don’t, because you can make w always positive by negating the quaternion if w is negative (you can do this because in quaternion space (x,y,z,w) and (-x,-y,-z,-w) represent the same rotation).

You don’t want to always drop the same component due to numerical precision issues. What you want to do instead is find the component with the largest absolute value, encode its index using two bits in [0,3] (0=x, 1=y, 2=z, 3=w), then send that index followed by the three smallest components, omitting the largest one (hence the name).

One final improvement. If v is the absolute value of the largest quaternion component, the next largest possible component value occurs when two components have the same absolute value and the other two components are zero. The length of that quaternion (v,v,0,0) is 1, therefore v^2 + v^2 = 1, so 2v^2 = 1 and v = 1/sqrt(2). This means you get to encode the smallest three components in [-0.707107,+0.707107] instead of [-1,+1], giving you more precision with the same number of bits.

With this technique I’ve found that minimum sufficient precision for my simulation is 9 bits per-smallest component. This gives a result of 2 + 9 + 9 + 9 = 29 bits per-orientation (originally 128!).
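Putting the pieces above together, a minimal sketch of smallest three compression might look like this (my own function names; the article’s real implementation lives in its linked source code):

```cpp
#include <cmath>
#include <cstdint>

// A sketch of smallest three quaternion compression:
// 2 bits select the largest component, 9 bits each for the other three.

const float QUAT_BOUND = 0.70710678f;   // 1/sqrt(2)
const int QUAT_BITS = 9;
const int QUAT_MAX = ( 1 << QUAT_BITS ) - 1;

uint32_t quantize_smallest( float x )
{
    // map [-1/sqrt(2),+1/sqrt(2)] to [0,511]
    int value = (int) std::floor( ( x + QUAT_BOUND ) / ( 2.0f * QUAT_BOUND ) * QUAT_MAX + 0.5f );
    if ( value < 0 ) value = 0;
    if ( value > QUAT_MAX ) value = QUAT_MAX;
    return (uint32_t) value;
}

float dequantize_smallest( uint32_t value )
{
    return value / (float) QUAT_MAX * 2.0f * QUAT_BOUND - QUAT_BOUND;
}

uint32_t compress_quaternion( const float q[4] )   // uses 29 bits
{
    int largest = 0;
    for ( int i = 1; i < 4; ++i )
        if ( std::fabs( q[i] ) > std::fabs( q[largest] ) )
            largest = i;
    // negate so the omitted component is positive: (x,y,z,w) == (-x,-y,-z,-w)
    float sign = ( q[largest] < 0.0f ) ? -1.0f : 1.0f;
    uint32_t packed = (uint32_t) largest;
    for ( int i = 0; i < 4; ++i )
        if ( i != largest )
            packed = ( packed << QUAT_BITS ) | quantize_smallest( q[i] * sign );
    return packed;
}

void decompress_quaternion( uint32_t packed, float out[4] )
{
    const int largest = ( packed >> ( 3 * QUAT_BITS ) ) & 3;
    int shift = 2 * QUAT_BITS;
    float sum = 0.0f;
    for ( int i = 0; i < 4; ++i )
    {
        if ( i == largest ) continue;
        out[i] = dequantize_smallest( ( packed >> shift ) & QUAT_MAX );
        sum += out[i] * out[i];
        shift -= QUAT_BITS;
    }
    if ( sum > 1.0f ) sum = 1.0f;             // guard against quantization error
    out[largest] = std::sqrt( 1.0f - sum );   // reconstruct the dropped component
}
```

Note that the decompressed quaternion is always normalized by construction, since the dropped component is recomputed from the unit length identity.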

What should we optimize next? It’s a tie between linear velocity and position (96 bits).

In my experience position is the harder quantity to compress, so let’s start with linear velocity.

To compress linear velocity we first need to bound its components in some range so we don’t need to send full floats. I found that a maximum speed of 32 meters per-second is a nice power of two and doesn’t negatively affect the player experience in the cube simulation. Since we are really only using the linear velocity as a __hint__ to improve interpolation between position sample points we can be pretty rough with compression. I found that 32 distinct values per meter per-second provides acceptable precision.

Linear velocity has been bounded and quantized and is now three integers in the range [-1024,1023]: 6 bits for the integer part in [-32,+31] plus 5 bits of fractional precision. I hate messing around with sign bits so I just add 1024 to get the value in the range [0,2047] and send that instead. To decode on receive, just subtract 1024 to get back to the signed integer range before converting to float.

11 bits per-component gives 33 bits total per-linear velocity. Just over 1/3 the original uncompressed size!
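As a sketch, bounding and quantizing one velocity component with the 1024 bias trick might look like this (hypothetical function names):

```cpp
#include <cmath>

// Clamp to [-1024,+1023] after scaling by 32 steps per meter per-second,
// then bias by 1024 so no sign bit is needed. The result fits in 11 bits.
int compress_velocity( float v )
{
    int quantized = (int) std::floor( v * 32.0f + 0.5f );
    if ( quantized < -1024 ) quantized = -1024;
    if ( quantized > +1023 ) quantized = +1023;
    return quantized + 1024;            // biased into [0,2047]
}

float decompress_velocity( int value )
{
    return ( value - 1024 ) / 32.0f;    // back to meters per-second
}
```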

We can do better because most cubes are stationary. To take advantage of this, write a single “at rest” bit. If this bit is 1 the velocity is known to be zero and is not sent. Otherwise, the compressed velocity (33 bits) follows the bit. Cubes at rest now cost just 127 bits, while moving cubes cost one bit more than they previously did: 159 + 1 = 160 bits.

But why are we sending linear velocity at all? In the previous article we decided to send it because it significantly improved the quality of interpolation at 10 snapshots per-second. But, now that we’re sending 60 snapshots per-second is it still necessary? As you can see below the answer is __no__. Linear interpolation is good enough at high send rates.

Now we have only position left to compress. We’ll use the same trick we used for linear velocity: bound and quantize. Most game worlds are reasonably big, so I chose a position bound of [-256,+255] meters in the horizontal plane (xy). Since in my cube simulation the floor is at z=0, I chose a range of [0,32] meters for z.

Now we need to work out how much precision is required. With some experimentation I found that 512 values per-meter (roughly 0.5mm precision) provides sufficient precision. This gives position x and y components in [-131072,+131071] and z components in range [0,16383]. That’s 18 bits for x, 18 bits for y and 14 bits for z giving a total of 50 bits per-position (originally 96).
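A minimal sketch of this position quantization (my own names; the clamping behavior is an assumption):

```cpp
#include <cmath>

// 512 values per meter (~0.5mm), xy bounded to [-256,+255] meters
// (18 bits each) and z bounded to [0,32] meters (14 bits).
int quantize_position_xy( float v )
{
    int q = (int) std::floor( v * 512.0f + 0.5f );
    if ( q < -131072 ) q = -131072;
    if ( q > +131071 ) q = +131071;
    return q;
}

int quantize_position_z( float z )
{
    int q = (int) std::floor( z * 512.0f + 0.5f );
    if ( q < 0 ) q = 0;
    if ( q > 16383 ) q = 16383;
    return q;
}
```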

This reduces our cube state to 80 bits, or just 10 bytes per-cube (4X improvement. Originally ~40 bytes per-cube).

Now that we’ve compressed position and orientation we’ve run out of simple compressions by reducing the precision of values we are sending. Any further reduction in precision results in unacceptable artifacts.

Can we optimize further?

The answer is yes, but only if we embrace a completely new technique: **delta compression**.

Delta compression sounds mysterious. Magical. Hard. Actually, it’s not hard at all. Here’s how it works: the left side sends packets to the right side like this: “This is snapshot 110, encoded relative to snapshot 100”. The snapshot being encoded relative to is called the baseline. How you do this encoding is up to you, and there are many fancy tricks, but the basic order of magnitude win comes when you can say: “Cube n in snapshot 110 is the same as the baseline. One bit: not changed!”.

To implement delta encoding it is of course essential that the sender only encodes snapshots relative to baselines it knows the other side has received, otherwise the receiver cannot decode the snapshot. Therefore, to handle packet loss the receiver continually sends “ack” packets back to the sender saying: “the most recent snapshot I have received is snapshot n”. The sender takes this most recent ack and, if it is more recent than the previous ack, updates the baseline snapshot to this value. The next time a packet is sent out, the snapshot is encoded relative to this more recent baseline. This process happens continuously, such that in the steady state the sender encodes snapshots relative to a baseline that is roughly RTT (round trip time) in the past.

There is one slight wrinkle: for one round trip time past the initial connection the sender doesn’t have any baseline to encode against because it hasn’t received an ack from the receiver yet. I handle this by adding a single flag to the packet that says: “this snapshot is encoded relative to the initial state of the simulation”, which is known on both sides. Another option, if the receiver doesn’t know the initial state, is to send the initial state down a non-delta encoded path, e.g. as one large data block. Once that data block has been received, delta encoded snapshots are sent relative to the initial baseline in the data block, eventually converging to the steady state of baselines roughly RTT in the past.
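The ack-driven baseline update on the sender can be sketched like this (hypothetical names; the 16 bit sequence comparison uses serial arithmetic to handle wrap-around):

```cpp
#include <cstdint>

// Compare 16 bit sequence numbers, treating wrap-around correctly.
bool sequence_greater_than( uint16_t s1, uint16_t s2 )
{
    return ( ( s1 > s2 ) && ( s1 - s2 <= 32768 ) ) ||
           ( ( s1 < s2 ) && ( s2 - s1 >  32768 ) );
}

struct SnapshotSender
{
    uint16_t baseline = 0;       // most recent acked snapshot sequence
    bool received_ack = false;   // until the first ack, encode vs. initial state

    void on_ack( uint16_t ack )
    {
        // only ever move the baseline forward
        if ( !received_ack || sequence_greater_than( ack, baseline ) )
        {
            baseline = ack;
            received_ack = true;
        }
    }
};
```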

As you can see above this is a big win. We can refine this approach and lock in more gains, but we’re not going to get another order of magnitude improvement past this point. From now on we’re going to have to work pretty hard to get a number of small, cumulative gains to reach our goal of 256 kilobits per-second.

First small improvement. Each cube that isn’t sent costs 1 bit (not changed). There are 901 cubes so we send 901 bits in each packet even if no cubes have changed. At 60 packets per-second this adds up to 54kbps of bandwidth. Since there are usually significantly fewer than 901 changed cubes per-snapshot, we can reduce bandwidth by sending only changed cubes, each with a 10 bit index in [0,900] identifying which cube it is.

There is a cross-over point where it is actually more expensive to send indices than not-changed bits. With 10 bit indices, the cost of indexing n changed cubes is 10*n bits, therefore it’s more efficient to use indices only if we are sending 90 cubes or fewer (900 bits). We can evaluate this per-snapshot and send a single bit in the header indicating which encoding we are using: 0 = indexing, 1 = changed bits. This way we always use the most efficient encoding for the number of changed cubes in the snapshot.

This reduces the steady state bandwidth when all objects are stationary to around 15 kilobits per-second. This bandwidth is composed entirely of our own packet header (uint16 sequence, uint16 base, bool initial) plus IP and UDP headers (28 bytes).

Next small gain. What if we encoded each cube index relative to the previous one? Since we iterate across changed cubes and send their indices in-order: cube 0, cube 10, cube 11, 50, 52, 55 and so on, we can encode the 2nd and remaining indices relative to the previous changed index, e.g.: +10, +1, +39, +2, +3. If we are smart about how we encode this index offset we should be able to represent a cube index with fewer than 10 bits on average.

The best encoding depends on the set of objects you interact with. If you spend a lot of time moving horizontally while blowing cubes out of the initial cube grid you hit lots of +1s. If you move vertically from the initial state you hit lots of +30s (sqrt(900)). What we need is a general purpose encoding capable of representing statistically common index offsets with fewer bits.

After a small amount of experimentation I came up with this simple encoding:

- [1,8] => 1 + 3 (4 bits)
- [9,40] => 1 + 1 + 5 (7 bits)
- [41,900] => 1 + 1 + 10 (12 bits)

Notice how large relative offsets are actually more expensive than 10 bits. It’s a statistical game. The bet is that we’re going to get a much larger number of small offsets so that the win there cancels out the increased cost of large offsets. It works. With this encoding I was able to get an average of 5.5 bits per-relative index.

Now we have a slight problem: we can no longer easily determine whether changed bits or relative indices are the best encoding. The solution I used is to run a mock encoding of all changed cubes on packet write, counting the number of bits required for relative indices. If the number of bits required is larger than 901, fall back to changed bits.
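As a sketch, the relative index cost and the per-snapshot decision might look like this (my own names; I assume the first changed index is sent absolute with 10 bits):

```cpp
// Bit cost of one relative index offset under the encoding above.
int relative_index_bits( int offset )
{
    if ( offset >= 1 && offset <= 8 )  return 1 + 3;        // 4 bits
    if ( offset >= 9 && offset <= 40 ) return 1 + 1 + 5;    // 7 bits
    return 1 + 1 + 10;                                      // 12 bits
}

// Mock encode: total bits for relative indices vs. the 901 changed bits.
bool use_relative_indices( const int changed[], int num_changed )
{
    int total_bits = 10;    // first changed index sent absolute (assumption)
    for ( int i = 1; i < num_changed; ++i )
        total_bits += relative_index_bits( changed[i] - changed[i-1] );
    return total_bits <= 901;   // otherwise changed bits are cheaper
}
```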

Next small improvement: encoding position relative to (as an offset from) the baseline position. There are a lot of different options here. You can just do the obvious thing, e.g. 1 bit for relative position, then 8-10 bits per-component if all components have deltas within the range provided by those bits, otherwise send the absolute position (50 bits).

This gives a decent encoding but we can do better. There will be situations where one position component delta is large but the others are small. It would be nice if we could take advantage of this and send the small components using fewer bits.

It’s a statistical game, and the best selection of small and large ranges per-component depends on the data set. I couldn’t really tell from a noisy bandwidth meter whether I was making any gains, so I captured the position vs. base position data and wrote it to a text file for analysis. The format is x,y,z,base_x,base_y,base_z with one cube per-line. The goal is to encode x,y,z relative to base x,y,z for each line. If you are interested, you can download this data set here.

I wrote a short ruby script to find the best encoding with a greedy search. The best bit-packed encoding I found for the data set works like this: 1 small/large bit per delta component, followed by 5 bits if small ([-16,+15] range), otherwise the delta component is in the [-256,+255] range and is sent with 9 bits. If any component delta is outside the large range, fall back to absolute position. Using this encoding I was able to obtain on average 26.1 bits for each changed position.
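As a sketch, the per-component cost of this small/large scheme (my own names; I assume a 5 bit small payload and a 9 bit large payload after the selector bit):

```cpp
// Per-component bit cost for delta encoded position.
int delta_position_component_bits( int delta )
{
    if ( delta >= -16 && delta <= +15 )   return 1 + 5;   // small range
    if ( delta >= -256 && delta <= +255 ) return 1 + 9;   // large range
    return -1;   // fall back to the 50 bit absolute position
}
```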

Next I figured that relative orientation would be a similarly easy big win. The problem is that, unlike position, where the range of the offset is quite small relative to the total position space, the change in orientation over 100ms is a much larger fraction of the total quaternion space.

I tried a bunch of stuff without good results. I tried encoding the 4D vector of the delta orientation directly, recomposing the largest component post-delta using the same trick as smallest three. I tried calculating the relative quaternion between the orientation and its base orientation; since I knew w would be large for this (a rotation close to identity) I could avoid sending 2 bits to identify the largest component, but in turn needed one bit for the sign of w because I didn’t want to negate the quaternion. The best compression I could find using this scheme was only 90% of the size of smallest three. Not very good.

I was about to give up, but first I ran some analysis over the smallest three representation. I found that 90% of orientations in the smallest three format had the same largest component index as their base orientation 100ms ago. This meant it could be profitable to delta encode the smallest three format directly. What’s more, I found there would be no additional precision loss with this method when reconstructing the orientation from its base. I exported the quaternion values from a typical run as a data set in smallest three format (available here) and got to work, trying the same multi-level small/large range per-component greedy search I used for position.

The best encoding found was: 5-8, meaning [-16,+15] small and [-128,+127] large. One final thing: as with position the large range can be extended a bit further by knowing that if the component value is not small the value cannot be in the [-16,+15] range. I leave the calculation of how to do this as an exercise for the reader. Be careful not to collapse two values onto zero.

The end result is an average of 23.3 bits per-relative quaternion. That’s 80.3% of the absolute smallest three.

That’s just about it but there is one small win left. Doing one final analysis pass over the position and orientation data sets I noticed that 5% of positions are unchanged from the base position after being quantized to 0.5mm resolution, and 5% of orientations in smallest three format are also unchanged from base.

These two cases never occur together, because if both position and orientation were unchanged the cube would not be sent at all. So a small statistical win exists for 10% of cube state if we send one bit for position changed and one bit for orientation changed. Yes, 90% of cubes have 2 bits of overhead added, but the 10% of cubes that save 20+ bits by sending 2 bits instead of a 23.3 bit orientation or 26.1 bit position make up for that, providing a small overall win of roughly 2 bits per-cube.

There are many options for bandwidth optimization and with a bit of work the seemingly impossible is in fact possible. We just took 20mbit down to less than 0.25mbit on average. That’s less than 1.25% of original uncompressed bandwidth!

That’s about as far as I can take it using traditional hand-rolled bit-packing techniques.

You can find source code for my implementation of all compression techniques mentioned in this article here.

It’s possible to get even better compression using a different approach. Bit-packing is inefficient because not all bit values have equal probability of 0 vs 1. No matter how hard you tune your bit-packer a context aware arithmetic encoding can beat your result by more accurately modeling the probability of values that occur in your data set. This implementation by Fabian Giesen beat my best bit-packed result by 25%.

It’s also possible to get a much better result for delta encoded orientations using the previous baseline orientation values to estimate angular velocity and predict future orientations rather than delta encoding the smallest three representation directly. Chris Doran from Geomerics wrote an excellent article exploring the mathematics of quaternion compression that is worth reading.

Up next: State Synchronization

If you are enjoying the networked physics article series, please consider supporting my work with a patreon donation. **Patreon supporters get early access to unpublished articles!**

If you have enjoyed the articles on this site over the past ten years, please consider supporting gafferongames.com with a small patreon donation. It takes only a small amount of money to host this website but a whole truckload of effort to research and write the articles that make this site worth reading. Many of these articles are only possible because I spend large portions of my spare time, holidays and time between jobs researching and developing techniques to write about here.

My aim on this site is to always write clearly and explain game networking concepts that you won’t read about anywhere else. If you have read any of my game networking articles over the past 10 years I’m sure you’ll agree. You’ll find quality game networking information here that simply isn’t available anywhere else. Many people have learned game network programming from the articles posted on this website, and this is something I’m very proud of. Well done!

Lately I’ve started including videos with my articles to explain concepts. It costs money to host these videos at high quality, and it takes a great deal of work to research and develop the game networking techniques I write about on this website. You can trust that I *only* ever write about things I have implemented, and this implementation takes a LOT of time. This is why my articles are so full of concrete examples rather than theoretical explanation, but it’s also why it’s so much work to write them!

If you have enjoyed the articles on this website, please consider showing your support for my work by making a small patreon donation.

**Networked Physics article series resumes once my hosting costs for 2015 are covered!**

*Wishing all my readers a happy new year and a great 2015. I hope to see you with more articles in the new year!*

In this article we’ll spend some time exploring the physics simulation we’re going to network.

First we need an object controlled by the player. Here I’ve set up a simple simulation of a cube in the open source physics engine ODE. The player can use the arrow keys to make the cube move around by applying force at its center of mass. The physics simulation takes this linear motion and calculates friction as the cube collides with the ground, inducing a rolling and tumbling motion.

This tumbling is quite intentional and is why I chose a cube for this simulation instead of a sphere. I want this complex, unpredictable motion because rigid bodies in general move in interesting non-linear ways according to their shape! It’s not possible to accurately predict the motion of this tumbling cube using a simple linear extrapolation or the ballistic equations of motion. If you want to know where a physics object is at a future time, you have to run the whole physics simulation with collision detection, collision response and friction in order to find out.

Moving forward. Networking a physics simulation is just __too easy__ if there is only one object interacting with a static world. It starts to get interesting when the player controls an object that interacts with other physically simulated objects, especially if those objects push back and affect the motion of the player.

Here I’ve added a grid of 900 small cubes to the simulation so the player has something to interact with. When the player interacts with a cube it turns red. When a non-player cube comes to rest it returns to grey (non-interacting). Interactions aren’t just direct: if a cube is red, it turns other cubes it interacts with red as well. This way player interactions fan out recursively, covering all affected objects.

It’s cool to roll around and push cubes, but I really wanted a way to interact with lots of objects and push them around. What I came up with is this:

To implement this I raycast below the player cube to find the intersection point with the ground below its center of mass, then apply a spring force (Hooke’s law) so the cube floats in the air at a certain height above this point. All non-player cubes within a certain distance of that intersection point have a force applied, directed away from the point and proportional to their distance from it, so they are pushed away from the cube like it’s a leaf blower.
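A rough sketch of the hover spring (hypothetical names and constants; a real implementation would also apply damping to stop oscillation):

```cpp
// Hooke's law along z: force pushes the cube toward a target height
// above the raycast intersection point with the ground.
float hover_spring_force( float cube_z, float ground_z, float hover_height, float spring_k )
{
    float target_z = ground_z + hover_height;
    return spring_k * ( target_z - cube_z );
}
```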

I also wanted a very complex coupled motion between the player and non-player cubes, where the player interacts with lots of other physics objects so closely that the player and the objects it’s interacting with become a single system: a group of rigid bodies joined together by constraints. To implement this I thought it would be cool if the player could roll around and create a ball of cubes, like in one of my favorite games, Katamari Damacy.

To implement this effect cubes within a certain distance of the player have a force applied towards the center of the cube. Note that these cubes remain physically simulated while in the katamari ball, they are not just “stuck” to the player like in the original game. This means that the cubes in the katamari are continually pushing and interacting with each other and the player cube via collision constraints, a very difficult situation for networked physics.

That’s it for the exploration of the physics simulation. Let’s get busy networking it!

Up first: Deterministic Lockstep

If you enjoyed this article please consider making a small donation. __Donations encourage me to write more articles!__