Floating Point Determinism

Introduction

Lately I’ve been doing some research into networking game physics simulations via deterministic lockstep methods.

The basic idea is that instead of synchronizing the state of physics objects directly by sending the positions, orientations, velocities etc. over the network, one could synchronize the simulation implicitly by sending just the player inputs.

This is a very attractive synchronization strategy because the amount of network traffic depends on the size of the player inputs instead of the amount of physics state in the world. In fact, this strategy has been used for many years in RTS games for precisely this reason; with thousands and thousands of units on the map, they simply have too much state to send over the network.

Perhaps you have a complex physics simulation with lots of rigid body state, or a cloth or soft body simulation which needs to stay perfectly in sync across two machines because it is gameplay affecting, but you cannot afford to send all the state. It is clear that the only possible solution in this situation is to attempt a deterministic networking strategy.

But we run into a problem. Physics simulations use floating point calculations, and for one reason or another it is considered very difficult to get exactly the same result from floating point calculations on two different machines. People even report different results on the same machine from run to run, and between debug and release builds. Other folks say that AMD machines give different results from Intel machines, and that SSE results are different from x87. What exactly is going on? Are floating point calculations deterministic or not?

Unfortunately, the answer is not a simple “yes” or “no” but a resoundingly limp “maybe?”

Here is what I have discovered so far:

1. If your physics simulation is itself deterministic, with a bit of work you should be able to get it to play back a replay of recorded inputs on the same machine and get the same result (see the sketch after this list for one way to verify this).

2. It is possible to get deterministic results for floating point calculations across multiple computers provided you use an executable built with the same compiler, run on machines with the same architecture, and perform some platform-specific tricks.

3. It is incredibly naive to write arbitrary floating point code in C or C++ and expect it to give exactly the same result across different compilers or architectures.

4. However with a good deal of work you may be able to coax exactly the same floating point results out of different compilers or different machine architectures by using your compiler’s “strict” IEEE 754 compliant mode and restricting the set of floating point operations you use. This typically results in significantly lower floating point performance.
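
To make point 1 concrete, here is a minimal sketch of how recorded inputs plus a per-tick checksum of simulation state can be used to verify deterministic playback on a single machine. Everything here (the Input and SimState types, the toy step function, the FNV-1a hash) is an illustrative assumption rather than code from any particular engine:

#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <vector>

// Illustrative types; a real engine has its own input and state layout.
struct Input    { int16_t stickX, stickY; };
struct SimState { float px, py, vx, vy; };

// FNV-1a over the raw bit pattern of the state. Two runs are bit-for-bit
// identical only if every per-tick checksum matches.
uint64_t checksum( const SimState & s )
{
    const uint8_t * bytes = (const uint8_t*) &s;
    uint64_t hash = 14695981039346656037ull;
    for ( size_t i = 0; i < sizeof( s ); ++i )
        hash = ( hash ^ bytes[i] ) * 1099511628211ull;
    return hash;
}

// Stand-in for your simulation tick.
void step( SimState & s, const Input & input, float dt )
{
    s.vx += ( input.stickX / 32767.0f ) * dt;
    s.vy += ( input.stickY / 32767.0f ) * dt;
    s.px += s.vx * dt;
    s.py += s.vy * dt;
}

// Run the simulation once recording a checksum per tick, then replay the same
// inputs and assert that every tick matches the recording.
void verify_replay( SimState initial, const std::vector<Input> & inputs, float dt )
{
    std::vector<uint64_t> recorded;
    SimState s = initial;
    for ( size_t i = 0; i < inputs.size(); ++i )
    {
        step( s, inputs[i], dt );
        recorded.push_back( checksum( s ) );
    }

    s = initial;
    for ( size_t i = 0; i < inputs.size(); ++i )
    {
        step( s, inputs[i], dt );
        assert( checksum( s ) == recorded[i] );   // fires on the first diverging tick
    }
}

In practice you would fold every object in the world into the per-tick checksum and log the values, so that when two machines do diverge you can see exactly which tick went wrong.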

If you would like to debate these points or add your own nuance, please write a comment! I consider this question by no means settled and am very interested in other people’s experiences with deterministic floating point simulations and exactly reproducible floating point calculations. Please contact me especially if you have managed to get binary exact results across different architectures and compilers in real world situations.


Here are the resources that I have discovered in my search so far…


The technology we license to various customers is based on determinism of floating point (in 64-bit mode, even) and has worked that way since the year 2000.

As long as you stick to a single compiler, and a single CPU instruction set, it is possible to make floating point fully deterministic. The specifics vary by platform (i.e., different between x86, x64 and PPC).

You have to make sure that the internal precision is set to 64 bits (not 80, because only Intel implements that), and that the rounding mode is consistent. Furthermore, you have to check this after calls to external DLLs, because many DLLs (Direct3D, printer drivers, sound libraries, etc) will change the precision or rounding mode without setting it back.

The ISA is IEEE compliant. If your x87 implementation isn’t IEEE, it’s not x87.

Also, you can’t use SSE or SSE2 for floating point, because it’s too under-specified to be deterministic.

Jon Watte, GameDev.net forums
http://www.gamedev.net/community/forums/topic.asp?topic_id=499435


I work at Gas Powered Games and I can tell you first hand that floating point math is deterministic. You just need the same instruction set and compiler, and of course the user’s processor must adhere to the IEEE 754 standard, which includes all of our PC and 360 customers. The engine that runs Demigod, Supreme Commander 1 and 2 relies upon the IEEE 754 standard, not to mention probably all other RTS peer-to-peer games on the market. As soon as you have a peer-to-peer network game where each client broadcasts what command they are doing on what ‘tick’ number and relies on the client computer to figure out the simulation/physical details, you’re going to rely on the determinism of the floating point processor.

At app startup time we call:

_controlfp(_PC_24, _MCW_PC);
_controlfp(_RC_NEAR, _MCW_RC);

Also, every tick we assert that these fpu settings are still set:

gpAssert( (_controlfp(0, 0) & _MCW_PC) == _PC_24 );
gpAssert( (_controlfp(0, 0) & _MCW_RC) == _RC_NEAR );

There are some MS API functions that can change the fpu model on you so you need to manually enforce the fpu mode after those calls to ensure the fpu stays the same across machines. The assert is there to catch if anyone has buggered the fpu mode.

FYI: we have the compiler floating point model set to fast (/fp:fast), but it’s not a requirement.

We have never had a problem with the IEEE standard across any PC CPU, AMD or Intel, with this approach. None of our SupCom or Demigod customers have had problems with their machines either, and we are talking over 1 million customers here (SupCom 1 + expansion pack). We would have heard if the FPU was not producing the same results, because replays or multiplayer mode wouldn’t work at all.

We did however have problems when using some physics APIs because their code did not have determinism or reproducibility in mind. For example, some physics APIs have solvers that take X iterations when solving, where X can be lower with faster CPUs.

Elijah, Gas Powered Games
http://www.box2d.org/forum/viewtopic.php?f=3&t=1800
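
For convenience, here is a self-contained sketch of the calls Elijah describes, assuming MSVC, a 32-bit x86 target (x87 precision control is not supported by the x64 CRT) and the standard assert in place of the engine’s gpAssert:

#include <assert.h>
#include <float.h>

// Call at startup, and again after anything that might touch the control word.
void SetFloatingPointMode()
{
    _controlfp( _PC_24, _MCW_PC );      // 24-bit significand precision
    _controlfp( _RC_NEAR, _MCW_RC );    // round to nearest
}

// Call every simulation tick: catches DLLs (Direct3D, printer drivers, sound
// libraries...) that silently change the mode behind your back.
void AssertFloatingPointMode()
{
    assert( ( _controlfp( 0, 0 ) & _MCW_PC ) == _PC_24 );
    assert( ( _controlfp( 0, 0 ) & _MCW_RC ) == _RC_NEAR );
}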


If you store replays as controller inputs, they cannot be played back on machines with different CPU architectures, compilers, or optimization settings. In MotoGP, this meant we could not share saved replays between Xbox and PC. It also meant that if we saved a replay from a debug build of the game, it would not work in release builds, or vice versa. This is not always a problem (we never shipped debug builds, after all), but if we ever released a patch, we had to build it using the exact same compiler as the original game. If the compiler had been updated since our original release, and we built a patch using the newer compiler, this could change things around enough that replays saved by the original game would no longer play back correctly.

This is madness! Why don’t we make all hardware work the same? Well, we could, if we didn’t care about performance. We could say “hey Mr. Hardware Guy, forget about your crazy fused multiply-add instructions and just give us a basic IEEE implementation”, and “hey Compiler Dude, please don’t bother trying to optimize our code”. That way our programs would run consistently slowly everywhere :-)

Shawn Hargreaves, MSDN Blog
http://blogs.msdn.com/shawnhar/archive/2009/03/25/is-floating-point-math-deterministic.aspx


Battlezone 2 used a lockstep networking model requiring absolutely identical results on every client, down to the least-significant bit of the mantissa, or the simulations would start to diverge. While this was difficult to achieve, it meant we only needed to send user input across the network; all other game state could be computed locally. During development, we discovered that AMD and Intel processors produced slightly different results for transcendental functions (sin, cos, tan, and their inverses), so we had to wrap them in non-optimized function calls to force the compiler to leave them at single-precision. That was enough to make AMD and Intel processors consistent, but it was definitely a learning experience.

Ken Miller, Pandemic Studios
http://www.box2d.org/forum/viewtopic.php?f=4&t=175
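
The original Battlezone 2 source is not public, but the kind of wrapper Ken describes might look something like the sketch below: a deliberately un-optimized, out-of-line call that forces the transcendental result back down to single precision before anything else touches it (the name and pragmas are my assumptions, not Pandemic’s code):

#include <math.h>

#pragma optimize( "", off )    // MSVC: keep this wrapper un-optimized
float WrappedSin( float x )
{
    volatile float result = (float) sin( (double) x );   // round to float in memory
    return result;
}
#pragma optimize( "", on )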


… In FSW1, when a desync was detected, the player would be instantly killed by “magic sniper”. :) All that stuff was fixed in FSW2. We just ran precise FP and used the Havok FPU libs instead of SIMD on PC. Integer modulo is a problem too, because the C++ standard says it’s “implementation defined” (an issue when multiple compilers/platforms are used). In general I liked the lockstep tools we developed; finding desyncs in code on FSW2 was trivial.

Branimir Karadžić, Pandemic Studios

http://www.google.com/buzz/100111796601236342885/8hDZ655S6x3/Floating-Point-Determinism-Gaffer-on-Games


I know three main sources of floating point inconsistency pain:

1. Algebraic compiler optimizations
2. “Complex” instructions like multiply-accumulate or sine
3. x86-specific pain not available on any other platform; not that ~100% of non-embedded devices is a small market share for a pain.

The good news is that most pain comes from item 3, which can be more or less solved automatically. For the purpose of decision making (“should we invest energy into FP consistency or is it futile?”), I’d say that it’s not futile, and if you can cite actual benefits you’d get from consistency, then it’s worth the (continuous) effort.

Summary: use SSE2 or SSE, and if you can’t, configure the FP CSR to use 64b intermediates and avoid 32b floats. Even the latter solution works passably in practice, as long as everybody is aware of it.

Yossi Kreinin, Consistency: how to defeat the purpose of IEEE floating point
http://www.yosefk.com/blog/consistency-how-to-defeat-the-purpose-of-ieee-floating-point.html
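
For the “configure the FP CSR to use 64b intermediates” part, here is a minimal sketch for Linux/x86 assuming glibc’s <fpu_control.h>; the MSVC equivalent would be _controlfp( _PC_53, _MCW_PC ):

#include <fpu_control.h>    // glibc, x86 only

// Drop the x87 unit from 80-bit extended to 64-bit double intermediates so
// that register-held temporaries round the same way as values in memory.
void SetDoublePrecision()
{
    fpu_control_t cw;
    _FPU_GETCW( cw );
    cw = ( cw & ~_FPU_EXTENDED ) | _FPU_DOUBLE;
    _FPU_SETCW( cw );
}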


The short answer is that FP calculations are entirely deterministic, as per the IEEE Floating Point Standard, but that doesn’t mean they’re entirely reproducible across machines, compilers, OS’s, etc.

The long answer to these questions and more can be found in what is probably the best reference on floating point, David Goldberg’s What Every Computer Scientist Should Know About Floating Point Arithmetic. Skip to the section on the IEEE standard for the key details.

Finally, if you are doing the same sequence of floating point calculations on the same initial inputs, then things should be exactly replayable. The exact sequence can change depending on your compiler/OS/standard library, though, so you might get some small errors this way.

Where you usually run into problems in floating point is if you have a numerically unstable method and you start with FP inputs that are approximately the same but not quite. If your method’s stable, you should be able to guarantee reproducibility within some tolerance. If you want more detail than this, then take a look at Goldberg’s FP article linked above or pick up an intro text on numerical analysis.

Todd Gamblin, Stack Overflow
http://stackoverflow.com/questions/968435/what-could-cause-a-deterministic-process-to-generate-floating-point-errors


The C++ standard does not specify a binary representation for the floating-point types float, double and long double. Although not required by the standard, the implementation of floating point arithmetic used by most C++ compilers conforms to a standard, IEEE 754-1985, at least for types float and double. This is directly related to the fact that the floating point units of modern CPUs also support this standard. The IEEE 754 standard specifies the binary format for floating point numbers, as well as the semantics for floating point operations. Nevertheless, the degree to which the various compilers implement all the features of IEEE 754 varies. This creates various pitfalls for anyone writing portable floating-point code in C++.

Günter Obiltschnig, Cross-Platform Issues with Floating-Point arithmetics in C++
http://www.appinf.com/download/FPIssues.pdf


Floating-point computations are strongly dependent on the FPU hardware implementation, the compiler and its optimizations, and the system mathematical library (libm). Experiments are usually reproducible only on the same machine with the same system library and the same compiler using the same options.

STREFLOP Library
http://nicolas.brodu.numerimoire.net/en/programmation/streflop/index.html


Floating Point (FP) Programming Objectives:

Accuracy – Produce results that are “close” to the correct value

Reproducibility – Produce consistent results from one run to the next. From one set of build options to another. From one compiler to another. From one platform to another.

Performance – Produce the most efficient code possible.

These objectives usually conflict! Judicious use of compiler options lets you control the tradeoffs.

Intel C++ Compiler: Floating Point Consistency
http://www.nccs.nasa.gov/images/FloatingPoint%5Fconsistency.pdf


If strict reproducibility and consistency are important, do not change the floating point environment without also using either the fp-model strict (Linux or Mac OS*) or /fp:strict (Windows*) option, or the fenv_access pragma.

Intel C++ Compiler Manual

http://cache-www.intel.com/cd/00/00/34/76/347605_347605.pdf


Under the fp:strict mode, the compiler never performs any optimizations that perturb the accuracy of floating-point computations. The compiler will always round correctly at assignments, typecasts and function calls, and intermediate rounding will be consistently performed at the same precision as the FPU registers. Floating-point exception semantics and FPU environment sensitivity are enabled by default. Certain optimizations, such as contractions, are disabled because the compiler cannot guarantee correctness in every case.

Microsoft Visual C++ Floating-Point Optimization
http://msdn.microsoft.com/en-us/library/aa289157(VS.71).aspx#floapoint_topic4
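
If only part of a codebase needs to be deterministic, MSVC’s float_control pragma can, as I understand it, switch precise semantics on around just the simulation code while the rest of the project builds with /fp:fast. A sketch (the function is illustrative):

// simulation.cpp - the project builds with /fp:fast, but everything between
// push and pop is compiled with precise floating-point semantics.
#pragma float_control( precise, on, push )

float Integrate( float x, float v, float dt )
{
    return x + v * dt;    // protected from /fp:fast reordering and reassociation
}

#pragma float_control( pop )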


Please note that the results of floating point calculations will likely not be exactly the same between PowerPC and Intel, because the PowerPC scalar and vector FPU cores are designed around a fused multiply add operation. The Intel chips have separate multiplier and adder, meaning that those operations must be done separately. This means that for some steps in a calculation, the Intel CPU may incur an extra rounding step, which may introduce 1/2 ulp errors at the multiplication stage in the calculation.

Apple Developer Support
http://developer.apple.com/hardwaredrivers/ve/sse.html
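
On the compiler side, the standard FP_CONTRACT pragma (where it is honoured; GCC and Clang also accept -ffp-contract=off) asks the compiler not to fuse a*b + c into a single multiply-add, so the multiply is rounded separately just as it would be on hardware without FMA. A sketch:

#pragma STDC FP_CONTRACT OFF

// With contraction off, a * b is rounded before the add, matching hardware
// that has no fused multiply-add instruction.
double MulAdd( double a, double b, double c )
{
    return a * b + c;
}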


For all of the instructions that are IEEE operations (*, +, -, /, sqrt, compares, regardless of whether they are SSE or x87), they will produce the same results across platforms with the same control settings (same precision control and rounding modes, flush to zero, etc.) and inputs. This is true for both 32-bit and 64-bit processors… On the x87 side, the transcendental instructions like fsin, fcos, etc. could produce slightly different answers across implementations. They are specified with a relative error that is guaranteed, but not bit-for-bit accuracy.

Intel Software Network Support
http://software.intel.com/en-us/forums/showthread.php?t=48339


I’m concerned about the possible differences between hardware implementations of IEEE-754. I already know about the problem of programming languages introducing subtle differences between what is written in the source code and what is actually executed at the assembly level. [Mon08] Now, I’m interested in differences between, say, Intel/SSE and PowerPC at the level of individual instructions.

D. Monniaux on IEEE 754 mailing list
http://grouper.ieee.org/groups/754/email/msg03864.html


One must … avoid the non-754 instructions that are becoming more prevalent for inverse and inverse sqrt that don’t round correctly or even consistently from one implementation to another, as well as the x87 transcendental operations which are necessarily implemented differently by AMD and Intel.

David Hough on 754 IEEE mailing list
http://grouper.ieee.org/groups/754/email/msg03867.html


Yes, getting reproducible results IS possible. But you CAN’T do it without defining a programming methodology intended to deliver that property. And that has FAR more drastic consequences than any of its proponents admit – in particular, it is effectively incompatible with most forms of parallelism.

Nick Maclaren on 754 IEEE mailing list
http://grouper.ieee.org/groups/754/email/msg03872.html


If we are talking practicabilities, then things are very different, and expecting repeatable results in real programs is crying for the moon. But we have been there before, and let’s not go there again.

Nick Maclaren on 754 IEEE mailing list
http://grouper.ieee.org/groups/754/email/msg03862.html


The IEEE 754-1985 standard allowed many variations in implementations (such as the encoding of some values and the detection of certain exceptions). IEEE 754-2008 has tightened up many of these, but a few variations still remain (especially for binary formats). The reproducibility clause recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results.

Wikipedia Page on IEEE 754-2008 standard
http://en.wikipedia.org/wiki/IEEE_754-2008#Reproducibility


If one wants semantics almost exactly faithful to strict IEEE-754 single or double precision computations in round-to-nearest mode, including with respect to overflow and underflow conditions, one can use, at the same time, limitation of precision and options and programming style that force operands to be systematically written to memory between floating-point operations. This incurs some performance loss; furthermore, there will still be slight discrepancy due to double rounding on underflow.

A simpler solution for current personal computers is simply to force the compiler to use the SSE unit for computations on IEEE-754 types; however, most embedded systems using IA32 microprocessors or microcontrollers do not use processors equipped with this unit.

David Monniaux, The pitfalls of verifying floating-point computations
http://hal.archives-ouvertes.fr/docs/00/28/14/29/PDF/floating-point-article.pdf
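
A minimal sketch of the idea of forcing operands out to memory between operations, using volatile stores so that each intermediate is rounded to its declared precision (slow, and as the paper notes, double rounding on underflow can still differ):

// Each intermediate is forced out to a double in memory, so it is rounded to
// 64 bits instead of living at x87 extended precision in a register.
double Dot3Strict( const double a[3], const double b[3] )
{
    volatile double t0 = a[0] * b[0];
    volatile double t1 = t0 + a[1] * b[1];
    volatile double t2 = t1 + a[2] * b[2];
    return t2;
}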


6. REPRODUCIBILITY

Even under the 1985 version of IEEE-754, if two implementations of the standard executed an operation on the same data, under the same rounding mode and default exception handling, the result of the operation would be identical. The new standard tries to go further to describe when a program will produce identical floating point results on different implementations. The operations described in the standard are all reproducible operations.

The recommended operations, such as library functions or reduction operators, are not reproducible, because they are not required in all implementations. Likewise, dependence on the underflow and inexact flags is not reproducible, because two different methods of treating underflow are allowed to preserve conformance between IEEE-754(1985) and IEEE-754(2008). The rounding modes are reproducible attributes. Optional attributes are not reproducible.

The use of value-changing optimizations is to be avoided for reproducibility. This includes use of the associative and distributive laws, and automatic generation of fused multiply-add operations when the programmer did not explicitly use that operator.

Peter Markstein, The New IEEE Standard for Floating Point Arithmetic
http://drops.dagstuhl.de/opus/volltexte/2008/1448/pdf/08021.MarksteinPeter.ExtAbstract.1448.pdf


Unfortunately, the IEEE standard does not guarantee that the same program will deliver identical results on all conforming systems. Most programs will actually produce different results on different systems for a variety of reasons. For one, most programs involve the conversion of numbers between decimal and binary formats, and the IEEE standard does not completely specify the accuracy with which such conversions must be performed. For another, many programs use elementary functions supplied by a system library, and the standard doesn’t specify these functions at all. Of course, most programmers know that these features lie beyond the scope of the IEEE standard.

Many programmers may not realize that even a program that uses only the numeric formats and operations prescribed by the IEEE standard can compute different results on different systems. In fact, the authors of the standard intended to allow different implementations to obtain different results. Their intent is evident in the definition of the term destination in the IEEE 754 standard: “A destination may be either explicitly designated by the user or implicitly supplied by the system (for example, intermediate results in subexpressions or arguments for procedures). Some languages place the results of intermediate calculations in destinations beyond the user’s control. Nonetheless, this standard defines the result of an operation in terms of that destination’s format and the operands’ values.” (IEEE 754-1985, p. 7) In other words, the IEEE standard requires that each result be rounded correctly to the precision of the destination into which it will be placed, but the standard does not require that the precision of that destination be determined by a user’s program. Thus, different systems may deliver their results to destinations with different precisions, causing the same program to produce different results (sometimes dramatically so), even though those systems all conform to the standard.

Differences Among IEEE 754 Implementations
http://docs.sun.com/source/806-3568/ncg_goldberg.html#3098
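
Here is a tiny example of the effect described above. Whether the intermediate sum is rounded to double or held at x87 extended precision is up to the implementation, and both behaviours conform to the standard, so treat the output as illustrative rather than guaranteed:

#include <stdio.h>

int main()
{
    double a = 1.0e16, b = 1.0, c = -1.0e16;
    double t = a + b;      // 1.0e16 + 1 is inexact in double, exact in 80-bit extended
    double r = t + c;
    printf( "%g\n", r );   // prints 0 if t was rounded to double, 1 if it stayed in a register
    return 0;
}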



Next: What Every Game Programmer Needs To Know About Game Networking




If you enjoyed this article please donate.

Donations offset hosting costs and encourage me to write more articles!

54 thoughts on “Floating Point Determinism”

  1. It’s certainly possible but not likely given that many, many games have shipped on the PC which rely on floating point determinism or they simply would not work — eg. Supreme Commander, Full Spectrum Warrior 1&2, Battlezone 2 all the games from Relic etc…

  2. True indeed; at the time I was thinking of an anecdote told to me a couple of years back by a developer working on the Xbox version of a PS2 title in which he swore up and down that sometimes a value would go into an FPU calculation the same on two Xboxes and come out differently on each; he was at the point of dumping the contents of memory *every frame* to try and catch this thing, and I don’t think he ever did (this could of course just have been an instance of bad hardware, like a busted devkit, though I assume he eliminated that possibility. I was too busy recoiling in horror to press for further details).

    I suppose my point is; in my experience, I haven’t encountered a case where fp behavior didn’t hold to be identical where I thought it should have been (namely, where the cpu arch/fpu flags/compiler settings were the same). Unfortunately because other problems manifest themselves as diverging fp results (reads from uninitialized stack variables causing different code paths being one that sticks vividly in my memory), this catch-all of “magical mystery floating point calculations” gets the blame*. Then on the other hand, I’ve seen things that could only possibly have been explained by bad hardware. So you have that to contend with too.

    * Well, unless you work for a physics middleware vendor. In which case, *you* get the blame. Not that I’m bitter!

  3. The Croquet project (http://www.opencroquet.org), which is a platform for a shared 3D VR environment along the lines of Second Life, does this kind of lockstep determinism by running on a VM (Squeak) which is crafted to be bit-identical across all supported architectures. Croquet itself is hardly production-quality software (and probably never will be), but a fast VM designed and tested to be bit-identical on various platforms is probably the easiest way to achieve this holy grail, as all clients can be kept in sync merely by replicating the input commands from one client to all connected clients.

    Their VM approach also allows the server to synchronize newly-connected clients simply by pushing the current state of the VM (as a blob of binary data) into the client’s virtual memory. Neat idea, but wholly dependent on implementing the VM a certain way, and probably too much overhead for real gaming.

    1. Yeah that’s pretty cool, I believe that the networking for old arcade ROMs like SF2 works on a similar principle, using the emulator as the VM. The trick is it also hides latency by forking the VM and playing user inputs up to present time. So you have the synced VM and the predictive VM. The synced VM advances only when inputs arrive from all players; each time you want to render, fork the synced VM and step it up to present time using local player inputs.

  4. I worked on NBA Street Online for PC whose networking code depends on floating point determinism.
    I have to say that it’s not a trivial task and it’s sometimes very hard and time consuming to figure out which fp operation is at fault and where it is called from. If the game’s simulation depends on some other libraries, it is even harder to figure out where the problem is.

    As mentioned before, set the fp mode and check that it is not modified by some instructions.
    SSE is fast but not all instructions are deterministic. Can’t clearly remember the list of instructions but usually those are the inverse instructions like 1/sqrt 1/x etc. that are the culprits.

    Just my personal experience.

  5. Btw, did you know that double is about 5 times faster than float in C++?
    Why don’t people use always double?
    Make a test where you add a small increment 10 billion times to a double number and you’ll see that it’s indeed about 5 times faster than using float.

      1. How can this be the case? A double requires twice as much memory to read and write, and considering SSE instructions you can only process half as much in the same number of cycles.

      1. I’ve seen this, only when the compiler is dumbed down to keep converting between floats and doubles and vice versa. Any printf-like (variable arg) function would convert float to double. Any constant without ‘f’ at the end might end up as double. On top of that lots of library math functions do return doubles, that get converted to floats. This is what kills your performance.

        And the easy solution is to peek once in a while in the assembly, if you see something converting (cvtss2sd or can’t remember right now the x87 one) then that would be it.

        Apart from that I prefer doubles now, but mainly for tools or future editors. Still, floats are at their best when used for runtime game data (vertices being the most typical example).

  6. Maybe a stupid question, but why must physics simulations use floating point? Couldn’t you use integers and a different scaling factor? If the smallest unit of measurement in your simulation is 1 micrometer then your world could still be (2^64/10^6) meters with a uint64_t?

    1. Repeated multiplication gets numbers very large very fast. But if you are certain you have the dynamic range in a 64-bit int, then sure, go for it.

      1. What is the common variable that gets overflowed by multiplication… position?

        I need deterministic physics shared by many machines, so I might give it a try and would like to know where it’s going to explode :)

        By the way, very nice articles.

        1. Yes, but only when you are doing calculations on it, eg. collision detection, physics solver for constraints etc. It doesn’t just magically overflow because it’s easy to bound position in some min/max — it’s only when you do calculations that depend on that position that you have to be careful for overflow (ie. all the time) :)

          1. If you know bounds on the approximate magnitude of all the values beforehand, you can apply shifts to avoid overflow (or use C++ templates to make it automatic). I’ve made a physics intensive fixed point game in C++ before, and it works pretty well. The main problem with this approach is that the code is no longer generic, so it’s only useful if you know exactly what you’ll be doing ahead of time.

  7. What is the influence of multi-threading and parallel processing on determinism? That’s something I am not entirely familiar with, since I’ve only experienced lockstep on single-threaded applications (at least the game code and physics part).

    All I know is that if you lose determinism, no matter how slight, you lose it forever. Very difficult to get right and a pain to debug. However, its flaws can be a blessing in disguise, since you can reproduce the discrepancy over and over, which sometimes ‘can’ help.

    The other advantage of determinism is replay systems, where you can reproduce game bugs in a consistent manner, with the added caveat that you usually need the original build to work with, as any change to code can break the replay.

  8. We dealt with this back in 1996 on “Myth: The Fallen Lords.” After several attempts to get the then-current version of MSVC to behave according to 754 and match the PPC version, we moved to fixed-point. We even later hit precision problems in the FP code used to initialize an 8-bit table of sin/cos values we used in our fixed-point sin/cos routines.

  9. I am currently developing a game that hinges on making sure that everything runs deterministically. My particular struggles have been with trying to get the Bullet physics engine to not diverge across network synchronization and game state serializations. It can be done with some specific setup and coaxing, but I have found that there are a lot of really subtle things internal to Bullet, like cache initialization values and object storage order, that can break determinism.

    My solution to a lot of these problems has been to observe when the physics simulation has “settled down” and very quickly and quietly shut down the entire physics engine, bring it back up again and place all the objects where they had settled. It is at these “settled” points that things like network synchronization and serialization are allowed to take place if needed. Fortunately the user never notices this happening because nothing is moving in the simulation at the time the reboot takes place. This really only works because my game happens to be turn-based.

    I recently wrote a brief overview on determinism as it has related to the development of my game: http://www.aorensoftware.com/blog/2011/01/28/determinism-in-games/

  10. I have a few questions

    There seems to be some conflicting information about SSE in the comments and resources above. Some of them suggest that you should turn SSE on, others say that some SSE instructions are non-deterministic.

    If that’s true, in which way are they not deterministic? Are they non-deterministic in relation to normal floating point instructions? In that case it shouldn’t really matter for most uses, as long as the compiled exe is the same and there are no different code paths taken depending on whether SSE is supported; in other words, the exe assumes that SSE (or SSE2) is supported.

    Finally how would you handle turning SSE off for specific parts of the game only? In our game only a small subset needs to be deterministic, while the rest should be as fast as possible. I guess that would mean that we would have to compile different versions of every library used, but that would of course generate duplicate symbols, so I guess it’s not an option. Another thing that could theoretically mess it up, is whole program optimization, since it could start combining SSE and non-SSE functions. Well during writing this I realized that since the problem is PC specific, I guess a DLL would work, along with the other headaches, like memory management problems it would bring, wouldn’t it?

    Another option would be to avoid this mess altogether and use only integer math, which would work perfectly in our case, since the deterministic code doesn’t need to be that fast. This could be done through some decimal or rational class. It would be reasonably safe if it doesn’t allow converting from or to floating point, but there’s still a small chance of problems if floats are used for intermediates and then converted back to integers. Are there any good libraries around that have a license compatible with games? The boost rational one would seem to fit at first, however it has properties that make it absolutely worthless in practice, see this thread http://lists.boost.org/Archives/boost/2004/12/77456.php The thread is old, but as far as I can see from the changes, nothing has been done about it.

    1. Best guy to talk to specifically would be Jon Watte. You can find him on twitter or on the network thread in gamedev.net forums. cheers

  11. Back in the day when this solution was viable, FPUs were an add-on, so mostly fixed point was used; determinism was still damn tricky to achieve. One motorbike game in particular would always diverge after about nine laps. I suspected that somewhere in the code someone was using a “real t” off the hardware clock rather than the “soft t” provided by the simulation. Tracking it down took a long time…

  12. Our tools pipeline requires that any asset produced with the same parameters should create the same binary (verified with MD5). This affected us when we switched from one DXT compression library to another, and it also limits us to not using the GPU (through CUDA or OpenCL); there the differences can be even more staggering.

    Later, colleagues from another studio forced all calculations to 64-bit precision (rather than 80), and I guess that was done (as I understood it here) because AMD might not produce the same results as Intel when 80-bit (intermediate) precision is used.

    All in all this page, and all the research is very useful.

    What we do in our games is to have ONE machine doing the simulation, eventually migrating itself to another if needed (player quits). But we are an FPS, so that is easier. That’s what I know from our networking and gameplay folks.

  13. As far as the effect of floating point calculations on multithreaded programs is concerned, it is not deterministic, because floating point arithmetic is neither associative nor distributive; it’s only commutative. So for example, a + ( b + c ) is not equal to ( a + b ) + c.

    Therefore if different threads are adding to sums by acquiring a lock, you need to make sure that all replicas acquire the locks in the same order or else the execution will diverge.

  14. SSE float ops are not deterministic because the standard specifies a minimum error, not a specific error for some operations like reciprocal. You need to call _set_SSE2_enable(0) at startup on all your threads along with your _controlfp calls.

    Multithreading won’t have any effect on the order of operations for floats. The compiler decides on that ordering at build time, so as long as all threads are running the same code, it won’t matter. What will matter is the ordering of results when you merge results, and that can break if you split work and don’t merge it back together based on a fully deterministic ordering.

  15. I wonder if there is a way to determine the largest accurate mantissa for common architectures after performing a single float operation? I did a quick search on ISI web of science and didn’t find anything. Somebody should really do an investigation!

    1. It wouldn’t help for determinism to determine the largest accurate mantissa on various architectures. A disagreement in a lower bit can propagate to a disagreement in higher bits if the number lands near the edge of rounding.

      It’s probably time to stop depending on floats to be deterministic on PCs, though. No new features on 64 bit CPUs are deterministic across models / manufacturers. All hail the rise of cross platform deterministic int based game and physics engines!

      1. “No new features on 64 bit CPUs are deterministic across models / manufacturers.”

        Could you elaborate on this, please?
        I would be very interested!

        Cheers,
        Bertrand

        1. For example, float ops in SSE2+ are not guaranteed to match even between 2 implementations by the same manufacturer. The spec guarantees minimum accuracies, not specific accuracies.

          1. Hi Chris, I understand that basic SSE2 float ops are still 754 compliant in SSE. Certain transcendental functions, fast sqrts, fast inverses, sure, those things are within some minimum accuracy, but the basic float ops are actually fine, in fact afaik BETTER than the non-SSE FPU on Intel, because you have 32-bit precision internally vs. the weird 80-bit precision in the FPU registers, with flushing to 32-bit precision when written to memory.

          2. I guess your point is noted, the new and cool stuff, well nobody is guaranteeing that it is deterministic at all. Shame really. And if you think determinism is bad on SSE, consider a CUDA physics sim running on ATI vs. NVidia hardware ;)

  16. Although the differences exist, would it be possible to use this method and send corrective states over time so that a client could correct itself, or are the differences likely to result in large divergence quickly?

    1. Yes it is possible, but consider the situation where the corrective states are too large to send. Then you go for 100% determinism. cheers

  17. I used the Farseer physics engine for my 2D dynamic online multiplayer game “Air Soccer Fever” on WP7 and I had to do a lot of hacks to sync physics on two WP7 devices playing a match. Without game state syncs, the game would diverge very soon because the game is all about collisions of players and ball.

    So who is going to write the first integer based 2d & 3d physics engine? :)

  18. One idea which I’m considering is: using a 64 bit integer (128 bit with self-programmed logic if you want to get more precise, or larger range) instead of a floating point number for most of the stuff that needs to be deterministic. When you need to convert back to a float (for example, to display in OpenGL) you just divide the integer by like 1,000 or 1,000,000 (depends how much precision you want).

    Any recurring calculations which would accumulate any floating point differences over time (such as a player moving) would become completely deterministic if represented using integers.

    Thanks for all of your networking articles – they were fascinating to read and I’ll definitely find them useful throughout the development of my game engine networking system.

    1. That’s a good idea, but you’ll need to implement your own sqrt, transcendental functions (sin, cos, tan, etc.) and so on… and make sure that those implementations are both fast and 100% deterministic :) A sketch of a deterministic integer square root appears at the end of this thread.

    2. You need to be careful when scaling things. For example, if we have the following calculations,

      a = b * b + c

      Then if you simply do the following,

      b = bfloat * 1000;
      c = cfloat * 1000;

      a = b * b + c;

      afloat = a / 1000;

      You would have wrong results. This is because of b*b, where the scale factor would be squared. So the correct way of doing this would be the following.

      a = b * b / 1000 + c;
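
      Following up on the reply above about needing your own deterministic math routines, here is a minimal sketch of a bit-by-bit integer square root; it uses only integer operations, so it produces the same result on every conforming platform (illustrative, not from any shipped engine):

      #include <stdint.h>

      // Floor of the square root of n, computed one result bit at a time.
      uint64_t isqrt64( uint64_t n )
      {
          uint64_t root = 0;
          uint64_t bit = 1ull << 62;    // highest power of four that fits in 64 bits
          while ( bit > n )
              bit >>= 2;
          while ( bit != 0 )
          {
              if ( n >= root + bit )
              {
                  n -= root + bit;
                  root = ( root >> 1 ) + bit;
              }
              else
              {
                  root >>= 1;
              }
              bit >>= 2;
          }
          return root;
      }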

  19. Hello. I have been using your website a lot. Thank you very much for doing all this. What would be the best way to dump its entire contents so I have an offline version in case of zombie apocalypse, diablo walking the earth or the mayans being correct?
    Cheers.

  20. I wonder if the various decimal floating point modules would be suitable to this task.

    Python:
    http://docs.python.org/2/library/decimal.html

    Java:
    http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html

    If they both conform to the same standard and calculations are done using the same context (precision, rounding mode, etc.), shouldn’t they yield the same results across different machines/platforms/languages?

    I don’t have access to

  21. I wonder if the various decimal floating point modules that conform to the IEEE 854 standard would be suitable for this.

    http://754r.ucbtest.org/standards/854.pdf

    http://docs.python.org/2/library/decimal.html

    http://docs.oracle.com/javase/1.5.0/docs/api/java/math/BigDecimal.html

    If they conform to the same standard and calculations are performed in the same context (precision, rounding mode, etc.), then shouldn’t they yield the same results across different machines/platforms/languages?

    I don’t have different machines to experiment with, but my trials with a Python test and a Java test on the same machine yield the same results.

  22. Pingback: Real-world number representation | gamemathcode
  23. I am also curious about the effects of floating-point determinism. I’m building a physics engine that is going to be in a pretty damn big world, and I need to know how floating-point maths is going to affect me… or should I just go with 64-bit integers (and divide them, as mentioned above)?

  24. I’m a developer on a synchronized simulation RTS that needs every peer to have EXACT bitwise identical internal state. We’ve successfully used SSE and x87 with controlfp calls to lower precision and control rounding. However, we did recently attempt to switch our FastInvertedSquareRoot to the Intel/AMD calls, and that failed miserably. So if you’re trying for floating point determinism, stay away from intrinsic hardware calls like _mm_rsqrt_ss; they’re not the same between Intel and AMD.

  25. Pingback: » Fixed-point arithmetic and determinism in Flash Marsh Games

