Performance optimization is often about trade-offs, especially once you get into the implementation details of a program. A routine may run faster if you unroll its loops to reduce branching overhead; however, code size increases as a result. Unroll too much and you risk exceeding optimal cache sizes, ending up with a slower program anyway. Compilers are becoming increasingly capable of evaluating such trade-offs, leaving the developer free to focus on designing higher-level functionality.
GCC has long provided the compiler option “-ffast-math”, which allows the compiler to use faster hardware floating point instructions at the potential expense of IEEE floating point compliance. If your program doesn’t need strict compliance, this setting should, in theory, net a boost in floating point performance for free.
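For reference, enabling or disabling the option looks something like this (the file names are illustrative, not from the actual Avida build):

```shell
# Relaxed semantics: permits reassociation, assumes no NaNs/Infs, etc.
g++ -O3 -ffast-math -c stats.cc -o stats.o

# Strict IEEE semantics: the default, or explicitly override an inherited setting
g++ -O3 -fno-fast-math -c stats.cc -o stats.o
```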
Avida is a heavily integer-based program. Floating point math is inherently imprecise and difficult to keep consistent across platforms, so we have purposely kept it isolated to the fringes as much as possible. Still, as a scientific program Avida needs to gather and calculate statistics, so there are some commonly executed floating point routines. The trade-off of a little accuracy for a boost in performance seemed worthwhile, so Avida has been compiled with ‘fast-math’ enabled for as long as I have been involved with the project, and likely since the beginning.
In recent years I built a consistency test framework that we have slowly been fleshing out with tests to keep Avida as consistent across platforms as possible. Perhaps surprisingly, outside of a few notable exceptions, we haven’t experienced a lot of variance in output. One of those notable exceptions has been compiling Avida with Intel’s ICC suite: one of the stats routines was returning floating point values with small variance relative to GCC on all other platforms.
In my quest to bring ICC into consistency, I did some digging into the floating point settings. ICC by default uses a ‘fast-math’-like mode; however, it was clearly generating a different mix of floating point instructions than GCC. My first thought was that the optimizers might have been exploiting associativity to reorder operations, which could affect rounding order. After a bit of failed parenthesis wrangling, I decided to test whether full IEEE compliance would fix the variance, and then try relaxing things from there. So my next step was to disable ‘fast-math’ on GCC. This change alone made the results match ICC. So the Intel compiler was actually generating more correct results. Interesting.
Conventional wisdom says that ‘fast-math’ should be, well, faster. To get consistency, though, the trade-off might have to go. So I tested performance to see how much slower Avida would be, given how isolated its floating point operations are.
tatooine:development brysonda$ ./run_tests -j 8 -p
# results truncated for brevity
analyze_truncate_lineage_fulllandscape : exceeded
- wall: 0.92 base = 4.8564 test = 4.4588
- user: 0.92 base = 9.3931 test = 8.6456
Wait a minute… it’s faster?
Disabling ‘fast-math’ not only solved the consistency problem, it made Avida faster in certain cases and had no effect in others. As seen above, it made as much as an 8% improvement on a first generation 8-core Mac Pro. Further tests on my Power Mac G5 Quad showed around 2% improvement in some tests and no difference in others.
In the end, I decided to disable ‘fast-math’ for GCC and to explicitly set full IEEE compliance mode on ICC, just to be sure. Performance testing showed no loss, and we now have full cross-platform consistency between ICC and GCC. Win-win!
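A sketch of the resulting flag configuration (the exact invocations are illustrative; ICC selects its value-safe, IEEE-compliant mode via the -fp-model option):

```shell
# GCC: simply drop -ffast-math; the default is IEEE-conformant
g++ -O3 -c stats.cc -o stats.o

# ICC: explicitly request value-safe, IEEE-compliant optimization
icc -O3 -fp-model precise -c stats.cc -o stats.o
```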
The take-away message is that it can be good to question conventional wisdom. And just because a flag says “fast” in its name doesn’t necessarily mean it is, in every case.