**Deep learning** has spurred interest in **novel floating point formats**. Algorithms often don’t need as much precision as standard IEEE-754 doubles or even single precision floats. Lower precision makes it possible to hold more numbers in memory, reducing the time spent swapping numbers in and out of memory. Also, low-precision circuits are far less complex. Together these benefits can give significant speedup.

Here I want to look at **bfloat16**, or BF16 for short, and compare it to 16-bit number formats I’ve written about previously, IEEE and posit. BF16 is becoming a de facto standard for deep learning. It is supported by several deep learning accelerators (such as Google’s TPU), and will be supported in Intel processors two generations from now.

## Bit layout

The BF16 format is sort of a cross between FP16 and FP32, the 16- and 32-bit formats defined in the IEEE 754-2008 standard, also known as half precision and single precision.

BF16 has 16 bits like FP16, but has the same number of exponent bits as FP32. Each number has 1 sign bit. The rest of the bits in each of the formats are allocated as in the table below.

| Format | Bits | Exponent | Fraction |
|--------|------|----------|----------|
| FP32   | 32   | 8        | 23       |
| FP16   | 16   | 5        | 10       |
| BF16   | 16   | 8        | 7        |

This makes conversion between FP32 and BF16 simple. To go from FP32 to BF16, just cut off the last 16 bits (in practice, rounding rather than simple truncation is usually preferred). To go up from BF16 to FP32, pad with zeros.
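As a sketch of the bit manipulation (assuming Python, and using plain truncation rather than the round-to-nearest conversion real hardware typically applies):

```python
import struct

def fp32_to_bf16_bits(x):
    """Truncating conversion: keep only the top 16 bits of the FP32 pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(b):
    """Widening conversion: pad the BF16 bit pattern with 16 zero bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x

print(hex(fp32_to_bf16_bits(1.0)))   # 1.0 is 0x3F800000 in FP32, so 0x3f80
print(bf16_bits_to_fp32(0x4049))     # 3.140625, exactly representable in BF16
```

Any value whose low 16 FP32 bits are already zero survives the round trip unchanged.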

## Precision

The epsilon value, the smallest number ε such that 1 + ε > 1 in machine representation, is 2^–f where *f* is the number of fraction bits. BF16 has much less precision near 1 than the other formats.

| Format | Epsilon    |
|--------|------------|
| FP32   | 0.00000012 |
| FP16   | 0.00097656 |
| BF16   | 0.00781250 |
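The epsilon values follow directly from the fraction-bit counts, as a quick Python check shows:

```python
# epsilon = 2^-f, where f is the number of fraction bits in each format
for name, f in [("FP32", 23), ("FP16", 10), ("BF16", 7)]:
    print(f"{name}: {2.0 ** -f:.8f}")
```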

## Dynamic range

The dynamic range of bfloat16 is similar to that of an IEEE single precision number. Relative to FP32, BF16 sacrifices precision to retain range. Range is mostly determined by the number of exponent bits, though not entirely.

Dynamic range in decades is the log base 10 of the ratio of the largest to smallest representable positive numbers. The dynamic ranges of the numeric formats are given below. (Python code to calculate dynamic range is given here.)

| Format | DR    |
|--------|-------|
| FP32   | 83.38 |
| BF16   | 78.57 |
| FP16   | 12.04 |
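A sketch of the calculation in Python, taking the smallest positive number to be the smallest subnormal and the largest to be the biggest finite value; it reproduces the table to within rounding in the last digit:

```python
import math

def dynamic_range(exp_bits, frac_bits):
    """Dynamic range in decades of an IEEE-like binary format."""
    emax = 2 ** (exp_bits - 1) - 1             # e.g. 127 for FP32
    emin = 1 - emax                            # smallest normal exponent
    largest = 2.0 ** emax * (2 - 2.0 ** -frac_bits)
    smallest = 2.0 ** (emin - frac_bits)       # smallest subnormal
    return math.log10(largest / smallest)

for name, e, f in [("FP32", 8, 23), ("BF16", 8, 7), ("FP16", 5, 10)]:
    print(f"{name}: {dynamic_range(e, f):.2f}")
```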

## Comparison to posits

The precision and dynamic range of posit numbers depend on how many bits you allocate to the maximum exponent, denoted *es* by convention. (Note “maximum.” The number of exponent bits varies for different numbers.) This post explains the anatomy of a posit number.

Posit numbers can achieve more precision and more dynamic range than IEEE-like floating point numbers with the same number of bits. Of course there’s no free lunch. Posits represent numbers near 1 with high precision and very large or very small numbers with low precision, but this trade-off is often what you’d want.

For an *n*-bit posit, the number of fraction bits near 1 is *n* – 2 – *es*, and so epsilon is 2 to the power *es* – *n* + 2. The dynamic range, in decades, is

log₁₀(2) · 2^(*es* + 1) · (*n* – 2)

which is derived here. The dynamic range and epsilon values for 16-bit posits with *es* ranging from 1 to 4 are given in the table below.

| es | DR     | Epsilon    |
|----|--------|------------|
| 1  | 16.86  | 0.00012207 |
| 2  | 33.72  | 0.00024414 |
| 3  | 67.43  | 0.00048828 |
| 4  | 134.86 | 0.00097656 |
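A few lines of Python reproduce the posit values from the formulas above (a sketch; *es* is the posit maximum-exponent parameter):

```python
import math

def posit_epsilon(n, es):
    # n - 2 - es fraction bits near 1, so epsilon = 2^(es - n + 2)
    return 2.0 ** (es - n + 2)

def posit_dynamic_range(n, es):
    # log10 of useed^(2(n-2)), where useed = 2^(2^es)
    return math.log10(2) * 2 ** (es + 1) * (n - 2)

for es in range(1, 5):
    print(f"es = {es}: DR = {posit_dynamic_range(16, es):.2f}, "
          f"epsilon = {posit_epsilon(16, es):.8f}")
```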

For *es* = 1, 2, or 3, a 16-bit posit number has a smaller epsilon than either FP16 or BF16; with *es* = 4 its epsilon equals that of FP16. The dynamic range of a 16-bit posit is larger than that of FP16 for every value of *es* above, and greater than that of BF16 and FP32 when *es* = 4.

Posts like this help me recall my early days developing complex embedded systems on primitive (though state-of-the-art for their day) 8-bit processors running at 1-4 MHz. No FPU, not even an integer divide instruction, and only rarely a full-width multiplier that, if present, was never single-cycle.

I was developing instrumentation that had to cover wide dynamic ranges and do lots of signal processing on raw sensor data to obtain useful readings, and do so in real-time. In 64KB of total memory, split between XIP (eXecute-In-Place) UV-EPROMs and static RAM.

The first priority was to do as much work as possible in integer form: Multi-byte integers were useful, but became impractical at 32-bits (value density vs. memory consumed vs. load/store time). To capture more dynamic range in less space, we’d go to fixed-point when we needed to, either 8 or 16-bit.

This led to the need to have multiple fixed-point representations, each tailored to the need to balance signal with noise: We tried to keep the uncertainty limited to the lowest bit, which in turn demanded careful hand-tuning of every algorithm to conserve those low bits while ruthlessly discarding them as error propagation eliminated their usefulness.
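As an illustration only (the original work was hand-tuned assembly, not Python), a Q8.8 fixed-point multiply shows where those low bits go:

```python
def to_q88(x):
    """Encode a real number as a 16-bit Q8.8 fixed-point integer."""
    return int(round(x * 256))

def q88_mul(a, b):
    """Multiply two Q8.8 values: form the full-width product,
    then discard the low 8 bits, where the noise accumulates."""
    return (a * b) >> 8

product = q88_mul(to_q88(1.5), to_q88(2.25))
print(product / 256)   # 3.375
```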

The most demanding part of this hand-optimization was limiting the number and width of intermediate representations: Using compact representations on the input mattered little if too much time was spent repeatedly using less compact representations to conserve both accuracy and resolution.

The math libraries routinely used lookup tables for many functions, particularly trig, logs and square root. Care was needed when choosing to use them, as each occupied precious EPROM. By focusing on algorithmic “choke points”, I could shove much of the computational burden into pre-computed lookup tables, with the table itself handling not only the computation, but also most changes in representation.
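A sketch of the idea in Python (a hypothetical table; the originals were computed offline and burned into EPROM):

```python
import math

# 256-entry table: square root of each 8-bit input, scaled to Q8.8 fixed point.
# Precomputed once; at run time a square root is a single indexed load.
SQRT_Q88 = [int(round(math.sqrt(i) * 256)) for i in range(256)]

def sqrt8(i):
    # The table also handles the change of representation:
    # 8-bit integer in, Q8.8 fixed-point out.
    return SQRT_Q88[i]

print(sqrt8(4) / 256)   # 2.0
```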

Mistakes were all too easy to make. For the tables themselves, each was developed and checked in FORTRAN on our then-new PDP-11/73.

To ensure each table was used correctly, we relied on strong typing. Unfortunately, the languages and compilers of the day targeting 8-bit micros focused primarily on the low-level hardware representations, providing typing primarily for higher-level constructs, which meant developing a preprocessor to ensure type-safety for our multiple fixed-point types and function tables.

Many iterations between algorithm design and implementation were needed to obtain a result that delivered the required performance in both the bit and time domains.

The one mathematical domain that most resisted the above techniques was time-based statistics, which dominated computation after the above optimizations were in place.

While table-based computation provided the biggest boost, the use of a number of different numerical value representations was what enabled that boost.

Which leads me to wonder why so much work is going into new floating point formats. I believe it may be useful to split the discussion in half: storage vs. calculation. Using standardized formats as external representations for storing or sharing values is certainly useful, but then use whatever internal representation works best (on several axes) for the operations each value participates in.

Trying to create a single representation to simultaneously meet so many needs seems a bit odd to me.

I further believe there is an inherent aversion to using a large number of internal representations; that is, an implicit desire to find that “one, best, true” representation. Why not use them all?

There is a desire to let hardware do all the heavy lifting, which hides the renormalizations within the FPU black box, creating the associated lack of visibility regarding FPU bugs, along with the associated trade-off between hardware complexity and software simplicity.

I’d like to see separate efforts for how best to communicate values and how best to operate on/with them.

FPGA developers are comfortable using arbitrary representations, the minimum needed to completely capture the desired characteristics. Yet for user-facing registers, they are always careful to convert those values to conventional forms. I think we all can benefit from that perspective when it comes to how we use our bits.

We saw such an evolution with the internet: When data links were slow, great emphasis was placed on conserving that bandwidth, with a number of Huffman-encoded binary representations fighting for early dominance, echoing the endianness and bus width battles seen in hardware at the time. With the advent of the web, things have evolved to using text as the communication representation, often JSON, with HTTP itself handling compression when bandwidth optimization is desired. This is seen even within databases, where text is used to support arbitrary numeric values free of any predetermined underlying hardware representation.

For the specific needs of neural network weights, it may be prudent to first consider representations that separate the exponent from the mantissa, having them be completely independent, before (if ever) seeking a unifying representation. I would expect different net types would benefit from different representations.