Gaussian distributed weights for LLMs

The previous post looked at the FP4 4-bit floating point format. This post will look at another 4-bit floating point format, NF4, and higher precision analogs. NF4 and FP4 are common bitsandbytes 4-bit data types. If you download LLM weights from Hugging Face quantized to four bits, the weights might be in NF4 or FP4 format. Or maybe some other format: there’s a surprising amount of variety in how 4-bit numbers are implemented.

Why NF4

LLM parameters have a roughly Gaussian distribution, and so evenly spaced numeric values are not ideal for parameters. Instead, you’d like numbers that are closer together near 0.

The FP4 floating point numbers, described in the previous post, are spaced 0.5 apart for small values, and the larger values are spaced 1 or 2 apart. That’s hardly a Gaussian distribution, but it’s closer to Gaussian than a uniform distribution would be. NF4 deliberately follows more of a Gaussian distribution.

QLoRA

The QLoRA formats [1], unlike FP4, are not analogs of IEEE numbers. The bits are not interpreted as sign, exponent, and mantissa, but rather as integers to be used as indexes. An NFn number is an index into a list of 2n real numbers with Gaussian spacing. To put it another way, the numbers represented by NFn have uniformly distributed z-scores.

That makes sense at a high level, but the paper [1] is hard to follow in detail. It says

More formally, we estimate the 2k values qi of the data type as follows:

q_i = \frac{1}{2}\left(Q_X\left(\frac{i}{2^k + 1}\right) + Q_X\left(\frac{i+1}{2^k + 1} \right) \right)

where QX(·) is the quantile function of the standard normal distribution N(0, 1).

The paper doesn’t give the range of i but it says there are 2k values, implying that i runs from 0 to 2k −1 or from 1 to 2k. Either way runs into infinite values since Q(0) = −∞ and Q(1) = ∞. We could avoid infinities by letting i run from 1 to 2n − 1.

The next sentence is puzzling.

A problem for a symmetric k-bit quantization is that this approach does not have an exact representation of zero, which is an important property to quantize padding and other zero-valued elements with no error.

I understand the desire to represent 0 exactly, but the equation above has an exact representation of 0 when i = 2n − 1. Perhaps the authors had in mind that i takes on the values ½, 1 + ½, 2 + ½, …, 2n − ½. This would be reasonable, but a highly unusual use of notation. It seems that the real problem is not the lack of a representation of 0 but an unused index, with i running from 1 to 2n − 1.

To be fair, the first sentence quoted above says “we estimate the 2k values …” and so the equation above may not be intended as a definition but as motivation for the actual definition.

Reproducing NF4

The authors give a procedure for using 2n values of i and obtaining an exact representation of 0, and they give a list of NF4 values in Appendix E. I was not able to get the two to match. I implemented a few possible interpretations of the procedure described in the paper, and each approximates the list of values in the appendix, but not closely.

The following code, written with the help of ChatGPT, reverse engineers the NF4 values to 8 decimal places, i.e. to the precision of a 32-bit floating point number.

from scipy.stats import norm

Q = norm.ppf

α  = 0.9677083
Z  = Q(α)
δ1 = (α - 0.5)/7
δ2 = (α - 0.5)/8

q = [0]*16
for i in range(7):
    q[i] = -Q(α - i*δ1)/Z
for i in range(8):
    q[i+8] = Q(0.5 + (i+1)*δ2)/Z
    
# Values given in Appendix E
NF4 = [
    -1.0,
    -0.6961928009986877,
    -0.5250730514526367,
    -0.39491748809814453,
    -0.28444138169288635,
    -0.18477343022823334,
    -0.09105003625154495,
    0.0,
    0.07958029955625534,
    0.16093020141124725,
    0.24611230194568634,
    0.33791524171829224,
    0.44070982933044434,
    0.5626170039176941,
    0.7229568362236023,
    1.0
]

# Compare 
for i in range(16):
    print(i, NF4[i] - q[i])

The magic number α = 0.9677083 is a mystery. I asked ChatGPT to look into this further, and it said that bitsandbytes uses α = 929/960 = 0.9677083333333333. When I use this value for α the precision is about the same, which is fine. However, the values in the paper were given to 16 decimal places, so I thought it might be able to match the values to more precision.

Quibbles over the exact values of NF4 aside, the NF4 format works well in practice. Models. quantized to 4 bits using NF4 perform better than models quantized to other 4-bit formats on some benchmarks.

Related posts

[1] QLoRA: Efficient Finetuning of Quantized LLMs by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. https://arxiv.org/abs/2305.14314.

4-bit floating point FP4

In ancient times, floating point numbers were stored in 32 bits. Then somewhere along the way 64 bits became standard. The C programming language retains the ancient lore, using float to refer to a 32-bit floating point number and double to refer to a floating point number with double the number of bits. Python simply uses float to refer to the most common floating point format, which C calls double.

Programmers were grateful for the move from 32-bit floats to 64-bit floats. It doesn’t hurt to have more precision, and some numerical problems go away when you go from 32-bits to 64-bits. (Though not all. Something I’ve written about numerous times.)

Neural networks brought about something extraordinary: demand for floating point numbers with less precision. These networks have an enormous number of parameters, and its more important to fit more parameters into memory than it is to have higher precision parameters. Instead of double precision (64 bits) developers wanted half precision (16 bits), or even less, such as FP8 (8 bits) or FP4 (4 bits). This post will look at 4-bit floating point numbers.

Why even bother with floating point numbers when you don’t need much precision? Why not use integers? For example, with four bits you could represent the integers 0, 1, 2, …, 15. You could introduce a bias, subtracting say 7 from each value, so your four bits represent −7 through 8. Turns out it’s useful to have a more dynamic range.

FP4

Signed 4-bit floating point numbers in FP4 format use the first bit to represent the sign. The question is what to do with the remaining three bits. The notation ExMy denotes a format with x exponent bits and y mantissa bits. In the context of signed 4-bit numbers,

xy = 3

but in other contexts the sum could be larger. For example, for an 8-bit signed float, xy = 7.

For 4-bit signed floats we have four possibilities: E3M0, E2M1, E1M2, and E0M3. All are used somewhere, but E2M1 is the most common and is supported in Nvidia hardware.

A number with sign bit s, exponent e, and mantissa m has the value

(−1)s 2eb (1 + m/2)

where b is the bias. The purpose of the bias is to allow positive and negative exponents without using signed numbers for e. So, for example, if b = 1 and e = 1, 2, or 3 then the exponent part 2eb can represent 1, 2, or 4.

The bias impacts the range of possible numbers but not their relative spacing. For any value of bias b, the E3M0 format is all exponent, no mantissa, and so its possible values are uniformly distributed on a log scale. The E0M3 format is all mantissa, so its values are uniformly distributed on a linear scale. The E1M2 and E2M1 formats are unevenly spaced on both log and linear scales.

There is an exception to the expression above for converting (sem) into a real number when e = 0. In that case, m = 0 represents 0 and m = 1 represents ½.

Table of values

Since there are only 16 possible FP4 numbers, it’s possible to list them all. Here is a table for the E2M1 format.

Bits s exp m  Value
-------------------
0000 0  00 0     +0
0001 0  00 1   +0.5
0010 0  01 0     +1
0011 0  01 1   +1.5
0100 0  10 0     +2
0101 0  10 1     +3
0110 0  11 0     +4
0111 0  11 1     +6
1000 1  00 0     -0
1001 1  00 1   -0.5
1010 1  01 0     -1
1011 1  01 1   -1.5
1100 1  10 0     -2
1101 1  10 1     -3
1110 1  11 0     -4
1111 1  11 1     -6

Note that even in this tiny floating point format, there are two zeros, +0 and −0, just like full precision floats. More on that here.

Pychop library

The Python library Pychop emulates a wide variety of reduced-precision floating point formats. Here is the code that used Pychop to create the table above.

import pychop

# Pull the format metadata from Pychop.
spec = pychop.MX_FORMATS["mxfp4_e2m1"]
assert (spec.exp_bits, spec.sig_bits) == (2, 1)

def e2m1_value(s: int, e: int, m: int) -> float:
    sign = -1.0 if s else 1.0

    # Subnormal / zero
    if e == 0:
        return sign * (m / 2.0)

    # Normal
    return sign * (2.0 ** (e - 1)) * (1.0 + m / 2.0)

def display_value(bits: int, x: float) -> str:
    if bits == 0b0000:
        return "+0"
    if bits == 0b1000:
        return "-0"
    return f"{x:+g}"

rows = []
for bits in range(16):
    s = (bits >> 3) & 0b1
    e = (bits >> 1) & 0b11
    m = bits & 0b1
    x = e2m1_value(s, e, m)

    rows.append(
        {
            "Bits": f"{bits:04b}",
            "s": s,
            "exp_bits": f"{e:02b}",
            "m": m,
            "Value": display_value(bits, x),
        }
    )

# Pretty-print the table.
header = f"{'Bits':<4} {'s':>1} {'exp':>3} {'m':>1} {'Value':>6}"
print(header)
print("-" * len(header))
for row in rows:
    print(
        f"{row['Bits']:<4} " f"{row['s']:>1} "
        f"{row['exp_bits']:>3} "
        f"{row['m']:>1} "
        f"{row['Value']:>6}"
    )

Other formats

FP4 isn’t the only 4-bit floating point format. There’s a surprisingly large number of formats in use. I intend to address another format my next post.

Update: See the next post for a discussion of NF4, a format whose representable numbers more closely matches the distribution of LLM weights.

Newton diameters

Let f(xy) be an nth degree polynomial in x and y. In general, a straight line will cross the zero set of f in n locations [1].

Newton defined a diameter to be any line that crosses the zero set of f exactly n times. If

f(xy) = x² + y² − 1

then the zero set of f is a circle and diameters of the circle in the usual sense are diameters in Newton’s sense. But Newton’s notion of diameter is more general, including lines the cross the circle without going through the center.

Newton’s theorem of diameters says that if you take several parallel diameters (in his sense of the word), the centroids of the intersections of each diameter with the curve f(xy) = 0 all line on a line.

To illustrate this theorem, let’s look at the elliptic curve

y² = x³ − 2x + 1,

i.e. the zeros of f(xy) = y² − (x³ − 2x + 1). This is a third degree curve, and so in general a straight line will cross the curve three times [2].

The orange, green, and red lines are parallel, each intersecting the blue elliptic curve three times. The dot on each line is the centroid of the intersection points, the center of mass if you imagine each intersection to be a unit point mass. The centroids all lie on a line, a vertical line in this example though in general the line could have any slope.

I hadn’t seen this theorem until I ran across it recently when skimming [3]. Search results suggest the theorem isn’t widely known, which is surprising for a result that goes back to Newton.

Update: See this post for another illustration using a fifth degree polynomial.

Related posts

[1] Bézout’s theorem says a curve of degree m and a curve of degee n will always intersect in mn points. But that includes complex roots, adds a line at infinity, and counts intersections with multiplicity. So a line, a curve of degree 1, will intersect a curve of degree n at n points in this extended sense.

[2] See the description of Bézout’s theorem in the previous footnote. In the elliptic curve example, the parallel lines meet at a point at infinity. A line that misses the closed component of the elliptic curve and only passes through the second component has 1 real point of intersection but there would be 2 more if we were working in ℂ² rather than ℝ².

In algebraic terms, the system of equations

y² = x³ − 2x + 1
3y = 2x + k

has three real solutions for small values of k, but for sufficiently large values of |k| two of the solutions will be complex.

[3] Mathematics: Its Content, Methods, and Meaning. Edited by A. D. Aleksandrov, A. N. Kolmogorov, and M. A. Lavrent’ev. Volume 1.

Rényi’s parking constant

Parallel parking photo by IFCAR, Public Domain

Imagine parallel parking is available along the shoulder of a road, but no parking spaces are marked.

The first person to park picks a spot on the shoulder at random. Then another car also chooses a spot along the shoulder at random, with the constraint that the second car can’t overlap the first. This process continues until no more cars can park. How many people can park this way?

Assume all cars have the same length, and we choose our units so that cars have length 1. The expected number of cars that can park is a function M(x) of the length of the parking strip x. Clearly if x < 1 then M(x) = 0. Alfréd Rényi [1] found that for x ≥ 1, M(x) satisfies the equation

M(x) = 1 + \frac{2}{x-1}\int_0^{x-1} M(t)\, dt

This is an unusual equation, difficult to work with because it defined M only implicitly. It’s not even clear that the equation has a solution. But it does, and the ratio of M(x) to x approaches a constant as x increases.

m = \lim_{x\to\infty} \frac{M(x)}{x} = 0.74759\ldots

The number m is known as Rényi’s parking constant.

This says for a long parking strip, parking sequentially at random will allow about 3/4 as many cars to park as if you were to pack the cars end-to-end.

More posts on Rényi’s work

[1] Steven R. Finch. Mathematical Constants. Cambridge University Press, 2003.

Calculating when a planet will appear to move backwards

When we say that the planets in our solar system orbit the sun, not the earth, we mean that the motions of the planets is much simpler to describe from the vantage point of the sun. The sun is no more the center of the universe than the earth is. Describing the motion of the planets from the perspective of our planet is not wrong, but it is inconvenient. (For some purposes. It’s quite convenient for others.)

The word planet means “wanderer.” This because the planets appear to wander in the night sky. Unlike stars that appear to orbit the earth in a circle as the earth orbits the sun, planets appear to occasionally reverse course. When planets appear to move backward this is called retrograde motion.

Apparent motion of Venus

Here’s what the motion of Venus would look like over a period of 8 years as explored here.

Venus from Earth

Venus completes 13 orbits around the sun in the time it takes Earth to complete 8 orbits. The ratio isn’t exactly 13 to 8, but it’s very close. Five times over the course of eight years Venus will appear to reverse course for a few days. How many days? We will get to that shortly.

When we speak of the motion of the planets through the night sky, we’re not talking about their rising and setting each day due to the rotation of the earth on its axis. We’re talking about their motion from night to night. The image above is how an observer far above the Earth and not rotating with the Earth would see the position of Venus over the course of eight years.

The orbit of Venus as seen from earth is beautiful but complicated. From the Copernican perspective, the orbits of Earth and Venus are simply concentric circles. You may bristle at my saying planets have circular rather than elliptical orbits [1]. The orbits are not exactly circles, but are so close to circular that you cannot see the difference. For the purposes of this post, we’ll assume planets orbit the sun in circles.

Calculating retrograde periods

There is a surprisingly simple equation [2] for finding the points where a planet will appear to change course:

\cos(kt) = \frac{\sqrt{Rr}}{R + r - \sqrt{Rr}}

Here r is the radius of Earth’s orbit and R is the radius of the other planet’s orbit [3]. The constant k is the difference in angular velocities of the two planets. You can solve this equation for the times when the apparent motion changes.

Note that the equation is entirely symmetric in r and R. So Venusian observing Earth and an Earthling observing Venus would agree on the times when the apparent motions of the two planets reverse.

Example calculation

Let’s find when Venus enters and leaves retrograde motion. Here are the constants we need.

r = 1       # AU
R = 0.72332 # AU

venus_year = 224.70 # days
earth_year = 365.24 # days
k = 2*pi/venus_year - 2*pi/earth_year

c = sqrt(r*R) / (r + R - sqrt(r*R))

With these constants we can now plot cos(kt) and see when it equals c.

This shows there are five times over the course of eight years when Venus is in apparent retrograde motion.

If we set time t = 0 to be a time when Earth and Venus are aligned, we start in the middle of retrograde period. Venus enters prograde motion 21 days later, and the next retrograde period begins at day 563. So out of every 584 days, Venus spends 42 days in retrograde motion and 542 days in prograde motion.

Related posts

[1] Planets do not exactly orbit in circles. They don’t exactly orbit in ellipses either. Modeling orbits as ellipses is much more accurate than modeling orbits as circles, but not still not perfectly accurate.

[2] 100 Great Problems of Elementary Mathematics: Their History and Solution. By Heinrich Dörrie. Dover, 1965.

[3] There’s nothing unique about observing planets from Earth. Here “Earth” simply means the planet you’re viewing from.