Compressing a set of hash values

Suppose you have a set of k hash values, each n bits long. Can you compress the set into less than kn bits?

It’s not possible to compress a list of hashes into less than kn bits, but you can compress a set of hashes into fewer bits.

Suppose you have a set of 2^30, roughly a billion, 64-bit hashes. Sort the set and look at the size of gaps between elements. You might expect that consecutive items on the list are roughly 2^34 apart, and so you could reconstruct the list by reporting the first item and the gaps between the rest, which are 34-bit numbers, not 64-bit numbers, a savings of 30 bits per hash.

This doesn’t exactly work, but it’s the kernel of a good idea. We don’t know that the distance between hashes can be represented by a 34-bit number. The gap could be more or less than 2^34, but we don’t expect it to often be much more than 2^34. So we use a variable-length encoding such that when the distance between values is on the order of 2^34 or less we save bits, but we allow for the distance to be any size.

Applications

What we’ve described is the essence of Golomb-Rice coding. Its implementation in the Bitcoin protocol is referred to as Golomb-Coded Sets (GCS), described in BIP 158. Golomb-Rice coding is also used in other applications where the values to be compressed are not hashes, such as lossless audio compression.

Encoding

Let’s go into some detail as to how the distances between sorted values are represented. Suppose you expect the differences to be on the order of M where M is a power of 2. For each difference d, let q and r be the quotient and remainder by M, i.e.

d = qM + r.

Encode q as a unary number, i.e. a string of q 1s, and encode r as an ordinary binary number. Then the Golomb-Rice coding of d is the concatenation of the representations of q and r, with a 0 in the middle as a separator. Using || to denote string concatenation we have

unary(q)  ||  0  ||  binary(r).

In general, unary encoding is extremely inefficient, but we’re betting that q will typically be quite small.

The reason we require M to be a power of 2 is so the representation of r will have a fixed length [1].

Let’s work out an example. Suppose M = 2^20 and

d = 2^22 + 123456 = 4 × M + 123456.

Then we write q as 1111 and r as 00011110001001000000 and encode d as the string

1111000011110001001000000
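
Here is a minimal Python sketch of this encoder. The function name and interface are mine, not from any standard library; M is passed as its base-2 logarithm so the remainder can be written with a fixed width.

def golomb_rice_encode(d, log2M):
    # quotient and remainder of d by M = 2^log2M
    q, r = divmod(d, 1 << log2M)
    # unary(q), a 0 separator, then r as a fixed-width binary string
    return "1" * q + "0" + format(r, f"0{log2M}b")

print(golomb_rice_encode(2**22 + 123456, 20))
# 1111000011110001001000000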

Decoding

You can concatenate the encodings of consecutive d values and still be able to unambiguously decode the result. Because the r representations have a constant length, you know when an r ends and the next q begins.

For example, suppose we have the following string of bits.

1110010011001011001011111001000000110010001110011101111000101111011

We decode the string from left to right. We count how many 1s we see, skip over a 0, then regard the next 20 bits as a binary number.

111 0 01001100101100101111 1 0 01000000110010001110 0 11101111000101111011

We see three 1s before the first 0 and so we conclude q1 = 3. We skip over the 0 and read the value of r.

r1 = 01001100101100101111₂ = 314159.

Next we see a single 1 before the next 0, so q2 = 1. We read the next value of r as

r2 = 01000000110010001110₂ = 265358.

Next we see a 0, i.e. there are no 1s, and so the final q3 = 0. And we have

r3 = 11101111000101111011₂ = 979323.

So our reconstructed values of d are

d1 = q1M + r1 = 3 × 2^20 + 314159 = 3459887

d2 = q2M + r2 = 1 × 2^20 + 265358 = 1313934

d3 = q3M + r3 = 0 × 2^20 + 979323 = 979323.

Now these are difference values. We need to know the smallest value m in order to construct the original set of values from the differences. Then the full values are m, m + d1, m + d1 + d2, and m + d1 + d2 + d3.
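
Here is a companion decoder, again just a sketch with an illustrative interface. It recovers the differences from a concatenated bit string and reconstructs the full values as running sums starting from a given smallest value m (zero by default).

def golomb_rice_decode(bits, log2M, m=0):
    values, total, i = [m], m, 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":        # count 1s up to the 0 separator
            q += 1
            i += 1
        i += 1                       # skip the separator
        r = int(bits[i:i + log2M], 2)
        i += log2M
        total += q * (1 << log2M) + r
        values.append(total)
    return values

bits = "1110010011001011001011111001000000110010001110011101111000101111011"
print(golomb_rice_decode(bits, 20))
# [0, 3459887, 4773821, 5753144]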


[1] This is the Rice part. Robert Rice simplified Samuel Golomb’s encoding scheme in the special case that M is a power of 2.

Automation and Validation

It’s been said that whatever you can validate, you can automate. An AI that produces correct work 90% of the time could be very valuable, provided you have a way to identify the 10% of cases where it is wrong. Often verifying a solution takes far less computation than finding a solution.

Validating AI output can be tricky since the results are plausible by construction, though not always correct.

Consistency checks

One way to validate output is to apply consistency checks. Such checks are necessary but not sufficient, and often easy to implement. A simple consistency check might be that the inputs to a transaction equal the outputs. A more sophisticated consistency check might be conservation of energy or something analogous to it.
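
As a minimal sketch, with a made-up data format, a bookkeeping check that debits equal credits might look like this.

def balanced(entries):
    # entries is a list of (amount_in_cents, "debit" or "credit") pairs
    debits  = sum(a for a, kind in entries if kind == "debit")
    credits = sum(a for a, kind in entries if kind == "credit")
    return debits == credits

print(balanced([(1500, "debit"), (1000, "credit"), (500, "credit")]))  # True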

Certificates

Some problems have certificates, ways of verifying that a calculation is correct that can be evaluated with far less effort than finding the solution that they verify. I’ve written about certificates in the context of optimization, solving equations, and finding prime numbers.
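
For example, a candidate solution to a linear system is cheap to check: verification costs a matrix-vector product while the solve itself costs far more. The problem size and tolerance below are arbitrary, chosen only to illustrate the idea.

import numpy as np

def certify(A, b, x, tol=1e-8):
    # checking a proposed solution is O(n^2); solving is O(n^3)
    return np.linalg.norm(A @ x - b) <= tol * np.linalg.norm(b)

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 1000))
b = rng.normal(size=1000)
x = np.linalg.solve(A, b)    # expensive step
print(certify(A, b, x))      # cheap certificate check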

Formal methods

Correctness is more important in some contexts than others. If a recommendation engine makes a bad recommendation once in a while, the cost is a lower probability of conversion in a few instances. If an aircraft collision avoidance system makes an occasional error, the consequences could be catastrophic.

When the cost of errors is extremely high, formal verification may be worthwhile. Formal correctness proofs using something like Lean or Rocq are extremely tedious and expensive to create, and hence not economical. But if an AI can generate a result and a formal proof of correctness, hurrah!

Who watches the watchmen?

But if an AI result can be wrong, why couldn’t a formal proof generated to defend the result also be wrong? As the Roman poet Juvenal asked, Quis custodiet ipsos custodes? Who will watch the watchmen?

An AI could indeed generate an incorrect proof, but if it does, the proof assistant will reject it. So the answer to who will watch Claude, Gemini, and ChatGPT is Lean, Rocq, and Isabelle.

Who watches the watchers of the watchmen?

Isn’t it possible that a theorem prover like Rocq could have a bug? Of course it’s possible; there is no absolute certainty under the sun. But hundreds of PhD-years of work have gone into Rocq (formerly Coq) and so bugs in the kernel of that system are very unlikely. The rest of the system is bootstrapped, verified by the kernel.

Even so, an error in the theorem prover does not mean an error in the original result. For an incorrect result to slip through, the AI-generated proof would have to be wrong in a way that happens to exploit an unknown error in the theorem prover. It is far more likely that you’re trying to prove the wrong thing than that the theorem prover let you down.

I mentioned collision avoidance software above. I looked into such software when I did some work for Amazon’s drone program. The software that was formally verified was also unrealistic in its assumptions. The software was guaranteed to work correctly if two objects are flying at precisely constant velocity, at precisely the same altitude, etc. If everything were operating according to geometrically perfect assumptions, there would be no need for collision avoidance software.

Regular expressions that cross lines

One of the fiddly parts of regular expressions is how to handle line breaks. Should regular expression searches be applied one line at a time, or should an entire file be treated as a single line?

This morning I was trying to track down a LaTeX file that said “discussed in the Section” rather than simply “discussed in Section.” I wanted to search on “the Section” to see whether I had a similar error in other files.

Line breaks don’t matter to LaTeX [1], so “the” could be at the end of one line and “Section” at the beginning of another. I found what I was after by using

    grep -Pzo "the\s+Section" foo.tex

Here -P tells grep to use Perl regular expressions. That’s not necessary here, but I imprinted on Perl regular expressions long ago, and I use PCRE (Perl compatible regular expressions) whenever possible so I don’t have to remember the annoying little syntax differences between various regex implementations.

The -z option says to treat the entire file as one long string. This eliminates the line break issue.

The -o option says to output only what the regular expression matches. Otherwise grep will return the matching line. Ordinarily that wouldn’t be so bad, but because of the -z option, the matching line is the entire file.

The \s+ characters between the and Section represent one or more whitespace characters, such as a space or a newline.

The -P flag is a GNU feature, so it works on Linux. But macOS ships with BSD-derived versions of its utilities, and its version of grep does not support the -P option. On my MacBook I have ggrep mapped to the GNU version of grep.

Another option is to use ripgrep rather than grep. It uses Perl-like regular expressions, and so there is no need for anything like the -P flag. The analog of -z in ripgrep is -U, so the counterpart of the command above would be

    rg -Uo "the\s+Section" foo.tex

Usually regular expression searches are so fast that execution time doesn’t matter. But when it does matter, ripgrep can be an order of magnitude faster than grep.
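
If you’d rather sidestep the differences between grep implementations altogether, the same search is easy to do in Python by reading the whole file as one string. This is a sketch of the idea, not a drop-in replacement for the commands above.

import re

with open("foo.tex") as f:
    text = f.read()          # whole file as one string, so \s+ can span lines

print(re.findall(r"the\s+Section", text))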


[1] LaTeX decides how to break lines in the output independent of line breaks in the input. This allows you to arrange the source file logically rather than aesthetically.

Obscuring P2P nodes with Dandelion

The weakest link in the privacy of cryptocurrency transactions is often outside the blockchain. There are technologies such as stealth addresses and subaddresses to try to thwart attempts to link transactions to individuals. They do a good job of anonymizing transaction data, but the weak link may be metadata, as is often the case.

Cryptocurrency nodes circulate transaction data using a peer-to-peer network. An entity running multiple nodes can compare when data arrived at each of its nodes and triangulate to infer which node first sent a set of transactions. The Dandelion protocol, and its refinement Dandelion++, aims to mitigate this risk. Dandelion++ is currently used in Monero and a few other coins; other cryptocurrencies have considered or are considering using it.

[Image: dandelion plant]

The idea behind the Dandelion protocol is to have a “stalk” period and a “diffusion” period. Imagine data working up the stalk of a dandelion plant before diffusing like seeds in the wind. The usual P2P process is analogous to simply blowing on the seed head [1].

During the stalk period, information travels from one node to a single peer. Then after some number of hops, the diffusion process begins: the final node in the stalk period diffuses the information to all its peers. An observer with substantial but not complete visibility of the network may be able to determine which node initiated the diffusion, but not necessarily the node at the other end of the stem.
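
To make the two phases concrete, here is a toy simulation, not taken from any real implementation: the network is a dictionary of adjacency lists, the stalk is a short random walk, and diffusion is ordinary flooding from wherever the walk ends.

import random

def dandelion_broadcast(graph, source, stalk_hops=3):
    # stalk phase: hand the transaction to a single random peer, a few hops
    node = source
    for _ in range(stalk_hops):
        node = random.choice(graph[node])
    # diffusion phase: ordinary flooding from the last node on the stalk
    seen, frontier = {node}, {node}
    while frontier:
        frontier = {p for n in frontier for p in graph[n]} - seen
        seen |= frontier
    return node    # the node an observer is likely to blame

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
print(dandelion_broadcast(graph, source=0))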

A natural question is how this differs from something like Tor. In a nutshell, Tor offers identity protection before you enter a P2P network, and Dandelion offers identity protection inside the P2P network.

For more details, see the original paper on Dandelion [2].


[1] The original paper on Dandelion uses a dandelion seed as the metaphor for the protocol. “The name ‘dandelion spreading’ reflects the spreading pattern’s resemblance to a dandelion seed head” and refers to the diagram below. However, other sources refer to the stalk and head of the dandelion plant, not just a single seed. Both mental images work since the plant has a slightly fractal structure with a single seed looking something like the plant.

[Image: illustration from the Dandelion protocol paper]

[2] Shaileshh Bojja Venkatakrishnan, Giulia Fanti, Pramod Viswanath. Dandelion: Redesigning the Bitcoin Network for Anonymity. Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 1, Issue 1 Article No.: 22, Pages 1–34. Available here: https://doi.org/10.1145/3084459.

Monero subaddresses

Monero has a way of generating new addresses analogous to the way HD wallets generate new addresses for Bitcoin. In both cases, the recipient’s software can generate new addresses to receive payments that others cannot link back to the recipient.

Monero users have two public/private key pairs: one for viewing and one for spending. Let Ks and ks be the public and private spending keys, and let Kv and kv be the public and private viewing keys. Then the user’s ith subaddress is given by

\begin{align*} K^s_i &= K^s + H(k^v, i) G \\ K^v_i &= k^v K^s_i \end{align*}

Here G is a generator for the elliptic curve Ed25519 and H is a hash function. The hash function output and kv are integers; the public keys, denoted by capital Ks with subscripts and superscripts, are points on Ed25519. The corresponding private keys are

\begin{align*} k^s_i &= k^s + H(k^v, i) \\ k^v_i &= k^v k^s_i \end{align*}
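
Here is a toy numerical check of this algebra. The additive group of integers mod a prime stands in for Ed25519, so the arithmetic is easy to follow but offers no security, and SHA-256 stands in for H. All the specific numbers are made up for illustration.

import hashlib

p = 2**61 - 1    # a prime, standing in for the group order
G = 5            # stand-in generator; a "point" is a multiple of G mod p

def H(kv, i):
    data = f"{kv}:{i}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % p

ks, kv = 1234567, 7654321          # private spend and view keys (toy values)
Ks, Kv = ks * G % p, kv * G % p    # corresponding public keys

i = 3                              # subaddress index
Ksi = (Ks + H(kv, i) * G) % p      # public spend subkey
Kvi = kv * Ksi % p                 # public view subkey

ksi = (ks + H(kv, i)) % p          # private spend subkey
kvi = kv * ksi % p                 # private view subkey

# the derived private keys generate the derived public keys
assert Ksi == ksi * G % p
assert Kvi == kvi * G % p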

As with hierarchical wallets, the user scans the blockchain to see which of his addresses have received funds.

A user may choose to give a different subaddress for each transaction for added security, or to group transactions for accounting purposes.

Note that in addition to subaddresses, Monero uses stealth addresses. An important difference between subaddresses and stealth addresses is that recipients generate subaddresses, and senders generate stealth addresses. Someone could send you money to the same stealth address twice, failing to create a new stealth address. This is not possible if you give the sender a different subaddress each time.

Janus attacks

The possibility of a Janus attack is a fly in the subaddress ointment. If an attacker suspects that two subaddresses belong to the same wallet, they can confirm this suspicion by sending a transaction to one subaddress and making it look like it came from the other. If the recipient confirms receipt of the funds, they inform the attacker that his suspicion was correct. A cautious user might not confirm receipt of funds, but a cryptocurrency exchange would. You can read more about the details of a Janus attack here.


How stealth addresses work in Monero

Suppose Alice runs a confidential restaurant. Alice doesn’t want there to be any record of who visited her restaurant but does want to get paid for her food. She accepts Monero, and instead of a cash register there are two QR codes on display, one corresponding to her public view key A and the other corresponding to her public spend key S.

How Bob buys his burger

A customer Bob walks into the restaurant and orders a burger and fries. When Bob pays Alice, here’s what’s going on under the hood.

Bob is using software that generates a random integer r and multiplies it by a point G on an elliptic curve, specifically ed25519, obtaining the point

R = rG

on the curve. The software also multiplies Alice’s view key A, a point on the same elliptic curve, by r, then runs a hash function H on the product rA that returns an integer k.

k = H(rA).

Finally, Bob’s software computes the point

P = kG + S

and sends Alice’s cash register, i.e. her crypto wallet, the pair of points (P, R). The point P is a stealth address, an address that will only be used this one time and cannot be linked to Alice or Bob [1]. The point R is additional information that helps Alice receive her money.

How Alice gets paid

Alice and Bob share a secret: both know k. How’s that?

Alice’s public view key A is the product of her private view key a and the group generator G [2]. So when Bob computes rA, he’s computing r(aG). Alice’s software can multiply the point R by a to obtain a(rG).

rA = r(aG) = a(rG) = aR.

Both Alice and Bob can hash this point—which Alice thinks of as aR and Bob thinks of as rA—to obtain k. This is ECDH: elliptic curve Diffie-Hellman key exchange.

Next, Alice’s software scans the blockchain for payments to

P = kG + S.

Note that P is on the blockchain, but only Alice and Bob know how to factor P into kG + S because only Alice and Bob know k. And only Alice can spend the money because only she knows the private key s corresponding to the public spend key S where

S = sG.

She knows

P = kG + sG = (k + s)G

and so she has the private key (k + s) corresponding to P.
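
Here is the whole exchange as a toy calculation. As before, the additive group of integers mod a prime stands in for ed25519, so the algebra is visible but nothing here is secure; the keys are random toy values and SHA-256 stands in for H.

import hashlib, secrets

p = 2**61 - 1    # stand-in for the group order
G = 5            # stand-in generator

def H(point):
    return int.from_bytes(hashlib.sha256(str(point).encode()).digest(), "big") % p

a = secrets.randbelow(p); A = a * G % p    # Alice's view key pair
s = secrets.randbelow(p); S = s * G % p    # Alice's spend key pair

# Bob's side
r = secrets.randbelow(p)
R = r * G % p
k = H(r * A % p)
P = (k * G + S) % p                # the stealth address

# Alice's side
k2 = H(a * R % p)                  # same shared secret, by Diffie-Hellman
assert k2 == k
assert P == (k2 * G + S) % p       # Alice recognizes the payment
assert P == (k2 + s) * G % p       # and can spend it with the key k + s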


[1] Bob sends money to the address P, so there could be some connection between Bob and P on the Monero blockchain. However, due to another feature of Monero, namely ring signatures, someone analyzing the blockchain could only determine that Bob is one of 16 people who may have sent money to the address P, and there’s no way to know who received the money. That is, there is no way to determine, using only information on the blockchain, who received the money. A private investigator who saw Bob walk into Alice’s restaurant would have additional information outside the blockchain.

[2] The key assumption of elliptic curve cryptography is that it’s computationally infeasible to “divide” on an elliptic curve, i.e. to recover a from knowledge of G and aG. You could recover a by brute force if the group were small, but the elliptic curve ed25519 has on the order of 2^255 points, and a is some integer chosen randomly between 1 and the size of the curve.

Physical Keys and Encryption Keys

A physical key, such as a house key, is a piece of metal with cuts of differing depths. Typically there may be around 6 cuts, with five different possible depths for each cut. This allows 5^6 = 15,625 possible keys.

Encryption keys, such as AES keys, are a string of bits, often 128 bits, for a total of 2^128 possible keys.

How long would a physical key have to be to have the same level of security as an encryption key? We’d need to solve

5^n = 2^128

which means

n = 128 / log₂ 5 = 55.12.

So we’d need a key with around 55 notches.
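
A quick check of the arithmetic:

from math import log2
print(128 / log2(5))   # 55.12...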

[Image: metal key with 55 notches]

This only takes into account combinatorial possibilities, not the difficulty of attacking a physical key or a binary key. There are incomparably more possibilities for binary keys, but encryption attacks can be automated and carried out remotely (unless a computer is air gapped). A physical lock can only be attacked in person. It takes a lock picker orders of magnitude more time to try a key than it takes a password cracking program to try a password. On the other hand, locks aren’t picked by trying thousands of keys.

Related post: Measuring cryptographic strength in liters of boiling water

Why and how Bitcoin uses Merkle trees

Yesterday’s post looked at a recently mined Bitcoin block, the 920,994th block in the blockchain, and verified that it contains the hash of the previous block. This post will look at the same block and verify its Merkle tree root.

Before getting down to the bytes, we’ll back up and say what a Merkle tree is and why Bitcoin uses one.

What is a Merkle tree?

A Merkle tree is a binary tree. The leaves of the tree are labeled with a hash of the data associated with each leaf. The label of each interior node is the hash of the concatenation of the two child hashes. The label of the root of the tree is the Merkle tree root.

The smallest change to any data in the tree will change the Merkle tree root [1]. If you weren’t concerned with efficiency, you wouldn’t need a Merkle tree. You could simply concatenate all the data in the leaves and compute a hash. But the Merkle tree makes it possible to verify a node without having to verify the whole tree.

To see this, imagine starting at the bottom of the tree and working your way up to compute the Merkle root. You need the hash value of the node you’re interested in and its sibling. Then you can compute the label of the parent node. If you know the label of the parent’s sibling node (you could call it the aunt or uncle) then you can compute the label of the grandparent node, and so forth until you get to the root. If you get the expected value for the root, the data is correct.

How Bitcoin uses Merkle trees

Bitcoin makes a Merkle tree with the data for each transaction being a leaf node. Simple Payment Verification (SPV) lets you verify a transaction knowing only the hash of the transaction, the hash of its sibling, and the hashes of the siblings moving up the tree: the aunts, great aunts, great great aunts etc.

If a block has 2^n transactions, you only need n hash values to verify a single transaction. The binary tree has n levels, and you need two hashes at the bottom, and one hash for each of the levels between the bottom and the top. So in a block of 1024 transactions, you need 10 hash values. Note that this saves space two ways: you don’t need transaction data per se, only hashes of transaction data, and the number of hashes you need is log2 of the number of transactions.
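
Here is a sketch of that verification step. The proof format below (a list of sibling hashes tagged with which side they sit on) is made up for illustration, and plain SHA-256 is used as the hash; Bitcoin itself hashes with double SHA-256.

import hashlib

def parent(left, right):
    # an interior node's label is the hash of its children's labels, concatenated
    return hashlib.sha256(left + right).digest()

def verify(leaf_hash, siblings, root):
    # siblings: (hash, side) pairs from the bottom of the tree up
    h = leaf_hash
    for sib, side in siblings:
        h = parent(sib, h) if side == "left" else parent(h, sib)
    return h == root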

What if the number of transactions is not a power of 2? If there is an odd number of nodes at a given level of the Merkle tree, the last node is paired with itself.

For example, the block that we’ll examine in detail contains 1279 transactions. At the bottom of our Merkle tree there are effectively 1280 nodes, with the 1280th node being a duplicate of the 1279th. Moving up the tree we have 640 nodes, 320 nodes, 160 nodes, … 10 nodes, and 5 nodes. The 5th node is paired with itself, and so the next level up has 3 nodes. The third node is paired with itself, making 2 nodes on the next level, and then the root above that.

Down to the bytes

The command

xxd -l 80 920994.dat

gives a hex dump of the block header:

0000 002e c2df b57e f582 75c2 7a64 33c5
edfd dfc1 af6a b9ae 708c 0000 0000 0000
0000 0000 bb15 accc 8940 3811 0b9b e1cf
b125 c0fa a65e fb1b 2fa4 61ed 989f 8d6f
e41e 04cf 5522 ff68 21eb 0117 f2bc ab1a

This time we’re looking at bytes 37 through 68, which contain the Merkle tree root.

A miner verifying the block would recompute the Merkle tree root and confirm that it matches the value given in the header.

A Bitcoin block does not contain a Merkle tree per se, only its root. The hash of the data for each transaction, what Bitcoin calls a txid, is redundant since it can be computed. And if it were included in the block, a miner would need to recompute it to verify that it is correct.

We can fetch a list of txids for a block by calling

https://blockstream.info/api/block/{block_hash}/txids

with the hash of the block we want, which in our case is

00000000000000000001c2f89cee9f8c2fe88b4a93c2c2b75192918fc438ca32

This returns a JSON string with 1279 hash values:

fc1d784c62600565d4b4fbfe9cd45a7abc5f1c3273e294fd2b44450c5c9b9e9a
d4bce41744156a8f0e6588e42efef3519904a19169212c885930e908af1457ec
7bc6428c7d023108910a458c99359a2664de7c02343d96b133986825297e518d
…
c9132f178830d0c7a781e246bb5c2b3f4a9686c3d2b6f046b892a073d340eaff

Note that the API call did not pull the txids from the block; it computed them. Presumably the server computed the list of txids once and cached the results rather than computing them when we called the API.

We can compute the Merkle tree root by concatenating pairs of txids and hashing the results. Since the number of transactions is odd, the last txid is paired with itself. So moving one level up the tree, the first node is labeled with the hash of

fc1…e9a || d4b…7ec

and the last node is labeled with the hash of

c91…aff || c91…aff

We continue this process until we compute the Merkle tree root. Then we can confirm that it matches the value given in the header. As with the block hash in the previous post, the Merkle tree root is stored in the header in little endian form.
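
Here is a sketch of that computation, assuming txids is the list of hex strings returned by the API call above. Bitcoin uses double SHA-256, and internally it works with the byte-reversed (little-endian) form of each txid, so the final result is reversed again to compare with the usual hex display.

import hashlib

def dbl_sha256(b):
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def merkle_root(txids):
    level = [bytes.fromhex(t)[::-1] for t in txids]   # little-endian bytes
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])                   # pair the last node with itself
        level = [dbl_sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0][::-1].hex()                       # back to big-endian hex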


[1] There is a vanishingly small chance that a change in the data would not change the root. With a 256-bit hash, the chances of a change in the data not changing the root are 1 in 2^256.

Random samples from a tetrahedron

Exactly one month ago I wrote about sampling points from a triangle. This post will look at the three dimensional analog, sampling from a tetrahedron. The generalization to higher dimensions works as well.

Sampling from a triangle

In the triangle post, I showed that a naive approach doesn’t work but a variation does. If the vertices of the triangle are A, B, and C, then the naive approach is to generate three uniform random numbers, normalize them, and use them to form a linear combination of the vertices. That is, generate

r1, r2, r3

uniformly from [0, 1], then define

r1′ = r1/(r1 + r2 + r3)
r2′ = r2/(r1 + r2 + r3)
r3′ = r3/(r1 + r2 + r3)

and return as a random sample the point

r1′ A + r2′ B + r3′ C.

This oversamples points near the middle of the triangle and undersamples points near the vertices. But everything above works if you replace uniform samples with exponential samples.

Sampling from a tetrahedron

The analogous method works for a tetrahedron with vertices A, B, C, and D. Generate exponential random variables ri for i = 1, 2, 3, 4. Then normalize, defining each ri′ to be ri divided by the sum of all the ri. Then the random sample is

r1′ A + r2′ B + r3′ C + r4′ D.

In the case of the triangle, it was easy to visualize that uniformly generated rs did not lead to uniform samples of the triangle, but that exponentially generated rs did. This is harder to do in three dimensions. Even if the points were uniformly sampled from a tetrahedron, they might not look uniformly distributed from a given perspective.

So how might we demonstrate that our method works? One approach would be to take a cube inside a tetrahedron and show that the proportion of samples that land inside the cube is what we’d expect, namely the ratio of the volume of the cube to the volume of the tetrahedron.

This brings up a couple interesting subproblems.

  1. How do we compute the volume of a tetrahedron?
  2. How can we test whether our cube lies entirely within the tetrahedron?

Volume of a tetrahedron

To find the volume of a tetrahedron, form a matrix M whose columns are the vectors from three of the vertices to the remaining vertex. So we could take A − D, B − D, and C − D. The volume of the tetrahedron is the absolute value of the determinant of M, divided by 6.
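
A quick check of this formula, using the vertices from the example below:

import numpy as np

def tetrahedron_volume(A, B, C, D):
    # columns of M are the vectors from A, B, and C to D
    M = np.column_stack([A - D, B - D, C - D])
    return abs(np.linalg.det(M)) / 6

A, B, C, D = map(np.array, ([0, 0, 0], [5, 0, 0], [0, 7, 0], [0, 1, 6]))
print(tetrahedron_volume(A, B, C, D))   # 35, up to floating point error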

Testing whether a cube is inside a tetrahedron

A tetrahedron is convex, and so the line segment joining any two points inside the tetrahedron lies entirely inside the tetrahedron. It follows that if all the vertices of a cube are inside the tetrahedron, the entire cube is inside.

So this brings us to the problem of testing whether a point is inside a tetrahedron. We can do this by converting the coordinates of the point to barycentric coordinates, then testing whether all the coordinates are non-negative and add to 1.
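
Here is a sketch of that test. The barycentric weights come from solving a small linear system, and the tolerance guards against floating point noise at the boundary.

import numpy as np

def in_tetrahedron(P, A, B, C, D, tol=1e-12):
    # write P = A + w1(B-A) + w2(C-A) + w3(D-A) and solve for the weights
    M = np.column_stack([B - A, C - A, D - A])
    w = np.linalg.solve(M, P - A)
    bary = np.append(1 - w.sum(), w)    # the four barycentric coordinates
    return bool(np.all(bary >= -tol))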

Random sampling illustration

I made up four points as vertices of a tetrahedron.

import numpy as np

A = [0, 0, 0]
B = [5, 0, 0]
C = [0, 7, 0]
D = [0, 1, 6]
A, B, C, D = map(np.array, [A, B, C, D])

For my test cube I chose the cube whose vertex coordinates are all combinations of 1 and 2. So the vertex nearest the origin is (1, 1, 1) and the vertex furthest from the origin is (2, 2, 2). You can verify that all the vertices are inside the tetrahedron, and so the cube is inside the tetrahedron.

The tetrahedron has volume 35, and the cube has volume 1, so we expect 1/35th of the random points to land inside the cube. Here’s the code to sample the tetrahedron and calculate the proportion of points that land in the cube.

from scipy.stats import expon

def incube(v):
    return 1 <= v[0] <= 2 and 1 <= v[1] <= 2 and 1 <= v[2] <= 2

def good_sample(A, B, C, D):
    r = expon.rvs(size=4)
    r /= sum(r)
    return r[0]*A + r[1]*B + r[2]*C + r[3]*D

def bad_sample(A, B, C, D):
    r = np.random.random(size=4)
    r /= sum(r)
    return r[0]*A + r[1]*B + r[2]*C + r[3]*D

N = 1_000_000

badcount  = 0
goodcount = 0

for _ in range(N):
    v = bad_sample(A, B, C, D)
    if incube(v):
        badcount += 1        
    v = good_sample(A, B, C, D)
    if incube(v):
        goodcount += 1

print(badcount/N, goodcount/N, 1/35)

This produced

0.094388 0.028372 0.028571…

The good (exponential) method matched the expected proportion to three decimal places, but the bad (uniform) method put about 3.3 times as many points in the cube as expected.

Note that three decimal agreement is about what we’d expect via the central limit theorem since 1/√N = 0.001.

If you’re porting the sampler above to an environment where you don’t have a function for generating exponential random samples, you can roll your own by returning −log(u) where u is a uniform sample from [0, 1].

More general 3D regions

After writing the post about sampling from triangles, I wrote a followup post about sampling from polygons. In a nutshell, divide your polygon into triangles, then randomly select a triangle in proportion to its area, then sample with that triangle. You can do the analogous process in three dimensions by dividing a general volume into tetrahedra.

Fitting a double exponential function to three points

The previous post needed to fit a double exponential function to data, a function of the form

a \exp(b \exp(cx))

By taking logs, we have a function of the form

f(x) = a + b \exp(cx)

where the a in the latter equation is the log of the a in the former.

In the previous post I used a curve fitting function from SciPy to find the parameters ab, and c. The solution in the post works, but I didn’t show all the false starts along the way. I ran into problems with overflow and with the solution method not converging before I came up with the right formulation of the problem and a good enough initial guess at the solution.

This post will look at fitting the function more analytically. I’ll still need to solve an equation numerically, but an equation in only one variable, not three.

If you need to do a regression, fitting a function with noisy data to more than three points, you could use the code below with three chosen data points to find a good starting point for a curve-fitting method.

Solution

Suppose we have x1 < x2 < x3 and we want to find a, b, and c such that

\begin{align*} y_1 &= a + b \exp(c x_1) \\ y_2 &= a + b \exp(c x_2) \\ y_3 &= a + b \exp(c x_3) \\ \end{align*}

where y1 < y2 < y3.

Subtract the equations pairwise to eliminate a:

\begin{align*} y_2 - y_1 &= b (\exp(c x_2) - \exp(c x_1)) \\ y_3 - y_1 &= b (\exp(c x_3) - \exp(c x_1)) \\ \end{align*}

Divide these to eliminate b:

\frac{y_2 - y_1}{y_3 - y_1} = \frac{\exp(c x_2) - \exp(c x_1)}{\exp(c x_3) - \exp(c x_1)}

Factor out exp(cx1) from the right side:

\frac{y_2 - y_1}{y_3 - y_1} = \frac{1 - \exp(c (x_2 - x_1))}{1 - \exp(c (x_3 - x_1))}

Now the task is to solve

L = \frac{1 - \exp(c d_2)}{1 - \exp(c d_3)}

where

\begin{align*} L &= \frac{y_2 - y_1}{y_3 - y_1} \\ d_2 &= x_2 - x_1 \\ d_3 &= x_3 - x_1 \end{align*}

Note that L, d2, and d3 are all positive.

Once we find c numerically,

b = \frac{y_2 - y_1}{\exp(c x_2) - \exp(c x_1)}

and

a = y_1 - b \exp(c x_1)

Python code

Here’s a Python implementation to illustrate and test the derivation above.

from numpy import exp
from scipy.optimize import newton

def fit(x1, x2, x3, y1, y2, y3):
    assert(x1 < x2 < x3)
    assert(y1 < y2 < y3)

    d2 = x2 - x1
    d3 = x3 - x1
    L = (y2 - y1)/(y3 - y1)
    g = lambda x: L - (1 - exp(x*d2))/(1 - exp(x*d3))
    c = newton(g, 0.315)

    b = (y2 - y1)/(exp(c*x2) - exp(c*x1))
    a = y1 - b*exp(c*x1)
    return a, b, c

a, b, c = fit(9, 25, 28, 42, 55, 92)
print([a + b*exp(c*x) for x in [9, 25, 28]])