Beta inequality symmetries

I was thinking about the work I did when I worked in biostatistics at MD Anderson. This work was practical rather than mathematically elegant, useful in its time but not of long-term interest. However, one result came out of this work that I would call elegant, and that was a symmetry I found.

Let X be a beta(a, b) random variable and let Y be a beta(c, d) random variable. Let g(a, b, c, d) be the probability that a sample from X is larger than a sample from Y.

g(a, b, c, d) = Prob(X > Y)

This function often appeared in the inner loop of a simulation and so we spent thousands of CPU-hours computing its values. I looked for ways to evaluate this function more quickly, and along the way I found a symmetry.

The function I call g was studied by W. R. Thompson in 1933 [1]. Thompson noted two symmetries:

g(a, b, c, d) = 1 − g(c, d, a, b)

and

g(a, b, c, d) = g(d, c, b, a)

I found an additional symmetry:

g(a, b, c, d) = g(d, b, c, a).

The only reference to this result in a journal article as far as I know is a paper I wrote with Saralees Nadarajah [2]. That paper cites an MD Anderson technical report which is no longer online, but I saved a copy here.

Related posts

[1] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, Volume 25, Issue 4. pp. 285 – 294.

[2] John D. Cook and Saralees Nadarajah. Stochastic Inequality Probabilities for Adaptively Randomized Clinical Trials. Biometrical Journal. 48 (2006) pp 256–365.

 

When is a function of two variables separable?

Given a function f(xy), how can you tell whether f can be factored into the product of a function g(x) of x alone and a function h(y) of y alone? Depending on how an expression for f is written, it may or may not be obvious whether f(x, y) can be separated into g(x) h(y).

There are several situations in which you might want to know whether a function is separable. For example, the ordinary differential equation

y′ = f(x, y)

can be solved easily when f(x, y) = g(x) h(y).

You might want to do something similar for a partial differential equation, using separation of variables, possibly choosing a coordinate system that allows the separation of variables trick to work.

Aside from applications to differential equations, you might want to know whether a polynomial in two variables can be factored into the product of polynomials in each variable separately.

In [1] David Scott gives a simple necessary condition for f to be separable:

f fxy = fx fy

Here the subscripts indicate partial derivatives.

It’s easy to see this condition is necessary. Scott shows the condition is also sufficient under some mild technical assumptions.

As an example, determine the value of k such that the differential equation

y′ = 6xy² + 3y² −4x + k

is separable.

Scott’s equation

f fxy = fx fy

becomes

(6xy² + 3y² −4x + k)(12y) = (6y² −4)(12xy + 6y)

which holds if and only if k = −2.

Related posts

[1] David Scott. When is an Ordinary Differential Equation Separable? The American Mathematical Monthly, Vol. 92, No. 6, pp. 422–423

Applications of Bernoulli differential equations

When a nonlinear first order ordinary differential equation has the form

\frac{dy}{dx} + P(x)\,y = Q(x)\, y^n

with n ≠ 1, the change of variables

u = y^{1-n}

turns the equation into a linear equation in u. The equation is known as Bernoulli’s equation, though Leibniz came up with the same technique. Apparently the history is complicated [1].

It’s nice that Bernoulli’s equation can be solve in closed form, but is it good for anything? Other than doing homework in a differential equations course, is there any reason you’d want to solve Bernoulli’s equation?

Why yes, yes there is. According to [1], Bernoulli’s equation is a generalization of a class of differential equations that came out of geometric problems.

Someone asked about applications of Bernoulli’s equation on Stack Exchange and got a couple interesting answers.

The first answer said that a Bernoulli equation with n = 3 comes up in modeling frictional forces. See also this post on drag forces.

The second answer links to a paper on Bernoulli memristors.

Related posts

[1] Adam E. Parker. Who Solved the Bernoulli Differential Equation and How Did They Do It? College Mathematics Journal, vol. 44, no. 2, March 2013.

The IQ Test That AI Can’t Pass

Large language models have recently achieved remarkable test scores on well-known academic and professional exams (see, e.g., [1], p. 6). On such tests, these models are at times said to reach human-level performance. However, there is one test that humans can pass but every AI method known to have been tried has abysmally failed.

The Abstraction and Reasoning Corpus (ARC) benchmark [2] was developed by François Chollet to measure intelligence in performing tasks never or rarely seen before. We all do tasks something like this every day, like making a complicated phone call to correct a mailing address. The test is composed of image completion problems similar to Raven’s Progressive Matrices but more complex. Given images A, B and C, one must identify the image D such that the relationship “A is to B as C is to D” holds. Sometimes several examples of the A:B relationship are given.

The problem is hard because the relationship patterns between A and B that humans could easily identify (for example, image shrinking, rotating, folding, recoloring, etc.) might be many, many different things—more than can easily be trained for. By construction, every problem is qualitatively, unpredictably different, so the common approach of training on the training set doesn’t work. Instead, bonafide reasoning on a new kind of problem for each case is required.

Several competitions with prize money have encouraged progress on the ARC benchmark [3]. In these, each entrant’s algorithm must be tested against an unseen ARC holdout set. The leaderboard for the ARCathon 2023 challenge completed last month shows top score of 30 percent [4]; this is excellent progress on a very hard problem, but far from a perfect score or anything else resembling passing.

Ilya Sutskever has famously warned we shouldn’t bet against deep learning, and perhaps a future LLM will do much better on this benchmark. Others feel a new approach is needed, for example, from the burgeoning field of neurosymbolic methods. In any case, these results show at the present moment in this rapidly progressing field, we don’t seem to be anywhere close to strong forms of AGI, artificial general intelligence.

[1] OpenAI, “GPT-4 Technical Report,” https://cdn.openai.com/papers/gpt-4.pdf

[2] François Chollet, “On the Measure of Intelligence,” https://arxiv.org/abs/1911.01547

[3] “Abstraction and Reasoning Challenge,” https://www.kaggle.com/c/abstraction-and-reasoning-challenge

[4] “Winners – Lab42, “https://lab42.global/past-challenges/arcathon-2023/

Means of means bounding the logarithmic mean

The geometric, logarithmic, and arithmetic means of a and b are defined as follows.

\begin{align*} G &= \sqrt{ab} \\ L &= \frac{b - a}{\log b - \log a} \\ A &= \frac{a + b}{2} \end{align*}

A few days ago I mentioned that GLA. The logarithmic mean slips between the geometric and arithmetic means.

Or to put it another way, the logarithmic mean is bounded by the geometric and arithmetic means. You can bound the logarithmic mean more tightly with a mixture of the geometric and arithmetic means.

In [1] the authors show that

G^{2/3} A^{1/3} \leq L \leq \tfrac{2}{3}G + \tfrac{1}{3}A

Note that the leftmost expression is the geometric mean of G, G, and A, and the rightmost expression is the arithmetic mean of G, G, and A. We can write this as

G(G, G, A) \leq L \leq A(G, G, A)

where G with no argument is the geometric mean of a and b and G with three arguments is the geometric mean of those arguments, and similarly for A.

The following plot shows how well these means of means bound the logarithmic mean. We let a = 1 and let b vary from 1 to 10o.

The upper bound is especially tight for moderate values of b. When I first made the plot I let b run up to 10 and there were apparently only four curves in the plot. I had to pick a larger value of b before the curves for L and (2G + A)/3 to be distinguished.

Related posts

[1] Graham Jameson and Peter R. Mercer. The Logarithmic Mean Revisited. The American Mathematical Monthly, Vol. 126, No. 7, pp. 641-645

Base 64 encoding remainder problem

I’ve mentioned base 64 encoding a few times here, but I’ve left out a detail. This post fills in that detail.

Base 64 encoding comes up in multiple contexts in which you want to represent binary data in text form. I’ve mentioned base 64 encoding in the context of Gnu ASCII armor. A more common application is MIME (Multipurpose Internet Mail Extensions) encoding.

Base 64 encoding uses 64 characters (A…Z, a…z, 0…9, +, and /) to represent six bits at a time.

In the previous post I showed how ASCII armor encoded a 91,272 byte JPEG image into a text file and how it could convert the text back into binary. The number of bytes in the file a multiple of 3, which you could quickly confirm by casting out nines.

If the number of bytes in a file is not a multiple of three, the number of bits is not a multiple of six, and so we have to do something with the remainder.

For an example, let’s start with a file containing the bits

000000 000001 000010 000011 000100 000101 000110 000111

If we run gpg --enarmor on this file, we get

ABCDEFGH
=/u99

and some extra text for human consumption. The base 64 encoding is ABCDEFGH, the equal sign is a separator, and /u99 is a checksum.

If we delete the last 8 bits from our file and run ASCII armor again, we get

ABCDEFE=
=/IPL

The second equal sign separates the base 64 encoding from the checksum, but the first equal sign is something new.

If we chop eight more bits off the end of the file we get

ABCDEA==
=izh9

What’s going on here? In each case, the first 30 bits are being encoded as ABCDE. The remaining bits are

000101 000110 000111

When we cut off the last 8 bits we were left with

000101 0001

The bits 00101 are encoded as F, and the last four bits are padded to 00100 and encoded as E. The trailing equal sign is a signal that two bits were added as padding.

When we cut off 8 more bits we were left with

00

which was padded to 000000 and encoded as A. Then two equal signs were added to signal that four bits were added as padding.

So the rule is add two or four 0s at the end to make the number of bits a multiple of six. Then add an equal sign for each pair of bits added.

Since file sizes are multiples of bytes, and a byte is 8 bits, the number of bits in a file is always even. This means the remainder when the number of bits is divided by 6 is 0, 2, or 4. So if we add padding, we only add two or four zero bits and never an odd number of bits.

Binary to text to binary

Gnu Privacy Guard includes a way to encode binary files as plain ASCII text files, and turn these text files back into binary. This is intended as a way to transmit encrypted data, but it can be used to convert any kind of file from binary to text and back to binary.

To illustrate this, I’ll use Albrecht Dürer’s Melencolia I as an example. (More on this image and its mathematical significance here.)

Albrecht Dürer’s engraving Melencolia I

This image is saved in the file Melancholia.jpg.

Binary to text

If we run

   gpg --enarmor Melencolia.jpg

at a command line, it produces a file Melancholia.jpg.asc, the asc suffix indicating an ASCII file.

We can look inside this file if we’d like.

-----BEGIN PGP ARMORED FILE-----
Comment: Use "gpg --dearmor" for unpacking

/9j/4AAQSkZJRgABAQEBLAEsAAD/4QBoRXhpZgAATU0AKgAAAAgABAEaAAUAAAAB
AAAAPgEbAAUAAAABAAAARgEoAAMAAAABAAIAAAExAAIAAAARAAAATgAAAAAAAAEs
AAAAAQAAASwAAAABUGFpbnQuTkVUIHYzLjUuOAAA/9sAQwACAQECAQECAgICAgIC
...
wJ/wkzRf8IN4LCCVVwNCtM8sDnPl5zkk565NbcH7Lnw01K0gtrn4feCriB4QfLl0
S2dV+bdgApwMknAwMk+tFFcktjqif//Z
=rpd1
-----END PGP ARMORED FILE-----

The cryptic text /9j/4A…//Z is a base 64 encoded representation of the binary file. Think of the file as one big binary number. Write that number in base 64, i.e. partition the bits into groups of six. Then represent the base 64 “digits” as alphanumeric characters, plus the symbols + and /. More on the details here.

The line =rpd1 is a 24-bit CRC checksum. The equal sign is a separator, and rpd1 is a base 64 encoding the checksum.

The JPG file is 91,272 bytes and the ASCII file is 123,712 bytes. The ASCII file is about 1/3 larger because every six bits in the binary file corresponds to an eight-bit ASCII character. The ASCII file is a little bit more than 1/3 larger because of the human-friendly text above and below the base 64 encoding, the newline characters, and the checksum.

Text to binary

If we run

    gpg --dearmor Melencolia.jpg.asc

at a command line, it produces a file Melancholia.jpg.asc.gpg. This file is bit-for-bit exactly the same as the original file, which we could confirm by running

    diff Melencolia.jpg Melencolia.jpg.asc.gpg

Related posts

Why “a caret, euro, trademark” ’ in a file?

A with caret, euro, trademark

Why might you see ’ in the middle of an otherwise intelligible file? The reason is very similar to the reason you might see �, which I explained in the previous post. You might want to read that post first if you’re not familiar with Unicode and character encodings.

It all has to do with an encoding error, probably. Not necessarily, since, for example, I deliberately put ’ in the opening sentence. But assuming it is an error, it’s likely an encoding error.

But it’s the opposite of the � error. The � occurs when non- UTF-8 text has been declared (or implicitly interpreted as) Unicode. In particular, you can run into this error if text encoded in ISO 8859-1 is interpreted as as UTF-8.

The ’ sequence is usually the opposite: UTF-8 encoded text is being interpreted as Windows-1252 (a.k.a. CP-1252) encoded text. In particular, a single quote (U+2019) encoded in UTF-8 has been interpreted as the Windows-1252 text ’.

Windows-1252 is a superset of IDO 8859-1, the error resulting in � could also be described as a Windows-1252 error. So a � means Windows-1252 text has been interpreted as UTF-8, and ’ means UTF-8 has been interpreted as Windows-1252. In the former case there is an invalid character. In the latter case all the characters are valid, though they’re not the characters you were supposed to see.

You can fix the error by making your content and your encoding match. Or remove the offending character, replacing the single quote with ’.

You can find more details in this Stack Overflow post.

A valid character to represent an invalid character

U+FFFD REPLACEMENT CHARACTER

You may have seen a web page with the symbol � scattered throughout the text, especially in older web pages. What is this symbol and why does it appear unexpected?

The symbol we’re discussing is a bit of a paradox. It’s the (valid) Unicode character to represent an invalid Unicode character. If you just read the first sentence of this post, without inspecting the code point values, you can’t tell whether the symbol appears because I’ve made a mistake, or whether I’ve deliberately included the symbol.

The symbol in question is U+FFFD, named REPLACEMENT CHARACTER, a perfectly valid Unicode character. But unlike this post, you’re most likely to see it when the author did not intend for you to see it. What’s going on?

It all has to do with character encoding.

If all you want to do is represent Roman letters and common punctuation marks, and control characters, ASCII is fine. There are 128 ASCII characters, so they fit into a single 8-bit byte. But as soon as you want to write façade, jalapeño, or Gödel you have a problem. And of course you have a bigger problem if your language doesn’t use the Roman alphabet at all.

ASCII wastes one bit per byte, so naturally people wanted to take advantage of that extra bit to represent additional characters, such as the ç, ñ, and ö above. One popular way of doing this was described in the standard ISO 8859-1.

Of course there are other ways of encoding characters. If your language is Russian or Hebrew or Chinese, you’re no happier with ISO 8859-1 than you are with ASCII.

Enter Unicode. Let’s represent all the word’s alphabets (and ideograms and common symbols and …) in a single system. Great idea, though there are a mind-boggling number of details to work out. Even so, once you’ve assigned a number to every symbol that you care about, there’s still more work to do.

You could represent every character with two bytes. Now you can represent 65,536 characters. That’s too much and too little. If you want to represent text that is essentially made of Roman letters plus occasional exotic characters, using two bytes per letter makes the text 100% larger.

And while 65,536 sounds like a lot of characters, it’s not enough to represent every Chinese character, much less all the characters in other languages. Turns out we need four bytes to do what Unicode was intended to do.

So now we have to deal with encodings. UTF-8 is a brilliant solution to this problem. It can handle all Unicode characters, but if text is just ASCII, it won’t be any longer: ASCII is a valid UTF-8 encoding of the subset of Unicode that belonged to ASCII.

But there were other systems before Unicode, like ISO 8859-1, and if your file is encoded as ISO 8859-1, but a web browser thinks its UTF-8 encoded Unicode, some characters could be invalid. Browsers will use the character � as a replacement for invalid text that it could not otherwise display. That’s probably what’s going on when you see �.

See the article on How UTF-8 works to understand why some characters are not just a different character than intended but illegal UTF-8 byte sequences.