From logistic regression to AI

It is sometimes said that neural networks are “just” logistic regression. (Remember neural networks? LLMs are neural networks, but nobody talks about neural networks anymore.) In some sense a neural network is logistic regression with more parameters, a lot more parameters, but more is different. New phenomena emerge at scale that could not have been anticipated at a smaller scale.

Logistic regression can work surprisingly well on small data sets. One of my clients filed a patent on a simple logistic regression model I created for them. You can’t patent logistic regression—the idea goes back to the 1840s—but you can patent its application to a particular problem. Or at least you can try; I don’t know whether the patent was ever granted.

Some of the clinical trial models that we developed at MD Anderson Cancer Center were built on Bayesian logistic regression. These methods were used to run early phase clinical trials, with dozens of patients. Far from “big data.” Because we had modest amounts of data, our models could not be very complicated, though we tried. The idea was that informative priors would let you fit more parameters than would otherwise be possible. That idea was partially correct, though it leads to a sensitive dependence on priors.

When you don’t have enough data, additional parameters do more harm than good, at least in the classical setting. Over-parameterization is bad in classical models, though over-parameterization can be good for neural networks. So for a small data set you commonly have only two parameters. With a larger data set you might have three or four.

There is a rule of thumb that you need at least 10 events per variable (EPV) [1]. For example, if you’re looking at an outcome that happens say 20% of the time, you need about 50 data points per parameter. If you’re analyzing a clinical trial with 200 patients, you could fit a four-parameter model. But those four parameters had better pull their weight, and so you typically compute some sort of information criterion—AIC, BIC, DIC, etc.—to judge whether the data justify a particular set of parameters. Statisticians agonize over each parameter because it really matters.
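The arithmetic behind the 10-events-per-variable rule of thumb is easy to spell out. Here’s a quick sketch in Python, using the numbers from the example above:

```python
# Rule of thumb from [1]: at least 10 events per variable (EPV).
epv = 10
event_rate = 0.20   # outcome occurs 20% of the time

# Data points needed per parameter: 10 events / 0.2 events per data point
points_per_param = epv / event_rate
print(points_per_param)   # 50.0

# With 200 patients, how many parameters can the data support?
n_patients = 200
max_params = n_patients // points_per_param
print(max_params)         # 4.0
```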

Imagine working in the world of modest-sized data sets, carefully considering one parameter at a time for inclusion in a model, and hearing about people fitting models with millions, and later billions, of parameters. It just sounds insane. And sometimes it is insane [2]. And yet it can work. Not automatically; developing large models is still a bit of a black art. But large models can do amazing things.

How do LLMs compare to logistic regression as far as the ratio of data points to parameters? Various scaling laws have been suggested. These laws have some basis in theory, but they’re largely empirical, not derived from first principles. “Open” AI no longer shares the size of its training data or the number of parameters it uses, but other model developers do, and as a very rough rule of thumb, models are trained on around 100 tokens per parameter, which is not very different from the EPV rule of thumb for logistic regression.

Simply counting tokens and parameters doesn’t tell the full story. In a logistic regression model, data are typically binary variables, or maybe categorical variables drawn from a small number of possibilities. Parameters are floating point values, typically 64 bits, but maybe the parameter values only matter to about three decimal places, roughly 10 bits. In the example above, 200 samples of 4 binary variables determine 4 ten-bit parameters, so 20 bits of data for every bit of parameter. If the inputs were 10-bit numbers, there would be 200 bits of data per parameter bit.

When training an LLM, a token is typically a 32-bit number, not a binary variable. And a parameter might be a 32-bit number, but quantized to 8 bits for inference [3]. If a model uses 100 tokens per parameter, that corresponds to 400 bits of training data per inference parameter bit.
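The bit accounting in the last two paragraphs can be written out explicitly. A sketch in Python:

```python
# Logistic regression example: 200 samples of 4 binary variables
# determining 4 parameters worth ~10 bits each.
data_bits = 200 * 4 * 1    # 800 bits of data
param_bits = 4 * 10        # 40 bits of parameters
print(data_bits / param_bits)     # 20.0 data bits per parameter bit

# Same inputs, but as 10-bit numbers instead of binary variables.
print(200 * 4 * 10 / param_bits)  # 200.0

# LLM rule of thumb: ~100 32-bit tokens per parameter,
# parameters quantized to 8 bits for inference.
print(100 * 32 / 8)               # 400.0 training bits per inference parameter bit
```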

In short, the ratio of data bits to parameter bits is roughly similar between logistic regression and LLMs. I find that surprising, especially because there’s a sort of no man’s land [2] between a handful of parameters and billions of parameters.


[1] P. Peduzzi, J. Concato, E. Kemper, T. R. Holford, A. R. Feinstein. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, Vol. 49, No. 12 (Dec. 1996), pp. 1373–1379. doi: 10.1016/s0895-4356(96)00236-3.

[2] A lot of times neural networks don’t scale down to the small data regime well at all. It took a lot of audacity to believe that models would perform disproportionately better with more training data. Classical statistics gives you good reason to expect diminishing returns, not increasing returns.

[3] There has been a lot of work lately on finding low precision parameters directly. So you might find 16-bit parameters rather than finding 32-bit parameters and then quantizing to 16 bits.

Differential equation with a small delay

In grad school I specialized in differential equations, but never worked with delay-differential equations, equations in which the derivative of a solution depends not only on the solution’s current state but also on its state at a previous time. The first time I worked with a delay-differential equation came a couple decades later, when I did some modeling work for a pharmaceutical company.

Large delays can change the qualitative behavior of a differential equation, but it seems plausible that sufficiently small delays should not. This is correct, and we will show just how small “sufficiently small” is in a simple special case. We’ll look at the equation

x′(t) = a x(t) + b x(t − τ)

where the coefficients a and b are non-zero real constants and the delay τ is a positive constant. Then [1] proves that the equation above has the same qualitative behavior as the same equation with the delay removed, i.e. with τ = 0, provided τ is small enough. Here “small enough” means

−1/e < bτ exp(−aτ) < e

and

aτ < 1.

There is a further hypothesis for the theorem cited above, a technical condition on the initial data. The solution to a first order delay-differential equation like the one we’re looking at here is not determined by an initial condition x(0) = x₀ alone; we have to specify the solution over the whole interval [−τ, 0]. This can be any function of t, subject only to a technical condition that fails on a nowhere-dense set of initial functions. See [1] for details.

Example

Let’s look at a specific example,

x′(t) = −3 x(t) + 2 x(t − τ)

with the initial condition x(t) = t for t ≤ 1, so that x(1) = 1. If there were no delay, i.e. if τ = 0, the solution for t > 1 would be x(t) = exp(1 − t). In that case the solution monotonically decays to zero.

The theorem above says we should expect the same behavior as long as

−1/e < 2τ exp(3τ) < e

which holds as long as τ < 0.404218.
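The cutoff 0.404218 can be found by solving 2τ exp(3τ) = e numerically. Here’s a minimal bisection in Python; the bracket [0, 1] is my choice, and the left inequality −1/e < 2τ exp(3τ) holds automatically for τ > 0, so only the upper bound matters:

```python
from math import e, exp

# Solve 2 tau exp(3 tau) = e for tau by bisection.
def f(tau):
    return 2 * tau * exp(3 * tau) - e

lo, hi = 0.0, 1.0   # f(0) < 0 and f(1) > 0, so a root lies in between
for _ in range(60):
    mid = (lo + hi) / 2
    if f(mid) > 0:
        hi = mid
    else:
        lo = mid

print(lo)   # approximately 0.404218
```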

Let’s solve our equation for the case τ = 0.4 using Mathematica.

tau = 0.4
solution = NDSolveValue[
    {x'[t] == -3 x[t] + 2 x[t - tau], x[t /; t <= 1] == t }, 
    x, {t, 0, 10}]
Plot[solution[t], {t, 0, 10}, PlotRange -> All]

This produces the following plot.

The solution initially ramps up to 1, because that’s what we specified, but it seems that eventually the solution monotonically decays to 0, just as when τ = 0.
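As a rough cross-check, here is a naive forward-Euler integration of the same equation in Python. This is a sketch, not a substitute for NDSolve; the step size is an arbitrary choice:

```python
# x'(t) = -3 x(t) + 2 x(t - tau), with history x(t) = t for t <= 1
tau = 0.4
dt = 0.001
lag = round(tau / dt)            # delay measured in steps

# Seed the history on [1 - tau, 1] so x(t - tau) is always available.
xs = [1 + (i - lag) * dt for i in range(lag + 1)]   # ends with x(1) = 1

n_steps = round((10 - 1) / dt)   # integrate from t = 1 to t = 10
for _ in range(n_steps):
    x_now, x_delayed = xs[-1], xs[-1 - lag]
    xs.append(x_now + dt * (-3 * x_now + 2 * x_delayed))

print(xs[-1])   # small: the solution has decayed toward zero
```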

When we change the delay to τ = 3 and rerun the code we get oscillations.

[1] R. D. Driver, D. W. Sasser, M. L. Slater. The Equation x′(t) = ax(t) + bx(t − τ) with “Small” Delay. The American Mathematical Monthly, Vol. 80, No. 9 (Nov. 1973), pp. 990–995.

Shell variable ~-

After writing the previous post, I poked around in the bash shell documentation and found a handy feature I’d never seen before, the shortcut ~-.

I frequently use the command cd - to return to the previous working directory, but didn’t know about ~- as a shortcut for the shell variable $OLDPWD, which contains the name of the previous working directory.

Here’s how I will be using this feature now that I know about it. Fairly often I work in two directories, moving back and forth between them using cd -, and need to compare files in the two locations. If I have files in both directories with the same name, say notes.org, I can diff them by running

    diff notes.org ~-/notes.org

I was curious why I’d never run into ~- before. Maybe it’s a relatively recent bash feature? No, it’s been there since bash was released in 1989. The feature was part of C shell before that, though not part of Bourne shell.

Working with file extensions in bash scripts

I’ve never been good at shell scripting. I’d much rather write scripts in a general purpose language like Python. But occasionally a shell script can do something so simply that it’s worth writing a shell script.

Sometimes a shell scripting feature is terse and cryptic precisely because it solves a common problem succinctly. One example of this is working with file extensions.

For example, maybe you have a script that takes a source file name like foo.java and needs to do something with the class file foo.class. In my case, I had a script that takes a LaTeX file name and needs to create the corresponding DVI and SVG file names.

Here’s a little script to create an SVG file from a LaTeX file.

    #!/bin/bash

    latex "$1"
    dvisvgm --no-fonts "${1%.tex}.dvi" -o "${1%.tex}.svg"

The pattern ${parameter%word} is a bash shell parameter expansion that removes the shortest match to the pattern word from the end of the expansion of parameter. So if $1 equals foo.tex, then

    ${1%.tex}

evaluates to

    foo

and so

    ${1%.tex}.dvi

and

    ${1%.tex}.svg

expand to foo.dvi and foo.svg.

You can get much fancier with shell parameter expansions if you’d like. See the bash manual’s section on shell parameter expansion for details.

Hyperbolic versions of latest posts

The post A curious trig identity contained the theorem that for real x and y,

|\sin(x + iy)| = |\sin x + \sin iy|

This theorem also holds when sine is replaced with hyperbolic sine.
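Here is a quick numerical spot check of the hyperbolic version of the identity, in the same spirit as the table check below (a sanity check, not a proof; the sample points are arbitrary):

```python
from cmath import sinh

# |sinh(x + iy)| = |sinh x + sinh iy| for real x and y
for x, y in [(0.3, 1.7), (2.0, -0.5)]:
    lhs = abs(sinh(x + 1j*y))
    rhs = abs(sinh(x) + sinh(1j*y))
    print(abs(lhs - rhs) < 1e-12)   # True
```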

The post Trig of inverse trig contained a table summarizing trig functions applied to inverse trig functions. You can make a very similar table for the hyperbolic counterparts.

\renewcommand{\arraystretch}{2.2} 
\begin{array}{c|c|c|c}
 & \sinh^{-1} & \cosh^{-1} & \tanh^{-1} \\ \hline
\sinh & x & \sqrt{x^{2}-1} & \dfrac{x}{\sqrt{1-x^2}} \\ \hline
\cosh & \sqrt{x^{2} + 1} & x & \dfrac{1}{\sqrt{1 - x^2}} \\ \hline
\tanh & \dfrac{x}{\sqrt{x^{2}+1}} & \dfrac{\sqrt{x^{2}-1}}{x} & x \\
\end{array}

The following Python code doesn’t prove that the entries in the table are correct, but it likely would catch typos.

    from math import *

    def compare(x, y):
        print(abs(x - y) < 1e-12)

    for x in [2, 3]:
        compare(sinh(acosh(x)), sqrt(x**2 - 1))
        compare(cosh(asinh(x)), sqrt(x**2 + 1))
        compare(tanh(asinh(x)), x/sqrt(x**2 + 1))
        compare(tanh(acosh(x)), sqrt(x**2 - 1)/x)                
    for x in [0.1, -0.2]:
        compare(sinh(atanh(x)), x/sqrt(1 - x**2))
        compare(cosh(atanh(x)), 1/sqrt(1 - x**2)) 

Related post: Rule for converting trig identities into hyperbolic identities

Trig of inverse trig

I ran across an old article [1] that gave a sort of multiplication table for trig functions and inverse trig functions. Here’s my version of the table.

\renewcommand{\arraystretch}{2.2} 
\begin{array}{c|c|c|c}
 & \sin^{-1} & \cos^{-1} & \tan^{-1} \\ \hline
\sin & x & \sqrt{1-x^{2}} & \dfrac{x}{\sqrt{1+x^2}} \\ \hline
\cos & \sqrt{1-x^{2}} & x & \dfrac{1}{\sqrt{1 + x^2}} \\ \hline
\tan & \dfrac{x}{\sqrt{1-x^{2}}} & \dfrac{\sqrt{1-x^{2}}}{x} & x \\
\end{array}

I made a few changes from the original. First, I used LaTeX, which didn’t exist when the article was written in 1957. Second, I only include sin, cos, and tan; the original also included csc, sec, and cot. Third, I reversed the labels of the rows and columns. Each cell represents a trig function applied to an inverse trig function.

The third point requires a little elaboration. The table represents function composition, not multiplication, but it is expressed in the format of a multiplication table. For the composition f(g(x)), do you expect f to be on the side or the top? It wouldn’t matter if the functions commuted under composition, but they don’t. I think it feels more conventional to put the outer function on the side; the author made the opposite choice.

The identities in the table are all easy to prove, so the results aren’t interesting so much as the arrangement. I’d never seen these identities arranged into a table before. The matrix of identities is not symmetric, but the 2 by 2 matrix in the upper left corner is, because

\sin(\cos^{-1} x) = \cos(\sin^{-1} x)

The entries of the third row and third column are not symmetric, though they do have some similarities.
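The same kind of numerical spot check used for the hyperbolic table works here, restricting test points to (0, 1) so every entry is defined:

```python
from math import sin, cos, tan, asin, acos, atan, sqrt

def compare(x, y):
    print(abs(x - y) < 1e-12)

for x in [0.3, 0.8]:
    # the symmetric 2-by-2 corner: sin(arccos x) = cos(arcsin x)
    compare(sin(acos(x)), cos(asin(x)))
    # entries from the tan row and tan column
    compare(sin(atan(x)), x / sqrt(1 + x**2))
    compare(cos(atan(x)), 1 / sqrt(1 + x**2))
    compare(tan(asin(x)), x / sqrt(1 - x**2))
    compare(tan(acos(x)), sqrt(1 - x**2) / x)
```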

You can prove the identities in the sin, cos, and tan rows by focusing on the angles θ, φ, and ψ below respectively, because θ = sin⁻¹(x), φ = cos⁻¹(x), and ψ = tan⁻¹(x). This shows that the square roots in the table above all fall out of the Pythagorean theorem.

See the next post for the hyperbolic analog of the table above.

[1] G. A. Baker. Multiplication Tables for Trigonometric Operators. The American Mathematical Monthly, Vol. 64, No. 7 (Aug. – Sep., 1957), pp. 502–503.

A curious trig identity

Here is an identity that doesn’t look correct, but it is. For real x and y,

|\sin(x + iy)| = |\sin x + \sin iy|

I found the identity in [1]. The author’s proof is short. First of all,

\begin{align*} \sin(x + iy) &= \sin x \cos iy + \cos x \sin iy \\ &= \sin x \cosh y + i \cos x \sinh y \end{align*}

Then

\begin{align*} |\sin(x + iy)|^2 &= \sin^2 x \cosh^2 y + \cos^2 x \sinh^2 y \\ &= \sin^2 x (1 + \sinh^2 y) + (1 -\sin^2x) \sinh^2 y \\ &= \sin^2 x + \sinh^2 y \\ &= |\sin x + i \sinh y|^2 \\ &= |\sin x + \sin iy|^2 \end{align*}

Taking square roots completes the proof.

Now note that the statement at the top assumed x and y are real. You can see that this assumption is necessary by, for example, setting x = 2 and y = i.
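Numerically, in Python (the sample points are arbitrary):

```python
from cmath import sin

def close(a, b):
    return abs(a - b) < 1e-12

# The identity holds for real x and y...
for x, y in [(2.0, 1.0), (-0.7, 3.0)]:
    print(close(abs(sin(x + 1j*y)), abs(sin(x) + sin(1j*y))))   # True

# ...but can fail if y is not real: take x = 2, y = i, so x + iy = 1.
x, y = 2.0, 1j
print(close(abs(sin(x + 1j*y)), abs(sin(x) + sin(1j*y))))       # False
```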

Where does the proof use the assumption that x and y are real? Are there weaker assumptions on x and y that are sufficient?


[1] R. M. Robinson. A curious trigonometric identity. The American Mathematical Monthly, Vol. 64, No. 2 (Feb. 1957), pp. 83–85.

Copy and paste law

I was doing some research today and ran into a couple instances where part of one law was copied and pasted verbatim into another law. I suppose this is not uncommon, but I’m not a lawyer, so I don’t have that much experience comparing laws. I do, however, consult for lawyers and have to look up laws from time to time.

Here’s an example from California Health and Safety Code § 1385.10 and the California Insurance Code § 10181.10.

The former says

The health care service plan shall obtain a formal determination from a qualified statistician that the data provided pursuant to this subdivision have been deidentified so that the data do not identify or do not provide a reasonable basis from which to identify an individual. If the statistician is unable to determine that the data has been deidentified, the health care service plan shall not provide the data that cannot be deidentified to the large group purchaser. The statistician shall document the formal determination in writing and shall, upon request, provide the protocol used for deidentification to the department.

The latter says the same thing, replacing “health care service plan” with “health insurer.”

The health insurer shall obtain a formal determination … health insurer shall not provide the data … for deidentification to the department.

I saved the former in a file cal1.txt and the latter in cal2.txt and verified that the files were the same, up to a search and replace, using the following shell one-liner:

diff <(sed 's/care service plan/insurer/g' cal1.txt) cal2.txt

I ran into this because I often provide statistical determination of deidentification, though usually in the context of HIPAA rather than California safety or insurance codes.


Giant Steps

John Coltrane’s song Giant Steps is known for its unusual and difficult chord changes. Although the chord progressions are complicated, there aren’t that many unique chords, only nine. And there is a simple pattern to the chords; the difficulty comes from the giant steps between the chords.

Giant Steps chords

If you wrap the chromatic scale around a circle like a clock, there is a three-fold symmetry. There is only one chord type for each root, and the three roots not represented are evenly spaced. And the pattern of the chord types going around the circle is

minor 7th, dominant 7th, major 7th, skip
minor 7th, dominant 7th, major 7th, skip
minor 7th, dominant 7th, major 7th, skip

To be clear, this is not the order of the chords in Giant Steps. It’s the order of the sorted list of chords.

For more details see the video The simplest song that nobody can play.


Tritone substitution

Big moves in roots can correspond to small moves in chords.

Imagine the 12 notes of a chromatic scale arranged around the hours of a clock: C at 12:00, C♯ at 1:00, D at 2:00, etc. The furthest apart two notes can be is 6 half steps, just as the furthest apart two times can be is 6 hours.

Musical clock

An interval of 6 half steps is called a tritone. That’s a common term in jazz. In classical music you’d likely say augmented fourth or diminished fifth. Same thing.

The largest possible movement in roots corresponds to almost the smallest possible movement between chords. Specifically, to go from a dominant seventh chord to another dominant seventh chord whose roots are a tritone apart only requires moving two notes of the chord a half step each.

For example, C and F♯ are a tritone apart, but a C7 chord and an F♯7 chord are very close together. To move from the former to the latter you only need to move two notes a half step each.

Musical clock

Replacing a dominant seventh chord with one a tritone away is called a tritone substitution, or just tritone sub. It’s called this for two reasons. The root moves a tritone, but also the tritone inside the chord does not move. In the example above, the third and the seventh of the C7 chord become the seventh and third of the F♯7 chord. On the diagram, the dots at 4:00 and 10:00 don’t move.
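You can check the claim about the shared tritone and the two half-step moves with a little pitch-class arithmetic. A sketch in Python, numbering pitch classes C = 0 through B = 11 (the dom7 helper is my own):

```python
# A dominant seventh chord: root, major third, fifth, minor seventh.
def dom7(root):
    return {(root + step) % 12 for step in (0, 4, 7, 10)}

C7  = dom7(0)   # C, E, G, Bb   -> {0, 4, 7, 10}
Fs7 = dom7(6)   # F#, A#, C#, E -> {6, 10, 1, 4}

print(sorted(C7 & Fs7))   # [4, 10]: E and Bb, the shared tritone
print(sorted(C7 - Fs7))   # [0, 7]:  C and G ...
print(sorted(Fs7 - C7))   # [1, 6]:  ... move a half step to C# and F#
```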

Tritone substitutions are a common technique for making basic chord progressions more sophisticated. A common tritone sub is to replace the V of a ii-V-I chord progression, giving a nice chromatic progression in the bass line. For example, in the key of C, a D min – G7 – C progression becomes D min – D♭7 – C.
