# ODE solver as a functional fold

One of the most widely used numerical algorithms for solving differential equation is the 4th order Runge-Kutta method. This post shows how the Runge-Kutta method can be written naturally as a fold over the set of points where the solution is needed. These points do not need to be evenly spaced.

Given a differential equation of the form

with initial condition y(t0) = y0, the 4th order Runge-Kutta method advances the solution by an amount of time h by

where

The Haskell code for implementing the accumulator function for foldl looks very much like the mathematical description above. This is a nice feature of the where syntax.

    rk (t, y) t' = (t', y + h*(k1 + 2.0*k2 + 2.0*k3 + k4)/6.0)
where
h  = t' - t
k1 = f t y
k2 = f (t + 0.5*h) (y + 0.5*h*k1)
k3 = f (t + 0.5*h) (y + 0.5*h*k2)
k4 = f (t + 1.0*h) (y + 1.0*h*k3)


Suppose we want to solve the differential equation y ‘ = (t2y2) sin(y) with initial condition y(0) = -1, and we want to approximate the solution at [0.01, 0.03, 0.04, 0.06]. We would implement the right side of the equation as

     f t y = (t**2 - y**2)*sin(y)

and fold the function rk over our time steps with

    foldl rk (0, -1) [0.01, 0.03, 0.04, 0.06]

This returns (0.06, -0.9527). The first part, 0.06, is no surprise since obviously we asked for the solution up to 0.06. The second part, -0.9527, is the part we’re more interested in.

If you want to see the solution at all the specified points and not just the last one, replace foldl with scanl.

    scanl rk (0, -1) [0.01, 0.03, 0.04, 0.06]

This returns

    [(0.0, -1.0), (0.01, -0.9917), (0.03, -0.9756), (0.042, -0.9678), (0.06, -0.9527)]

As pointed out in the previous post, writing algorithms as folds separates the core of the algorithm from data access. This would allow us, for example, to change rk independently, such as using a different order Runge-Kutta method. (Which hardly anyone does. Fourth order is a kind of sweet spot.) Or we could swap out Runge-Kutta for a different ODE solver entirely just by passing a different function into the fold.

Next see how you could use a fold to compute mean, variance, skewness, and kurtosis in one pass through a stream of data using a fold.

# Functional folds and conjugate models

Bayesian calculations are intrinsically recursive: The posterior distribution from one step becomes the prior for the next step.

With conjugate models, the prior and posterior belong to the same family of distributions. If a distribution from this family is determined by a fixed set of parameters, we only need to update these parameters at each step. This updating process is defined by an integration problem, i.e. applying Bayes’ theorem, but it’s common with conjugate models for the integration to reduce to a simple algebraic operation on the parameters.

For example, suppose some binary event that happens with probability θ where θ has a beta(α, β) distribution. When we take our next observation, the posterior distribution becomes beta(α+1, β) if the event occurred and beta(α, β+1) if it didn’t. There is an integration problem in the background that reduces to simply adding a 1 to the α parameter for every success and to the β parameter for every failure. (Standard terminology is to call an observation of the thing you’re interested in a “success” regardless of whether the thing you’re observing is good or bad. “Failure” is just an observation of the thing not happening, even if it’s good that it didn’t happen.)

Conjugate models have the structure of a fold in functional programming, also known as a reduce or accumulate operation. We’ll show below how to compute the posterior distribution in a beta-binomial and normal-normal model using folds in Haskell. (Technically a left associative fold, foldl. Haskell also has a right associative fold foldr that we won’t be concerned about here.)

## Folds in Haskell

What does foldl require? Let’s ask GHCi, the Haskell REPL. Prior to GHC version 7.10 we would get the following.

     ghci> :type foldl
foldl :: (a -> b -> a) -> a -> [b] -> a


Unfortunately the type variables a and b just mean “one type” and “a possibly different type” and tell us nothing about how the types are used, so let’s unravel this a little at a time.

Here “a” is the type of the accumulator state. In our case, it will be the prior and posterior distribution parameters because that’s what’s being updated for each data point. Conjugacy makes this work since the prior and posterior distributions belong to the same family. The “b” type is our data. So you could read the type of foldl as follows.

foldl takes three things: a function to fold over the data, an initial accumulator state, and an array of data. It returns the final accumulator state. The accumulator function takes an accumulator state and a data value, and returns a new accumulator state.

Starting with GHC 7.10 the type of foldl is a little more general.

     ghci> :type foldl
foldl :: Foldable t => (a -> b -> a) -> a -> t b -> a


Instead of foldl operating only on lists, it can operate on any “foldable” container. To recover the type from earlier versions of the compiler, replace t with []. (Haskell uses [] b and [b] as equivalent ways of indicating a list of elements of type b.) There is one other difference I’ve edited out: The latest version of GHCi reverses the roles of ‘a’ and ‘b’. This is confusing, but has no effect since the type variable names have no meaning. I swapped the a’s and b’s back like they were to make the two definitions easier to compare.

## Beta-binomial example

Returning to the beta-binomial model discussed above, suppose we have a sequence of observations consisting of a 1 when the event in question happened and a 0 when it did not. And suppose our prior distribution on the probability of the event has a beta(α, β) distribution.

Now what function will we fold over our data to update our prior into a posterior?

     ghci> let f params y = (fst params + y, snd params + 1 - y)


The function f takes a pair of numbers, the two beta distribution parameters, and a data point y, and returns an updated set of beta parameters. If the data point is a 1 (success) then the α> parameter is incremented, and if the data point is a 0 (failure) then the β parameter is incremented. If we start with a beta(0.2, 0.8) prior and observe the data [1, 1, 0, 0, 1] then we apply f to our data as follows.

     ghci> let d = [1, 1, 0, 0, 1]
ghci> foldl f (0.2, 0.8) d


The result will be (3.2, 2.8). The three successes incremented the first beta parameter and the two failures incremented the second beta parameter.

To see how the foldl operates, we can run scanl instead. It works like foldl but returns a list of intermediate accumulator values rather than just the final accumulator value.

     ghci> scanl f (0.2, 0.8) d


This returns

     [ (0.2, 0.8), (1.2, 0.8), (2.2, 0.8), (2.2, 1.8), (2.2, 2.8), (3.2, 2.8) ].

he initial accumulator state is (0.2, 0.8) because we started with a beta(0.2, 0.8) prior. Then we increment the accumulator state to (1.2, 0.8) because the first data point was a 1. Then we increment the state to (2.2, 0.8) because the second data point is also a 1. Next we see two failures in a row and so the second part of the accumulator is incremented two times. After observing the last data point, a success, our final state is (3.2, 2.8), just as when we applied foldl.

## Normal-normal example

Next we consider another Bayesian model with a conjugate prior. We assume our data come from a normal distribution with mean θ and known variance 1. We assume a conjugate prior distribution on θ, normal with mean μ0 and variance τ02.

This time the posterior distribution calculation is more complicated and so our accumulator function is more complicated. But the form of the calculation is the same as above: we fold an accumulator function over the data.

Here is the function that encapsulates our posterior distribution calculation.

     f params y = ((m/v  + y)*newv, newv)
where
m = fst params -- mean
v = snd params -- variance
newv = v/(v + 1)


Now suppose our prior distribution on θ has mean 0 and variance 4, and we observe two values, [3, 5].

     ghci> foldl f (0, 4) [3, 5]


This returns (3.5555, 0.4444). To see the intermediate accumulation state, i.e. after just seeing y = 3, we run scanl instead and see

     [ (0, 4), (2,4, 0,8), (3.5555, 0.4444) ]

## Inspiration

This post was inspired by a paper by Brian Beckman (in progress) that shows how a Kalman filter can be implemented as a fold. From a Bayesian perspective, the thing that makes the Kalman filter work is that a certain multivariate normal model has a conjugate prior. This post shows that conjugate models more generally can be implemented as folds over the data. That’s interesting, but what does it buy you? Brian’s paper discusses this in detail, but one advantage is that it completely separates the accumulator function from the data, so the former can be tested in isolation.

## Next up: ODEs

The next post shows how to implement an ODE solver (4th order Runge-Kutta) as a fold over the list of time steps where the solution is needed.

# Retronyms and Backronyms

A retronym is a new name created for an old thing, often made necessary by technological changes. For example, we have terms like “postal mail” or “snail mail” for what used to simply be “mail” because email has become the default. What was once called a “transmission” is now called a “manual transmission” since most cars (at least in the US) now have an automatic transmission.

A backronym is sort of a fictional etymology, such as a meaning retrofitted to an acronym. Richard Campbell explains Structured Query Language is a backronym for SQL.

IBM’s first database was not relational. Its second database, DB2, was a sequel to its first database, and so they wanted to call its query language SEQUEL but they were unable to copyright the name. So they dropped the vowels, shortened it to SQL. Later someone came up with the backronym “Structured Query Language.”

The APGAR score for newborns is a mnemonic backronym. Virginia Apgar came up with her scoring system ten years before someone came up with the backronym Appearance, Pulse, Grimace, Activity and Respiration.

# Flood control parks

The park in the photo above flooded. And that’s a good thing. It’s designed to flood so that homes don’t.

It’s not really a park that flooded. It’s a flood control project that most of the time doubles as a park. Ordinarily the park has a lake, but a few days a year the park is a lake.

Harris County, Texas has an unusually large amount of public recreational land. One reason the county can afford this is that some of the recreational land serves two purposes.

Other Houston-area posts:

# Kalman filters and bottom-up learning

Kalman filtering is a mixture of differential equations and statistics. Kalman filters are commonly used in tracking applications, such as tracking the location of a space probe or tracking the amount of charge left in a cell phone battery. Kalman filters provide a way to synthesize theoretical predictions and actual measurements, accounting for error in both.

Engineers naturally emphasize the differential equations and statisticians naturally emphasize the statistics. Both perspectives are valuable, but in my opinion/experience, the engineering perspective must come first.

From an engineering perspective, a Kalman filtering problem starts as a differential equation. In an ideal world, one would simply solve the differential equation and be done. But the experienced engineer realizes his or her differential equations don’t capture everything. (Unlike the engineer in this post.) Along the road to the equations at hand, there were approximations, terms left out, and various unknown unknowns.

The Kalman filter accounts for some level of uncertainty in the process dynamics and in the measurements taken. This uncertainty is modeled as randomness, but this doesn’t mean that there’s necessarily anything “random” going on. It simply acknowledges that random variables are an effective way of modeling miscellaneous effects that are unknown or too complicated to account for directly. (See Random is as random does.)

The statistical approach to Kalman filtering is to say that it is simply another estimation problem. You start from a probability model and apply Bayes’ theorem. That probability model has a term inside that happens to come from a differential equation in practice, but this is irrelevant to the statistics. The basic Kalman filter is a linear model with normal probability distributions, and this makes a closed-form solution for the posterior possible.

You’d be hard pressed to start from a statistical description of Kalman filtering, such as that given here, and have much appreciation for the motivating dynamics. Vital details have simply been abstracted away. As a client told me once when I tried to understand his problem starting from the top-down, “You’ll never get here from there.”

The statistical perspective is complementary. Some things are clear from the beginning with the statistical formulation that would take a long time to see from the engineering perspective. But while both perspectives are valuable, I believe it’s easier to start on the engineering end and work toward the statistics end rather than the other way around.

History supports this claim. The Kalman filter from the engineering perspective came first and its formulation in terms of Bayesian statistics came later. Except that’s not entirely true.

Rudolf Kálmán published his seminal paper in 1960 and four years later papers started to come out making the connection to Bayesian statistics. But while Kálmán and others were working in the US starting from the engineering end, Ruslan Stratonovich was working in Russia starting from the statistical end. Still, I believe it’s fair to say that most of the development and application of Kalman filters has proceeded from the engineering to the statistics rather than the other way around.

More on Kalman filters

# Cepstral alanysis vocabulary

An earlier post defined cepstrum and quefrency. This post explains some of the other quirky terms introduced in the same paper by Bogert, Healy, and Tukey. (Given Tukey’s delight in coining words, we can assume he was the member of the trio responsible for the new terms.)

The paper [1] explains why the new twists on familiar words:

In general, we find ourselves operating on the frequency side in ways customary on the time side and vice versa. Experience has made it clear that “words that sound like other words,” although strange at first sight, considerably reduce confusion on balance. These parallel or “paraphrased” words are made by the interchange of consonants or consonant groups, as in “alanysis” from “analysis,” and are introduced as needed.

The magnitude and phase of a cepstrum are called gamnitude and saphe. (The latter explains the pun “saphe cracking” in the title.)

Filtering in the cepstral domain is called liftering. A high-pass filter corresponds to a long-pass lifter and a low-pass filter corresponds to a short-pass lifter.

Harmonics in spectra correspond to rahmonics in cepstra.

Some of these terms are helpful. As explained in the previous post, the independent variable in cepstral analysis, quefrency, differs enough from frequency that it helps to have a separate term for it. Using the terms long and short rather than high and low is helpful for the same reason. Using repiod for the analog of period seems gratuitous, but maybe it’s necessary for consistency. Once you introduce some new terminology, you have to keep going.

[1] Bruce P. Bogert, M. J. R. Healy, John W. Tukey. The Quefrency Analysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe Cracking. Collected works of John Tukey volume 1

# Selecting clients

One of the themes in David Ogilvy’s memoir Confessions of an Advertising Man is the importance of selecting good clients. For example, he advises “never take associations as clients” because they have “too many masters, too many objectives, too little money.”

He also recommends not taking on clients that are so large that you would lose your independence and financial robustness by taking them on.

I have never wanted to get an account so big that I could not afford to lose it. The day you do that, you commit yourself to living with fear. Frightened agencies lose the courage to give candid advice; once you lose that you become a lackey.

This is what lead me to refuse an invitation to compete for the Edsel account. I wrote to Ford: “Your account would represent one-half of our total billing. This would make it difficult for us to sustain our independence of counsel.” If we had entered the Edsel contest, and if we had won it, Ogilvy, Benson & Bather would have gone down the drain with Edsel.

This sort of thinking was very much on my mind when I was preparing to leave my last job to strike out on my own. As Nassim Taleb discusses in Antifragile, a steady job seems safer than entrepreneurship, but in some ways it’s not. With one big client, i.e. an employer, you are less exposed to small risks but more exposed to big risks. Your income doesn’t vary per month, unless it suddenly drops to zero.

In addition to looking for good clients, Ogilvy shares several stories of letting go of bad clients. I have yet to resign from a bad client—I haven’t had any bad clients—but I value the option to do so. The option to resign from a project makes it less likely that you’ll find yourself in a project you wish to resign from.

# How about one good one?

I’m no fan of tobacco companies or their advertising tactics, but I liked the following story.

When the head of a mammoth [advertising] agency solicited the Camel Cigarette account, he promised to assign thirty copywriters to it, but the canny head of R. J. Reynolds replied, “How about one good one?” Then he gave his account to a young copywriter called Bill Esty, in whose agency it has remained for twenty-eight years.

One really good person can accomplish more than thirty who aren’t so good, especially in creative work.

# Cepstrum, quefrency, and pitch

John Tukey coined many terms that have passed into common use, such as bit (a shortening of binary digit) and software. Other terms he coined are well known within their niche: boxplot, ANOVA, rootogram, etc. Some of his terms, such as jackknife and vacuum cleaner, were not new words per se but common words he gave a technical meaning to.

Cepstrum is an anagram of spectrum. It involves an unusual use of power spectra, and is roughly analogous to making anagrams of a word. A related term, one we will get to shortly, is quefrency, an anagram of frequency. Some people pronounce the ‘c’ in cepstrum hard (like ‘k’) and some pronounce it soft (like ‘s’).

Let’s go back to an example from my post on guitar distortion. Here’s a note played with a fairly large amount of distortion:

And here is its power spectrum:

There’s a lot going on in the spectrum, but the peaks are very regularly spaced. As I mentioned in the post on the sound of a leaf blower, this is the fingerprint of a sound with a definite pitch. Spikes in the spectrum alone don’t indicate a definite pitch if they are irregularly spaced.

The peaks are fairly periodic. How to you find periodic patterns in a signal? Fourier transform! But if you simply take the Fourier transform of a Fourier transform, you essentially get the original signal back. The key to the cepstrum is to do something else between the two Fourier transforms.

The cepstrum starts by taking the Fourier transform, then the magnitude, then the logarithm, and then the inverse Fourier transform.

When we take the magnitude, we throw away phase information, which we don’t need in this context. Taking the log of the magnitude is essentially what you do when you compute sound pressure level. Some define the cepstrum using the magnitude of the Fourier transform and some the magnitude squared. Squaring only introduces a multiple of 2 once we take logs, so it doesn’t effect the location of peaks, only their amplitude.

Taking the logarithm compresses the peaks, bringing them all into roughly the same range, making the sequence of peaks roughly periodic.

When we take the inverse Fourier transform, we now have something like a frequency, but inverted. This is what Tukey called quefrency.

Looking at the guitar power spectrum above, we see a sequence of peaks spaced 440 Hz apart. When we take the inverse Fourier transform of this, we’re looking at a sort of frequency of a frequency, what Tukey calls quefrency. The quefrency scale is inverted: sounds with a high frequency fundamental have overtones that are far apart on the frequency domain, so the sequence of the overtone peaks has low frequency.

Here’s the plot of the cepstrum for the guitar sample.

There’s a big peak at 109 on the quefrency scale. The audio clip was recorded at 48000 samples per second, so the 109 on the quefrency scale corresponds to a frequency of 48000/109 = 440 Hz. The second peak is at quefrency 215, which corresponds to 48000/215 = 223 Hz. The second peak corresponds to the perceived pitch of the note, A3, and the first peak corresponds to its first harmonic, A4. (Remember the quefrency scale is inverted relative to the frequency scale.)

I cheated a little bit in the plot above. The very highest peaks are at 0. They are so large that they make it hard to see the peaks we’re most interested in. These low quefrency peaks correspond to very high frequency noise, near the edge of the audible spectrum or beyond.

# Bring out your equations!

Nice discussion from Fundamentals of Kalman Filtering: A Practical Approach by Paul Zarchan and Howard Musoff:

Often the hardest part in Kalman filtering is the subject that no one talks about—setting up the problem. This is analogous to the quote from the recent engineering graduate who, upon arriving in industry, enthusiastically says, “Here I am, present me with your differential equations!” As the naive engineering graduate soon found out, problems in the real world are frequently not clear and are subject to many interpretations. Real problems are seldom presented in the form of differential equations, and they usually do not have unique solutions.

Whether it’s Kalman filters, differential equations, or anything else, setting up the problem is the hard part, or at least a hard part.

On the other hand, it’s about as impractical to only be able to set up problems as it is to only be able to solve them. You have to know what kinds of problems can be solved, and how accurately, so you can formulate a problem in a tractable way. There’s a feedback loop: provisional problem formulation, attempted solution, revised formulation, etc. It’s ideal when one person can set up and solve a problem, but it’s enough for the formulators and solvers to communicate well and have some common ground.

Related posts:

# Top tweets

I had a couple tweets this week that were fairly popular. The first was a pun on the musical Hamilton and the Hamiltonian from physics. The former is about Alexander Hamilton (1755–1804) and the latter is named after William Rowan Hamilton (1805–1865).

The second was a sort of snowclone, a variation on the line from the Bhagavad Gita that J. Robert Oppenheimer famously quoted in reference to the atomic bomb:

# Tonal prominence in a leaf blower

This afternoon I was working on a project involving tonal prominence. I stepped away from the computer to think and was interrupted by the sound of a leaf blower. I was annoyed for a second, then I thought “Hey, a leaf blower!” and went out to record it. A leaf blower is a great example of a broad spectrum noise with strong tonal components. Lawn maintenance men think you’re kinda crazy when you say you want to record the noise of their equipment.

The tuner app on my phone identified the sound as an A3, the A below middle C, or 220 Hz. Apparently leaf blowers are tenors.

Here’s a short audio clip:

And here’s what the spectrum looks like. The dashed grey vertical lines are at multiples of 55 Hz.

The peaks are perfectly spaced at multiples of the fundamental frequency of 55 Hz, A1 in scientific pitch notation. This even spacing of peaks is the fingerprint of a definite tone. There’s also a lot of random fluctuation between peaks. That’s the finger print of noise. So together we hear a pitch and noise.

When using the tone-to-noise ratio algorithm from the ECMA-74, only the spike at 110 Hz is prominent. A limitation of that approach is that it only considers single tones, not how well multiple tones line up in a harmonic sequence.

Related posts:

# Loudness and sharpness

This post looks at loudness and sharpness, two important psychoacoustic metrics. Because they have to do with human perception, these factors are by definition subjective. And yet they’re not entirely subjective. People tend to agree on when, for example, one sound is twice as loud as another, or when one sound is sharper than another.

## Loudness

Loudness is the psychological counterpart to sound pressure level. Sound pressure level is a physical quantity, but loudness is a psychoacoustic quantity. The former has to do with how a microphone perceives sound, the latter how a human perceives sound. Sound pressure level in dB and loudness in phon are roughly the same for a pure tone of 1 kHz. But loudness depends on the power spectrum of a sound and not just it’s sound pressure level. For example, if a sound’s frequency is too high or too low to hear, it’s not loud at all! See my previous post on loudness for more background.

Let’s take the four guitar sounds from the previous post and scale them so that each has a sound pressure level of 65 dB, about the sound level of an office conversation, then rescale so the sound pressure is 90 dB, fairly loud though not as loud as a rock concert. [Because sound perception is so nonlinear, amplifying a sound does not increase the loudness or sharpness of every component equally.]

Here are the audio files from the previous post:

Clean note:

Clean chord:

Distorted note:

Distorted chord:

Here’s the loudness, measured in phons, at both sound pressure levels.

|-----------------------+-------+-------|
| Sound                 | 65 dB | 90 dB |
|-----------------------+-------+-------|
| Clean note            |  70.9 |  94.4 |
| Clean chord           |  71.8 |  95.3 |
| Note with distortion  |  81.2 | 103.7 |
| Chord with distortion |  77.0 |  99.6 |
|-----------------------+-------+-------|


While all four sounds have the same sound pressure level, the undistorted sounds have the lowest loudness. The distorted sounds are louder, especially the single note. Increasing the sound pressure level from 65 dB to 90 dB increases the loudness of each sound by roughly the same amount. This will not be true of sharpness.

## Sharpness

Sharpness is related how much a sound’s spectrum is in the high end. You can compute sharpness as a particular weighted sum of the specific loudness levels in various bands, typically 1/3-octave bands. This weight function that increases rapidly toward the highest frequency bands. For more details, see Psychoacoustics: Facts and Models.

Here are the sharpness metrics for the four guitar sounds at 65 dB and 90 dB.

|-----------------------+-------+-------|
| Sound                 | 65 dB | 90 dB |
|-----------------------+-------+-------|
| Clean note            | 0.846 | 0.963 |
| Clean chord           | 0.759 | 0.914 |
| Note with distortion  | 1.855 | 2.000 |
| Chord with distortion | 1.281 | 1.307 |
|-----------------------+-------+-------|


Although a clean chord sounds a little louder than a single note, the former is a little sharper. Distortion increases sharpness as it does loudness. The single note with distortion is a little louder than the other sounds, but much sharper than the others.

Notice that increasing the sound pressure level increases the sharpness of the sounds by different amounts. The sharpness of the last sound hardly changes.

# Electric guitar distortion

The other day I asked on Google+ if someone could make an audio clip for me and Dave Jacoby graciously volunteered. I wanted a simple chord on an electric guitar played with varying levels of distortion. Dave describes the process of making the recording as

Fender Telecaster -> EHX LPB clean boost -> Washburn Soloist Distortion (when engaged) -> Fender Frontman 25R amplifier -> iPhone

Let’s look at the Fourier spectrum at four places in the recording: single note and chord, clean and distorted. These are a 0:02, 0:08, 0:39, and 0:43.

## Power spectra

The first note, without distortion, has most of it’s spectrum concentrated at 220 Hz, the A below middle C.

The same note with distortion has a power spectrum that decays much slow, i.e. the sound has more high frequency components.

Here’s the A major chord without distortion. Note that since the threshold of hearing is around 20 dB, most of the noise components are inaudible.

Here’s the same chord with distortion. Notice there’s much more noise in the audible range.

Update: See the next post an analysis of the loudness and sharpness of the audio samples in this post.

Photo via Brian Roberts CC

# Demystifying artificial intelligence

Computers do what we tell them to do. Period. Any talk of computers doing things they weren’t programmed to do is only a way of speaking. It’s a convenient shorthand when used properly, misleading mysticism when used improperly.

When you write a program

        print(24*7)

you could say that the computer isn’t programmed to print the number 168 in the sense that the code did not say

        print(168)

But of course the computer was programmed to print the number 168. It just wasn’t directly programmed to do so. Instead, it was given data and algorithms to apply to that data to produce the result. The program isn’t very useful because the degree of indirection is tiny. Artificial intelligence is more interesting because it increases the degree of indirection, but it’s still software instructing a computer to take in data and apply algorithms.

When someone says

A computer did this without being programmed!

mentally edit their statement to say

A computer did this without being directly programmed to do so!

The latter may still be impressive, but it’s not magic.

I was at a presentation once where software vendors were claiming that their software “discovered” the equation of motion for a pendulum. The software wasn’t directly programmed to do this, but it was programmed to read in data and find the best fit to the data from a set of basis functions which included sines and cosines. And it correctly found that a linear combination of sines and cosines best described the pendulum’s motion.

The pendulum example was not that impressive, though some applications of artificial intelligence truly are impressive, delivering results several layers of abstraction away from what the software was directly programmed to do. If there’s anything mysterious involved, it’s the statistical regularity of the world that allowed the software to make correct inference from the data it was given.