Frequently rediscovered technologies

Greenspun’s tenth rule of programming says

Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

Here I’m going to take seriously a rule that was only ever semi-serious. It’s saying three things about Lisp.

  1. It’s a frequently rediscovered technology. There’s something inevitable about it.
  2. It’s not universally known. Not everyone knows about it, so they don’t realize they’re reinventing it.
  3. It’s not easy to implement, hence all the poor implementations.

The same could be said of state machines. A number of projects have grown until they contained an ad hoc, informally-specified, bug-ridden, slow implementation of state machines.

What are other ideas like Lisp and state machines that are frequently and poorly reinvented?

Coffee after Obama

This morning I had coffee and a generous slice of squash bread at Top Pot Doughnuts in Seattle. When I went to pick up my coffee I saw a large photo of President Obama ordering coffee at the same place.

This is the second time I’ve been at a shop that proudly remembered our president stopping by. The first was a diner in San Juan that I cannot remember the name of. They put a plaque on the table President Obama sat at. If Top Pot did too, I missed it.

I enjoyed a few minutes of quiet this morning, something President Obama certainly did not have a chance to do here. One wall of the shop is lined with old books, so naturally that’s where I sat. I pulled down a small book that bound the December 1942 issues of My Garden magazine.

I find the ads in old magazines interesting for what they say between the lines. The ads are more interesting than the articles, the opposite of contemporary magazines, though that depends on the magazine.

One thing that struck me was the small ratio of advertisement to content. I suppose the original subscribers didn’t notice this, or maybe thought the magazine was putting in too many ads. I imagine they would be shocked by magazines that squeeze articles in between the ads.

Another thing that struck me was the somber but hopeful tone of the ads. Ford had a series of ads entitled “Until then …”. The ads seemed to say “We know nobody is going to be buying a car any time soon. But some day this war will be over, and when it is, we hope you’ll remember us.” Other companies tried to make the best of rationing. For example, one ad for shaving cream said that it was a good thing that their product worked so well since there are limits on how much shaving cream you can buy.

It was a pleasant way to start the day, without Secret Service agents buzzing around and without rationing limits on things like coffee and squash bread.

Octave holes on a saxophone

I’ve played saxophone since I was in high school, and I thought I knew how saxophones work, but I learned something new this evening. I was listening to a podcast [1] on musical acoustics and much of it was old hat. Then the host said that a saxophone has two octave holes. Really?! I thought there was only one.

When you press the octave key on the back of a saxophone with your left thumb, the pitch goes up an octave. Sometimes this causes a key on the neck to open up and sometimes it doesn’t [2]. I knew that much.

Saxophone with octave key not open on a high note

Saxophone with octave key open on a high note

 

I thought that when this key didn’t open, the octaves work like they do on a flute: no mechanical change to the instrument, but a change in the way you play. And to some extent this is right: You can make the pitch go up an octave without using the octave key. However, when the octave key is pressed there is a second hole that opens up when the more visible one on the neck closes.

Octave hole for low notes on a saxophone

According to the podcast, the first saxophones had two octave keys to operate with your thumb. You had to choose the correct octave key for the note you’re playing. Modern saxophones work the same as early saxophones except there is only one octave key controlling two octave holes.

* * *

[1] Musical Acoustics from The University of Edinburgh, iTunes U.

[2] On the notes written middle C up to A flat, the octave key raises the little hole I wasn’t aware of. For higher notes the octave key raises the octave hole on the neck.

 

Searching files on Windows

Searching files on Windows is a pain. The built-in search features don’t find everything. There may be ways to make them work, but I haven’t persisted long enough to find them.

On Linux, the combination of find, xargs, and grep works well, and sometimes it works on Windows using the GOW or GnuWin port of these tools. Again there may be a way to make the ported utilities work more as expected, though I haven’t found it. I suspect the problem isn’t with the tools per se but with their interaction with the command line. I also tried Emacs features like rgrep, but these features use the ported find and grep utilities, so you run into the same problems inside Emacs as you do running them directly, and then some.

Ack logo

It looks like ack is the way to go. I heard about it a long time ago and kept meaning to try it out. Now I finally did. It’s fast, convenient, etc. But here are the two things I most like about it:

  1. Ack works the same across platforms.
  2. Ack uses Perl regular expression syntax.

While the alternatives above are supposed to work the same across platforms, they don’t in my experience. But ack does because it’s a pure Perl program. All the portability has been delegated to Perl, where it is well handled. I imagine once I become more familiar with ack I’ll prefer it on Linux as well.

Because it’s a Perl program, ack uses Perl regex syntax. Perl has the most powerful regex implementation out there, though I seldom need any features unique to Perl. More important for me is that the Perl regular expression dialect is the one I remember most easily.


Energy in frequency modulated signals

In an earlier post we proved that if you modulate a cosine carrier by a sine signal you get a signal whose sideband amplitudes are given by Bessel functions. Specifically:

\cos( 2\pi f_c t + \beta \sin(2\pi f_m t) ) = \sum_{n=-\infty}^\infty J_n(\beta) \cos(2\pi(f_c + nf_m)t)

When β = 0, we have the unmodulated carrier, cos(2π f_c t), on both sides. When β is positive but small, J_0(β) is near 1, and so the frequency component corresponding to the carrier is only slightly diminished. Also, the sideband amplitudes, the values of J_n(β) for n ≠ 0, are small and decay rapidly as |n| increases. As β increases, the amplitude of the carrier component decreases, the sideband amplitudes increase, and the sidebands decay more slowly.

We can be much more precise: the energy in the modulated signal is the same as the energy in the unmodulated signal. As β increases, more energy transfers to the sidebands, but the total energy stays the same. This conservation of energy result applies to more complex signals than just pure sine waves, but it’s easier to demonstrate in the case of a simple signal.

Proof

To prove the energy stays constant, we show that the sum of the squares of the coefficients of the cosine components is the same for the modulated and unmodulated signals. The unmodulated signal is just cos(2π f_c t), and so the only coefficient is 1. That means we have to prove

 \sum_{n=-\infty}^\infty J_n(\beta)^2 = 1

This is a well-known result. For example, it is equation 9.1.76 in Abramowitz and Stegun. We’ll show how to prove it from first principles. We’ll actually prove a more general result, the Neumann-Schläfli addition formula, then show our result follows easily from that.

Neumann-Schläfli addition formula

Whittaker and Watson define the Bessel functions by their generating function:

\exp\left(\frac{z}{2}\left(t - \frac{1}{t}\right)\right) = \sum_{n=-\infty}^\infty t^n J_n(z)

This means that when you expand the expression on the left as a power series in t, whatever is multiplied by t^n is J_n(z) by definition. (There are other ways of defining the Bessel functions, but this way leads quickly to what we want to prove.)

We begin by factoring the Bessel generating function applied to z + w.

\exp\left(\frac{z+w}{2}\left(t - \frac{1}{t}\right)\right) = \exp\left(\frac{z}{2}\left(t - \frac{1}{t}\right)\right) \exp\left(\frac{w}{2}\left(t - \frac{1}{t}\right)\right)

Next we expand both sides as power series.

\sum_{n=-\infty}^\infty t^n J_n(z+w) = \sum_{j=-\infty}^\infty t^j J_j(z) \sum_{k=-\infty}^\infty t^k J_k(w)

and look at the terms involving t^n on both sides. On the left the coefficient of t^n is J_n(z + w). On the right, we multiply two power series. We get a term containing t^n whenever we multiply terms t^j and t^k where j and k sum to n.

J_n(z+w) = \sum_{j+k = n} J_j(z) J_k(w) = \sum_{m=-\infty}^\infty J_m(z) J_{n-m}(w)

The equation above is the Neumann-Schläfli addition formula.
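
As a quick numerical sanity check (my sketch, not part of the original post), we can compare the two sides of the addition formula using scipy.special.jv, truncating the infinite sum at an arbitrary point.

from scipy.special import jv

n, z, w = 3, 1.7, 0.9   # arbitrary test values
lhs = jv(n, z + w)
rhs = sum(jv(m, z)*jv(n - m, w) for m in range(-50, 51))
print(lhs, rhs)   # the two values agree to many decimal places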

Sum of squared coefficients

To prove that the sum of the squared coefficients is 1, we apply the addition formula with n = 0, z = β, and w = -β.

1 = J_0(\beta - \beta) = \sum_{m=-\infty}^\infty J_m(\beta) J_{-m}(-\beta) = \sum_{m=-\infty}^\infty J_m(\beta)^2

This proves what we were after:

 \sum_{n=-\infty}^\infty J_n(\beta)^2 = 1

We used a couple of facts in the last step that we haven’t discussed. The first was that J_0(0) = 1. This follows from the generating function by setting z to 0 and comparing the coefficients of t^0 on both sides. The second was that J_{-m}(-β) = J_m(β). You can also see this from the generating function, since negating z has the same effect as swapping t and 1/t.
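
Using scipy.special.jv again, here is a one-line numerical check of the sum-of-squares identity for an arbitrary value of β.

from scipy.special import jv

beta = 5.0   # arbitrary modulation index
print(sum(jv(n, beta)**2 for n in range(-100, 101)))   # very close to 1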


Quantifying Loudness

How do you quantify how loud a sound is? Sounds like a simple question, but it’s not.

What is loudness?

It’s not hard to measure the physical intensity of a sound, but loudness is the perceived intensity of a sound. It is not a physical phenomenon but a psychological one.

Loudness is subjective, but not entirely so. There is general consensus regarding what it means for two sounds to be equally loud, and even for ratios, such as saying when one sound is twice as loud as the other. Loudness is quantifiable, but not easily so.

What does loudness depend on?

Loudness depends on several properties of a sound, such as its frequency, bandwidth, and duration. Loudness must depend on frequency because sounds that are too low or too high have no loudness at all: we simply cannot hear them. But even within the range of audible frequencies, loudness varies quite a bit with pitch. The graph below, via Wikipedia, shows equal loudness contours. The blue lines are from work by Fletcher and Munson in 1937. The red lines are the revised curves per the ISO 226:2003 standard.

Fletcher-Munson curves

The horizontal axis is frequency in Hz and the vertical axis is sound pressure level in decibels. The contour lines represent combinations of frequency and sound pressure level that are perceived to be equally loud. If a tuba and a flute sound equally loud, the sound pressure level coming from the tuba is much higher.

Notice that the curves are not parallel. They’re much closer together for low frequencies than for midrange frequencies, though they are roughly parallel for high frequencies. This means that if you recorded a piano, for example, playing each of its keys at equal loudness, the pitches wouldn’t sound equally loud unless you played the recording back at its original volume.

Complexities and simplifications

As complicated as this is, it’s still a simplification. It is based on pure tones, simple sine waves. A single musical instrument, much less an orchestra or a jackhammer, is more complicated. Loudness is highly nonlinear, so you cannot say that the loudness of two sounds is the sum of their individual loudnesses. A-weighting is a relatively simple way to convert sound pressure levels to loudness, but it is only accurate for pure tones at fairly low loudness levels.

To simplify things further, consider a single pure tone, a sine wave at 1 kHz. (This is almost two octaves above middle C. See details here.) Loudness level in phons is defined to match sound pressure level in decibels for a 1 kHz pure tone. So a sound has a loudness level of 40 phons, for example, if it is perceived to be as loud as a pure 1 kHz tone at 40 dB.

At 1 kHz, loudness increases by a factor of 2 for every 10 dB increase in sound pressure level. But because nothing is simple in psychoacoustics, even this is a simplification. It only holds for sounds with loudness level 40 phons or greater. A quiet room is around 40 phons, so the added complications below 40 phons may not be relevant in many applications.

A pure tone at 1 kHz and 20 dB sounds more than four times softer than the same tone at 40 dB. The definition of loudness level in phons still holds below 40 phons. An oboe has a loudness level of 20 phons if it has the same loudness as a sine wave with frequency 1 kHz and sound pressure level 20 dB. But an oboe at 30 phons will sound more than twice as loud as one at 20 phons.
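
Here is a minimal sketch of the doubling rule just described, expressing loudness relative to a 40 phon reference tone. It only applies at 40 phons and above, as noted.

def relative_loudness(phons):
    # Factor-of-2-per-10-phon rule; only valid at 40 phons and above
    if phons < 40:
        raise ValueError("the simple doubling rule does not apply below 40 phons")
    return 2**((phons - 40)/10)

print(relative_loudness(50))   # 2.0: ten phons louder sounds twice as loud
print(relative_loudness(70))   # 8.0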

Summary

So where are we as far as calculating loudness? We’ve said a lot about what you can’t do, what complications have to be considered. But we’ve concluded this much: for a pure 1 kHz tone, the loudness in phons equals (by definition) the sound pressure level in decibels. And we’ve said how in principle you could define the loudness of other sounds: compare them to a 1 kHz tone that’s just as loud. We haven’t said how to compute this, only how you could determine it empirically.

In future posts I may write about how you do this using the ISO 532B standard or the newer ANSI S3.4-2007 standard.


Rigor and Vigor in Mathematics

I just started reading Frequency Analysis, Modulation and Noise by Stanford Goldman. The writing is strikingly elegant and clear. Here is a paragraph from the introduction.

Rigorous mathematics has a rightful place of honor in human thought. However, it has wisely been said that vigor is more important than rigor in the use of mathematics by the average man. In the particular case of this volume, the amount of rigor will be used that is necessary for a thorough understanding of the subject at hand by a radio engineer; but when it appears that rigor will confuse rather than clarify the subject for an engineer, we shall trust in the correctness of the results established by rigorous methods by the pure mathematicians and use them without the background of a rigorous proof.

The back of the book says  “Professor Goldman’s exposition is both mathematically and physically enlightening and it is unusually well written.” So far I agree.

(I found the 1967 Dover paperback reprint of the original 1948 hardback at a used book store. I looked at Dover’s site while writing this and it doesn’t seem to be in print.)

Graphs and square roots modulo a prime

Imagine a clock with a prime number of hours. So instead of being divided into 12 hours, it might be divided into 11 or 13, for example. You can add numbers on this clock the way you’d add numbers on an ordinary clock: add the two numbers as ordinary integers and take the remainder by p, the number of hours on the clock. You can multiply them the same way: multiply as ordinary integers, then take the remainder by p. This is called arithmetic modulo p, or just mod p.

Which numbers have square roots mod p? That is, for which numbers x does there exist a number y such that y² = x mod p? It turns out that of the non-zero numbers, half have a square root and half don’t. The numbers that have square roots are called quadratic residues and the ones that do not have square roots are called quadratic nonresidues. Zero is usually left out of the conversation. For example, if you square the numbers 1 through 6 and take the remainder mod 7, you get 1, 4, 2, 2, 4, and 1. So mod 7 the quadratic residues are 1, 2, and 4, and the nonresidues are 3, 5, and 6.

A simple, brute force way to tell whether a number is a quadratic residue mod p is to square all the numbers from 1 to p-1 and see if any of the results equals your number. There are more efficient algorithms, but this is the easiest to understand.

At first it seems that the quadratic residues and nonresidues are scattered randomly. For example, here are the quadratic residues mod 23:

[1, 2, 3, 4, 6, 8, 9, 12, 13, 16, 18]
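
The brute force test can be replaced by Euler’s criterion (a standard result, though not necessarily the faster algorithm the author had in mind): for x not divisible by p, x is a quadratic residue mod p if and only if x^((p−1)/2) = 1 mod p. Here’s a sketch that reproduces the list above.

def is_residue(x, p):
    # Euler's criterion: the power is 1 mod p for residues, p - 1 for nonresidues
    return pow(x, (p - 1)//2, p) == 1

print([x for x in range(1, 23) if is_residue(x, 23)])
# [1, 2, 3, 4, 6, 8, 9, 12, 13, 16, 18]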

But there are patterns, such as the famous quadratic reciprocity theorem. We can see some pattern in the residues by visualizing them on a graph. We make the numbers mod p vertices of a graph, and join two vertices a and b with an edge if a − b is a quadratic residue mod p.

Here’s the resulting graph mod 13:

And here’s the corresponding graph for p = 17:

And here’s Python code to produce these graphs:

import networkx as nx
import matplotlib.pyplot as plt

# Brute force test of whether x is a quadratic residue mod p
def residue(x, p):
    for i in range(1, p):
        if (i*i - x) % p == 0:
            return True
    return False

p = 17
G = nx.Graph()
for i in range(p):
    for j in range(i+1, p):
        if residue(i-j, p):
            G.add_edge(i, j)

nx.draw_circular(G)
plt.show()

However, this isn’t the appropriate graph for all values of p. The graphs above are undirected graphs. We’ve implicitly assumed that if there’s an edge between a and b then there should be an edge between b and a. And that’s true for p = 13 or 17, but not, for example, for p = 11. Why’s that?

If p leaves a remainder of 1 when divided by 4, i.e. p = 1 mod 4, then x is a quadratic residue mod p if and only if –x is a quadratic residue mod p. So if a − b is a residue, so is b − a. This means an undirected graph is appropriate. But if p = 3 mod 4, x is a residue mod p if and only if –x is a non-residue mod p. This means we should use directed graphs if p = 3 mod 4. Here’s an example for p = 7.

The code for creating the directed graphs is a little different:

G = nx.DiGraph()
for i in range(p):
    for j in range(p):
        if residue(i-j, p):
            G.add_edge(i, j)

Unfortunately, the way NetworkX draws directed graphs is disappointing. A directed edge from a to b is drawn with a heavy line segment on the b end. Any suggestions for a Python package that draws more attractive directed graphs?

The idea of creating graphs this way came from Roger Cook’s chapter on number theory applications in Graph Connections. (I’m not related to Roger Cook as far as I know.)

You could look at quadratic residues modulo composite numbers too. And I imagine you could also make some interesting graphs using Legendre symbols.


The empty middle: why no one is average

In 1945, a Cleveland newspaper held a contest to find the woman whose measurements were closest to average. This average was based on a study of 15,000 women by Dr. Robert Dickinson and embodied in a statue called Norma by Abram Belskie. Out of 3,864 contestants, no one was average on all nine factors, and fewer than 40 were close to average on five factors. The story of Norma and the Cleveland contest is told in Todd Rose’s book The End of Average.

People are not completely described by a handful of numbers. We’re much more complicated than that. But even in systems that are well described by a few numbers, the region around the average can be nearly empty. I’ll explain why that’s true in general, then look back at the Norma example.

General theory

Suppose you have N points, each described by n independent, standard normal random variables. That is, each point has the form (x_1, x_2, x_3, …, x_n) where each x_i is independent with a normal distribution with mean 0 and variance 1. The expected value of each coordinate is 0, so you might expect that most points are piled up near the origin (0, 0, 0, …, 0). In fact most points are in a spherical shell around the origin. Specifically, as n becomes larger, most of the points will be in a thin shell at a distance of about √n from the origin. (More details here.)
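
Here’s a quick simulation (a sketch with arbitrary sample sizes) illustrating the thin shell: the average distance from the origin is close to √n, while the spread around that average stays roughly constant as n grows.

import numpy as np

for n in [9, 100, 1000]:
    X = np.random.normal(size=(10000, n))
    distances = np.linalg.norm(X, axis=1)
    print(n, n**0.5, distances.mean(), distances.std())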

Simulated contest

In the contest above, n = 9, and so we expect most contestants to be about a distance of 3 from average when we normalize each of the factors being measured, i.e. we subtract the mean so that each factor has mean 0, and we divide each by its standard deviation so the standard deviation is 1 on each factor.

We’ve made several simplifying assumptions. For example, we’ve assumed independence, though presumably some of the factors measured in the contest were correlated. There’s also a selection bias: presumably women who knew they were far from average would not have entered the contest. But we’ll run with our simplified model just to see how it behaves in a simulation.

import numpy as np

# Winning criteria: minimum Euclidean distance
def euclidean_norm(x):
    return np.linalg.norm(x)

# Winning criteria: min-max
def max_norm(x):
    return max(abs(x))

n = 9
N = 3864

# Simulated normalized measurements of contestants 
M = np.random.normal(size=(N, n))

euclid = np.empty(N)
maxdev = np.empty(N)
for i in range(N):
    euclid[i] = euclidean_norm(M[i,:])
    maxdev[i] = max_norm(M[i,:])

w1 = euclid.argmin()
w2 = maxdev.argmin()

print( M[w1,:] )
print( euclidean_norm(M[w1,:]) )
print( M[w2,:] )
print( max_norm(M[w2,:]) )

There are two different winners, depending on how we decide the winner. Using the Euclidean distance to the origin, the winner in this simulation was contestant 3306. Her normalized measurements were

[ 0.1807, 0.6128, -0.0532, 0.2491, -0.2634, 0.2196, 0.0068, -0.1164, -0.0740]

corresponding to a Euclidean distance of 0.7808.

If we judge the winner to be the one whose largest deviation from average is the smallest, the winner is contestant 1916. Her normalized measurements were

[-0.3757, 0.4301, -0.4510, 0.2139, 0.0130, -0.2504, -0.1190, -0.3065, -0.4593]

with the largest deviation being the last, 0.4593.

By either measure, the contestant closest to the average deviated significantly from the average in at least one dimension.


Analyzing an FM signal

Frequency modulation combines a signal with a carrier wave by changing (modulating) the carrier wave’s frequency.

Starting with a cosine carrier wave with frequency f_c Hz and adding a signal with amplitude β and frequency f_m Hz results in the combination

\cos( 2\pi f_c t + \beta \sin(2\pi f_m t) )

The factor β is known as the modulation index.

We’d like to understand this signal in terms of cosines without any frequency modulation. It turns out the result is a set of cosines weighted by Bessel functions of β.

\cos( 2\pi f_c t + \beta \sin(2\pi f_m t) ) = \sum_{n=-\infty}^\infty J_n(\beta) \cos(2\pi(f_c + nf_m)t)

Component amplitudes

We will prove the equation above, but first we’ll discuss what it means for the amplitudes of the cosine components.

For small values of β, the Bessel functions decay quickly, which means the carrier component dominates and only the first few sidebands matter. For larger values of β, the Bessel function values increase to a maximum then decay like one over the square root of the index. To see this we compare the coefficients for modulation index β = 0.5 and β = 5.0.

First, β = 0.5:

and now for β = 5.0:

For fixed β and large n we have

J_n(\beta) \approx \frac{\beta^n}{2^n \, n!}

and so the sideband amplitudes eventually decay very quickly.
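
Here’s a short sketch (using scipy.special.jv; the range of n and the plotting details are arbitrary choices of mine) for plotting the coefficients J_n(β) at β = 0.5 and β = 5.0.

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import jv

ns = np.arange(-15, 16)
for beta in [0.5, 5.0]:
    plt.plot(ns, jv(ns, beta), "o-", label=f"beta = {beta}")
plt.xlabel("n")
plt.ylabel("J_n(beta)")
plt.legend()
plt.show()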

Update: See this post for what the equation above says about energy moving from the carrier to sidebands.

Proof

To prove the equation above, we need three basic trig identities

\cos(A + B) &=& \cos A \cos B - \sin A \sin B \\ 2\cos A \cos B &=& \cos(A-B) + \cos(A+B) \\ 2\sin A \sin B &=& \cos(A-B) - \cos(A+B)

and three Bessel function identities

\cos( z \sin \theta) &=& J_0(z) + 2\sum_{k=1}^\infty J_{2k}(z) \cos(2k\theta) \\ \sin( z \sin \theta) &=& 2\sum_{k=0}^\infty J_{2k+1}(z) \sin((2k+1)\theta) \\ J_{-n}(z) &=& (-1)^n J_n(z)

The Bessel function identities above can be found in Abramowitz and Stegun as equations 9.1.42, 9.1.43, and 9.1.5.

And now the proof. We start with

\cos( 2\pi f_c t + \beta \sin(2\pi f_m t) )

and apply the sum identity for cosines to get

\cos(2\pi f_c t) \cos(\beta \sin(2\pi f_m t)) - \sin(2\pi f_c t) \sin(\beta \sin(2\pi f_m t))

Now let’s take the first term

 \cos(2\pi f_c t) \cos(\beta \sin(2\pi f_m t))

and apply one of our Bessel identities to expand it to

J_0(\beta) \cos(2\pi f_c t) + \sum_{k=1}^\infty J_{2k}(\beta) \left\{ \cos(2\pi (f_c - 2k f_m)t) + \cos(2\pi(f_c + 2k f_m)t) \right\}

which can be simplified to

\sum_{n \,\, \mathrm{even}} J_n(\beta) \cos(2\pi(f_c + nf_m)t)

where the sum runs over all even integers, positive and negative.

Now we do the same with the second term, the product of the two sines. We expand

\sin(2\pi f_c t) \sin(\beta \sin(2\pi f_m t))

to

\sum_{k=0}^\infty J_{2k+1}(\beta) \left\{ \cos(2\pi (f_c - (2k+1) f_m)t) - \cos(2\pi(f_c + (2k+1) f_m)t) \right\}

which, after restoring the minus sign in front of this term and applying the identity J_{-n}(β) = (-1)^n J_n(β), simplifies to

\sum_{n \,\, \mathrm{odd}} J_n(\beta) \cos(2\pi(f_c + nf_m)t)

where again the sum is over all (odd this time) integers. Combining the two halves gives our result

\cos( 2\pi f_c t + \beta \sin(2\pi f_m t) ) = \sum_{n=-\infty}^\infty J_n(\beta) \cos(2\pi(f_c + nf_m)t)
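
As a quick numerical check of this expansion (a sketch with arbitrary choices of carrier frequency, modulation frequency, modulation index, and truncation point), we can compare the FM signal to a truncated version of the Bessel sum on a grid of time samples.

import numpy as np
from scipy.special import jv

fc, fm, beta = 100.0, 7.0, 2.0   # arbitrary carrier, modulation frequency, and index
t = np.linspace(0, 1, 1000)

fm_signal = np.cos(2*np.pi*fc*t + beta*np.sin(2*np.pi*fm*t))
bessel_sum = sum(jv(n, beta)*np.cos(2*np.pi*(fc + n*fm)*t) for n in range(-40, 41))

print(np.max(np.abs(fm_signal - bessel_sum)))   # difference is near machine precision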


Related post: Visualizing Bessel functions

Formulating applied math problems

Somewhere in school I got the backward idea that solving math problems is hard but that formulating them is easy. I don’t know if anybody ever said that to me. Maybe it was just implied by years of solving problems someone else had formulated.

A related wrong idea that I also picked up was that formulating math problems was not a mathematician’s responsibility. Someone, probably an engineer, would formulate the problem and hand it over to a mathematician. That happens occasionally, but that’s not how it usually works.

Formulating problems is hard, and it’s usually the applied mathematician’s responsibility, ideally with generous input from a domain area expert.

There are a lot of ways to turn a real world problem into a math problem, and several of them may be adequate for the task at hand. In that case you might as well choose the one that is easiest to understand and compute with. Knowing several ways to formulate a problem increases your chances of finding an approach that’s tractable. And if you can determine what problem really needs to be solved, not just the problem you first see, you give yourself even more options for how to go about it.

Applied mathematicians don’t need to be experts in every area of application, and of course they cannot be. But they do need to meet clients half way (or more). They need to know something about the problem domain. They need to listen well and ask good questions. The questions help the mathematician get going, and they may also give the client something new to think about.

Curious numbers

An n-digit number is said to be curious if the last n digits of its square are the same as the original number. For example, 25² = 625 and 76² = 5776. (Curious numbers are also known as automorphic numbers.)

There are bigger curious numbers, such as 212890625 and 787109376:

212890625² = 45322418212890625

and

787109376² = 619541169787109376.

And if the square of x has the same last n digits as x, so does the cube of x and all higher powers.

It turns out that for each n > 1, there are two curious numbers of length n.

Always two there are; no more, no less. — Yoda

There’s even a formula for the two solutions. The first is the remainder when 5 raised to the power 2^n is divided by 10^n, and the second is 10^n + 1 minus the first.

Here’s a little Python code to show that the first several solutions really are solutions.

# Verify the formula: a² ends in a and b² ends in b mod 10^i,
# so both printed values should be 0.
for i in range(2, 20):
    a = 5**(2**i) % 10**i
    b = 10**i - a + 1
    print((a**2 - a) % 10**i, (b**2 - b) % 10**i)

Related: Applied number theory

Types of nonlinearity in PDEs

My advisor in grad school used to say

“Nonlinear” is not a hypothesis but the lack of a hypothesis.

To say something positive about nonlinear equations, you have to replace linearity with some specific property. You want to partially remove the restriction of linearity without letting just anything in.

In partial differential equations, one pattern of nonlinearity is to replace linear with monotone.

We say a function on the real line is monotone if x ≥ y implies f(x) ≥ f(y). Strictly speaking this is the definition of monotone non-decreasing, but in this context the “non-decreasing” qualifier is dropped. Now suppose f is a transformation of R^n. What could it mean for f to be monotone when statements like x > y don’t make sense? We could rewrite the one dimensional definition as saying

(f(x) – f(y))(x – y) ≥ 0

for all x and y. This form generalizes to transformations of R^n if we interpret the multiplication above as an inner product. More generally we say that an operator A from a Banach space V to its dual V* is monotone if

(A(x) – A(y))(x – y) ≥ 0

where now instead of an inner product we have more generally the action of the linear functional A(x) – A(y) on the vector x – y. If the space V is a Hilbert space, then this is just an inner product, but in general it doesn’t have to be.
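
For a concrete one-dimensional example of a monotone nonlinear map (my example, not from the original post), take f(x) = x³. Then

(x^3 - y^3)(x - y) = (x - y)^2 (x^2 + xy + y^2) = (x - y)^2 \left( \left(x + \tfrac{y}{2}\right)^2 + \tfrac{3y^2}{4} \right) \geq 0

so the cubing map satisfies the monotonicity condition even though it is far from linear.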

In applications to PDEs, the operator A would represent an operator between function spaces so that the PDE has the form Au = f where u is a solution in V and the right hand side f is in V*. The operator A represents some weak form of the PDE and the space V is some sort of Sobolev space with the necessary boundary value assumptions baked in.

Monotonicity alone isn’t enough to prove existence or uniqueness. We need a few other properties.

We say the operator A from V to V* is coercive if Au(u) / ||u|| goes to infinity as ||u|| goes to infinity.

We say A is Type M if whenever u_n converges weakly to u, Au_n converges weakly to f, and lim sup Au_n(u_n) ≤ f(u), then Au = f.

Here’s an example of the kind of theorems you can prove with these definitions.

If A is a Type M, bounded, and coercive operator from a separable reflexive Banach space V to its dual V*, then A is surjective.

In application, the Banach space in the theorem is some sort of Sobolev space, functions in some Lp space whose generalized derivatives are in the same space, along with some boundary conditions. (You might wonder how boundary conditions can be defined for functions in such a space. They can’t directly, but they can indirectly via trace operators. Generalized boundary values for generalized functions. It’s all very generalized!)

Saying that A is surjective means the equation Au = f has a solution for any f in V*. So we reduce the problem of showing that the equation has a solution to verifying that A is Type M, bounded, and coercive. Type M is a form of continuity; bounded and coercive follow from a priori estimates.


Bayesian and nonlinear

Someone said years ago that you’ll know Bayesian statistics has become mainstream when people no longer put “Bayesian” in the titles of their papers. That day has come. While the Bayesian approach is still the preferred approach of a minority of statisticians, it’s no longer a novelty. If you want people to find your paper interesting, the substance needs to be interesting. A Bayesian approach isn’t remarkable alone.

You could say the same about nonlinear differential equations. Differential equations are so often nonlinear that the “nonlinear” qualifier isn’t always necessary to say explicitly. Just as a Bayesian analysis isn’t interesting just because it’s Bayesian, a differential equation isn’t necessarily interesting just because it’s nonlinear.

The analogy between Bayesian statistics and nonlinear differential equations breaks down though. Nonlinear equations are intrinsically more interesting than linear ones. But it’s no longer remarkable to solve a nonlinear differential equation numerically.

When an adjective becomes the default, it drops off and the previous default now requires an adjective. Terms like “electronic” and “digital” are fading from use. If you say you’re going to mail someone something, the default assumption is usually that you are going to email it. What used to be simply “mail” is now “snail mail.” Digital signal processing is starting to sound quaint. The abbreviation DSP is still in common use, but digital signal processing is simply signal processing. Now non-digital signal processing requires a qualifier, i.e. analog.

There was no term for Frequentist statistics when it was utterly dominant. Now of course there is. (Some people use the term “classical,” but that’s an odd term given that Bayesian analysis is older.) The term linear has been around a long time. Even when nearly all analysis was linear, people were aware that linearity was a necessary simplification.


Improving on Chebyshev’s inequality

Chebyshev’s inequality says that the probability of a random variable being more than k standard deviations away from its mean is at most 1/k². In symbols,

P(|X - E(X)| \geq k\sigma) \leq \frac{1}{k^2}

This inequality is very general, but also very weak. It assumes very little about the random variable X but it also gives a loose bound. If we assume slightly more, namely that X has a unimodal distribution, then we have a tighter bound, the Vysochanskiĭ-Petunin inequality.

P(|X - E(X)| \geq k\sigma) \leq \frac{4}{9k^2}

However, the Vysochanskiĭ-Petunin inequality does require that k be larger than √(8/3). In exchange for the assumption of unimodality and the restriction on k we get to reduce our upper bound by more than half.

While tighter than Chebyshev’s inequality, the stronger inequality is still very general. We can usually do much better if we can say more about the distribution family. For example, suppose X has a uniform distribution. What is the probability that X is more than two standard deviations from its mean? Zero, because two standard deviations puts you outside the interval the uniform is defined on!

Among familiar distributions, when is the Vysochanskiĭ-Petunin inequality most accurate? That depends, of course, on what distributions you consider familiar, and what value of k you use. Let’s look at normal, exponential, and Pareto. These were chosen because they have thin, medium, and thick tails. We’ll also throw in the double exponential, because it has the same tail thickness as exponential but is symmetric. We’ll let k be 2 and 3.

Distribution family   P(|X – E(X)| > 2σ)   V-P estimate   P(|X – E(X)| > 3σ)   V-P estimate
Uniform               0.0000               0.1111         0.0000               0.0484
Normal                0.0455               0.1111         0.0027               0.0484
Exponential           0.0498               0.1111         0.0183               0.0484
Pareto                0.0277               0.1111         0.0156               0.0484
Double exponential    0.0591               0.1111         0.0144               0.0484

A normal random variable is more than 2 standard deviations away from its mean with probability 0.0455, compared to the Vysochanskiĭ-Petunin bound of 1/9 = 0.1111. A normal random variable is more than 3 standard deviations away from its mean with probability 0.0027, compared to the bound of 4/81 = 0.0484.

An exponential random variable with mean μ also has standard deviation μ, so the only way it could be more than 2μ from its mean is to be 3μ from 0. So an exponential is more than 2 standard deviations from its mean with probability exp(-3) = 0.0498, and more than 3 standard deviations with probability exp(-4) = 0.0183.

We’ll set the minimum value of our Pareto random variable to be 1. As with the exponential, the Pareto cannot be 2 standard deviations less than its mean, so we look at the probability of it being more than 2 standard deviations greater than its mean. The shape parameter α must be bigger than 2 for the variance to exist. The probability of our random variable being more than k standard deviations away from its mean works out to ((α-1)/((k+1)α))^α and is largest as α converges down toward 2. The limiting values for k equal to 2 and 3 are 1/36 = 0.0277 and 1/64 = 0.0156 respectively. Of our examples, the Pareto distribution comes closest to the Vysochanskiĭ-Petunin bounds, but doesn’t come that close.

The double exponential, also known as the Laplace distribution, has the highest probability of any of our examples of being two standard deviations from its mean, but this probability is still less than half of the Vysochanskiĭ-Petunin bound. The limit of the Pareto distribution has the highest probability of being three standard deviations from its mean, but still less than one-third of the Vysochanskiĭ-Petunin bound.

Generic bounds are useful, especially in theoretical calculations, but it’s usually possible to do much better with specific distributions.
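
As a sanity check on the table, here’s a short sketch using scipy.stats (with the parametrizations described above: a unit-variance normal, an exponential with mean 1, and a Laplace with unit scale) that reproduces the normal, exponential, and double exponential entries along with the Vysochanskiĭ-Petunin bound.

from scipy.stats import norm, expon, laplace

for k in [2, 3]:
    vp_bound = 4/(9*k**2)
    p_normal = 2*norm.sf(k)               # symmetric about its mean, standard deviation 1
    p_expon = expon.sf(1 + k)             # mean 1, sd 1; cannot be k sd's below its mean
    p_laplace = 2*laplace.sf(k*2**0.5)    # mean 0, sd sqrt(2)
    print(k, round(p_normal, 4), round(p_expon, 4), round(p_laplace, 4), round(vp_bound, 4))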
