Runge phenomena

I’ve mentioned the Runge phenomenon in a couple posts before. Here I’m going to go into a little more detail.

First of all, the “Runge” here is Carl David Tolmé Runge, better known for the Runge-Kutta algorithm for numerically solving differential equations. His name rhymes with cowabunga, not with sponge.

Runge showed that polynomial interpolation at evenly-spaced points can fail spectacularly to converge. His example is the function f(x) = 1/(1 + x²) on the interval [-5, 5], or equivalently, and more conveniently here, the function f(x) = 1/(1 + 25x²) on the interval [-1, 1]. Here’s an example with 16 interpolation nodes.

Runge's example

Runge found that in order for interpolation at evenly spaced nodes in [-1, 1] to converge, the function being interpolated needs to be analytic inside a football-shaped [1] region of the complex plane with major axis [-1, 1] on the real axis and minor axis approximately [-0.5255, 0.5255]  on the imaginary axis. For more details, see [2].

The function in Runge’s example has a singularity at 0.2i, which is inside the football. Polynomial interpolation at evenly spaced points in [-1, 1] would converge for the function f(x) = 1/(1 + x²) since its singularity at i is outside the football.

Runge's example

For another example, consider the function f(x) = exp(-1/x²), defined to be 0 at 0. This function is infinitely differentiable but it is not analytic at the origin. With only 16 interpolation points as above, there’s a small indication of trouble at the ends.

Interpolating exp(-1/x^2)

With 28 interpolation points in the plot below, the lack of convergence is clear.

Interpolating exp(-1/x^2)
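
Here’s a short Python sketch of this kind of experiment; the node count and plotting grid are my own choices and may not match the plots above exactly.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import interpolate

    def f(x):
        # exp(-1/x^2), with f(0) defined to be 0
        x = np.asarray(x, dtype=float)
        y = np.zeros_like(x)
        nonzero = x != 0
        y[nonzero] = np.exp(-1/x[nonzero]**2)
        return y

    n = 28  # try 16 as well
    x = np.linspace(-1, 1, n)      # evenly spaced interpolation nodes
    p = interpolate.BarycentricInterpolator(x, f(x))

    t = np.linspace(-1, 1, 500)    # points to plot at
    plt.plot(x, f(x), 'o', t, p(t), '-')
    plt.show()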

The problem is not polynomial interpolation per se but polynomial interpolation at evenly-spaced nodes. Interpolation at Chebyshev points converges for the examples here. The location of singularities affects the rate of convergence but not whether the interpolants converge.

Related: Help with interpolation

***

[1] American football, that is. The region is like an ellipse but pointy at -1 and 1.

[2] Approximation Theory and Approximation Practice by Lloyd N. Trefethen

Twenty questions and conditional probability

The previous post compared bits of information to answers in a game of Twenty Questions.

The optimal strategy for playing Twenty Questions is for each question to split the remaining possibilities in half. There are a couple ways to justify this strategy: minmax and average.

The minmax approach is to minimize the worst thing that could happen. The worst thing that could happen after asking a question is for your subject to be in the larger half of possibilities. For example, asking whether someone is left-handed is not a good strategy: the worst case scenario is “no,” in which case you’ve only narrowed your possibilities by 10%. The best minmax approach is to split the population in half with each question.

The best average approach is also to split the possibilities in half each time. With the handedness example, you learn more if the answer is “yes,” but there’s only a 10% chance that the answer is yes. There’s a 10% chance of gaining 3.322 bits of information, but a 90% chance of only gaining 0.152 bits. So the expected number of bits, the entropy, is 0.469 bits. Entropy is maximized when all outcomes are equally likely. That means you can’t learn more than 1 bit of information on average from a yes/no question, and you learn the most when both possibilities are equally likely.
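
Here’s a quick Python check of the arithmetic above.

    from numpy import log2

    def bits(p):
        # expected information (entropy) of a yes/no question
        # whose answer is "yes" with probability p
        return -p*log2(p) - (1 - p)*log2(1 - p)

    print( bits(0.1) )  # handedness question: about 0.469 bits
    print( bits(0.5) )  # an even split: exactly 1 bit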

Now suppose you want to ask about height and sex. As in the previous post, we assume men and women’s heights are normally distributed with means 70 and 64 inches respectively, and both have standard deviation 3 inches.

If you ask whether a person is taller than 67 inches, you split a mixed population of men and women in half. You will learn 1 bit of information from this question, but you’ve put yourself in a suboptimal position for the next question. A height of at least 67 inches selects about half of the adult population in general, but it selects a majority of men and a minority of women. And as we discussed above, uneven splits are suboptimal, in the worst case and on average.

If you’re going to ask about height and sex, ask about sex first. If the person is female, ask next whether her height is above 64 inches. But if the person is male, ask whether his height is above 70 inches. That is, you want to split the population evenly at each step conditioning on your previous answer. A cutoff of 67 inches is optimal unconditionally, but suboptimal if you condition on sex.
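
Here’s a quick check of these splits using the same normal models as above.

    from scipy.stats import norm

    men   = norm(70, 3)
    women = norm(64, 3)

    # A 67-inch cutoff splits a 50/50 mix of men and women in half ...
    print( 0.5*men.sf(67) + 0.5*women.sf(67) )  # 0.5

    # ... but conditional on sex the split is far from even.
    print( men.sf(67) )    # about 0.84
    print( women.sf(67) )  # about 0.16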

The optimal strategy for Twenty Questions is to ask questions that have probability 1/2 of being true, conditional on all previous answers. You might get lucky with uneven conditional splits, but on average, and in the worst case, you won’t do as well.

Handedness, introversion, height, blood type, and PII

I’ve had data privacy on my mind a lot lately because I’ve been doing some consulting projects in that arena.

When I saw a tweet from Tim Hopper a little while ago, my first thought was “How many bits of PII is that?”. [1]

Let’s see. There’s some small correlation between these characteristics, but let’s say they’re independent. (For example, someone over 6′ 5″ is most likely male, and a larger percentage of males than females are left handed. But we’ll let that pass. This is just back-of-the-envelope reckoning.)

About 10% of the population is left-handed (11% for men, 9% for women) and so left-handedness carries -log2(0.1) = 3.3 bits of information.

I don’t know how many people identify as introverts. I believe I’m a mezzovert, somewhere between introvert and extrovert, but I imagine when asked most people would pick “introvert” or “extrovert,” maybe half each. So we’ve got about one bit of information from knowing someone is an introvert.

The most common blood type in the US is O+ at 37% and so that carries 1.4 bits of information. (AB-, the most rare, corresponds to 7.4 bits of information. On average, blood type carries 2.2 bits of information in the US.)

What about height? Adult heights are approximately normally distributed, but not exactly. The normal approximation breaks down in the extremes, and we’re headed that way, but as noted above, this is just a quick and dirty calculation.

Heights in general don’t follow a normal distribution, but heights for men and women separately follow a normal. So for the general (adult) population, height follows a mixture distribution. Assume the average height for women is 64 inches, the average for men is 70 inches, and both have a standard deviation of 3 inches. Then the probability of a man being taller than 6′ 5″ would be about 0.001 and the probability of a woman being that tall would be essentially zero [2]. So the probability that a person is over 6′ 5″ would be about 0.0005, corresponding to about 11 bits of information.

All told, there are 16.7 bits of information in the tweet above, as much information as you’d get after 16 or 17 questions of the game Twenty Questions, assuming all your questions are independent and have probability 1/2 of being answered in the affirmative.
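
Here’s the tally in Python, using the back-of-the-envelope probabilities above rather than precise values.

    from numpy import log2

    probabilities = {
        "left-handed"   : 0.10,
        "introvert"     : 0.50,
        "O+ blood"      : 0.37,
        "six foot five" : 0.0005,
    }

    bits = {trait: -log2(p) for trait, p in probabilities.items()}
    print( bits )
    print( sum(bits.values()) )  # about 16.7 bits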

***

[1] PII = Personally Identifiable Information

[2] There are certainly women at least 6′ 5″. I can think of at least one woman I know who may be that tall. So the probability shouldn’t be less than 1 in 7 billion. But the normal approximation gives a probability of 8.8 × 10^-15. This is an example of where the normal distribution assumption breaks down in the extremes.

Pareto distribution and Benford’s law

The Pareto probability distribution has density

f(x) = \frac{a}{x^{a+1}}

for x ≥ 1 where a > 0 is a shape parameter. The Pareto distribution and the Pareto principle (i.e. “80-20” rule) are named after the same person, the Italian economist Vilfredo Pareto.

Samples from a Pareto distribution obey Benford’s law in the limit as the parameter a goes to zero. That is, the smaller the parameter a, the more closely the distribution of the first digits of the samples come to following the distribution known as Benford’s law.

Here’s an illustration of this comparing the distribution of 1,000 random samples from a Pareto distribution with shape a = 1 and shape a = 0.2 with the counts expected under Benford’s law.

Distribution of leading digits of Pareto samples in base 10

Note that this has nothing to do with base 10 per se. If we look at the leading digits as expressed in any other base, such as base 16 below, we see the same pattern.

Distribution of leading digits of Pareto samples in base 16
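
Here’s a minimal Python sketch of how you could run this experiment yourself. The sample size and seed are arbitrary, and this isn’t necessarily how the plots above were made.

    import numpy as np

    def leading_digit_counts(samples, base=10):
        # leading digit of each sample, via the fractional part of its log
        digits = np.floor(base**(np.log(samples)/np.log(base) % 1)).astype(int)
        return np.bincount(digits, minlength=base)[1:]

    def benford_counts(n, base=10):
        d = np.arange(1, base)
        return n*np.log(1 + 1/d)/np.log(base)

    np.random.seed(20171031)  # arbitrary seed
    n = 1000
    for a in [1.0, 0.2]:
        samples = np.random.pareto(a, n) + 1  # shift so samples are >= 1
        print( a, leading_digit_counts(samples) )
    print( np.round(benford_counts(n)) )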


Random number generation posts

Random number generation is typically a two-step process: first generate a uniformly distributed value, then transform that value to have the desired distribution. The former is the hard part, but also the part more likely to have been done for you in a library. The latter is relatively easy in principle, though some distributions are hard to (efficiently) sample from.
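
As a tiny illustration of the two steps, here’s how you might turn a uniform sample into an exponential sample using the inverse transform method.

    from random import random
    from math import log

    def exponential(lam):
        # If U is uniform on (0, 1), then -log(U)/lam is exponential with rate lam.
        u = 1.0 - random()    # step 1: uniform value (the hard part, done by the library)
        return -log(u)/lam    # step 2: transform to the desired distribution

    print( [exponential(2.0) for _ in range(5)] )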

Here are some posts on testing a uniform RNG.

Here’s a book chapter I wrote on testing the transformation of a uniform RNG into some other distribution.

A few posts on manipulating a random number generator.

And finally, a post on a cryptographically secure random number generator.

Quantifying information gain in beta-binomial Bayesian model

The beta-binomial model is the “hello world” example of Bayesian statistics. I would call it a toy model, except it is actually useful. It’s not nearly as complicated as most models used in applications, but it illustrates the basics of Bayesian inference. Because it’s a conjugate model, the calculations work out trivially.

For more on the beta-binomial model itself, see A Bayesian view of Amazon Resellers and Functional Folds and Conjugate Models.

I mentioned in a recent post that the Kullback-Leibler divergence from the prior distribution to the posterior distribution is a measure of how much information was gained.

Here’s a little Python code for computing this. Enter the a and b parameters of the prior and the posterior to compute how much information was gained.

    from scipy.integrate import quad
    from scipy.stats import beta
    from numpy import log2

    def infogain(post_a, post_b, prior_a, prior_b):
        # K-L divergence from the prior beta(prior_a, prior_b)
        # to the posterior beta(post_a, post_b), in bits
        p = beta(post_a, post_b).pdf
        q = beta(prior_a, prior_b).pdf

        (info, error) = quad(lambda x: p(x) * log2(p(x) / q(x)), 0, 1)
        return info

This code works well for medium-sized inputs. It has problems with large inputs because the generic integration routine quad needs some help when the beta distributions become more concentrated.

You can see that surprising input carries more information. For example, suppose your prior is beta(3, 7). This distribution has a mean of 0.3 and so you’re expecting more failures than successes. With such a prior, a success changes your mind more than a failure does. You can quantify this by running these two calculations.

    print( infogain(4, 7, 3, 7) )
    print( infogain(3, 8, 3, 7) )

The first line shows that a success would change your information by 0.1563 bits, while the second shows that a failure would change it by 0.0297 bits.

Big aggregate queries can still violate privacy

Suppose you want to prevent your data science team from being able to find out information on individual customers, but you do want them to be able to get overall statistics. So you implement two policies.

  1. Data scientists can only query aggregate statistics, such as counts and averages.
  2. These aggregate statistics must be based on results that return at least 1,000 database rows.

This sounds good, but it’s naive. It’s not enough to protect customer privacy.

Someone wants to know how much money customer 123456789 makes. If he asks for this person’s income, the query would be blocked by the first rule. If he asks for the average income of customers with ID 123456789 then he gets past the first rule but not the second.

He decides to test whether the customer in question makes a six-figure income. So first he queries for the number of customers with income over $100,000. This is a count so it gets past the first rule. The result turns out to be 14,254, so it gets past the second rule as well. Now he asks how many customers with ID not equal to 123456789 have income over $100,000. This is a valid query as well, and returns 14,253. So by executing only queries that return aggregate statistics on thousands of rows, he found out that customer 123456789 has at least a six-figure income.

Now he goes back and asks for the average income of customers with income over $100,000. Then he asks for the average income of customers with income over $100,000 and with ID not equal to 123456789. With a little algebra, he’s able to find customer 123456789’s exact income.
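
Here’s the algebra spelled out, with made-up query results for illustration.

    # Fictional query results
    n_all    = 14254        # customers with income over $100,000
    n_excl   = 14253        # same, excluding customer 123456789
    avg_all  = 153_200.00   # average income of the first group (made up)
    avg_excl = 153_187.25   # average income of the second group (made up)

    # Each total income is count times average, so subtracting recovers the individual.
    income = n_all*avg_all - n_excl*avg_excl
    print( income )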

You might object that it’s cheating to have a clause such as “ID not equal 123456789” in a query. Of course it’s cheating. It clearly violates the spirit of the law, but not the letter. You might try to patch the rules by saying you cannot ask questions about a small set, nor about the complement of a small set. [1]

That doesn’t work either. Someone could run queries on customers with ID less than or equal to 123456789 and on customers with ID greater than or equal to 123456789. Both these sets and their complements may be large but they let you find out information on an individual.

You may be asking why you would let a data scientist have access to customer IDs at all. Obviously you wouldn’t do that if you wanted to protect customer privacy. The point of this example is that limiting queries to aggregates of large sets is not enough. You can find out information on small sets from information on large sets. This could still be a problem even with obvious identifiers removed.


***

[1] Readers familiar with measure theory might sense a σ-algebra lurking in the background.

Visualizing complex functions

It’s easy to visualize a function from two real variables to one real variable: Use the function value as the height of a surface over its input value. But what if you have one more dimension in the output? A complex function of a complex variable is equivalent to a function from two real variables to two real variables. You can use one of the output variables as height, but what do you do about the other one?

Annotating phase on 3D plots

You can start by expressing the function output f(z) in polar form

f(z) = r e^{i\theta}

Then you could plot the magnitude r as height and write the phase angle θ on the graph. Here’s a famous example of that approach from the cover of Abramowitz and Stegun.

You can find more on that book and the function it plots here.

This approach makes it easy to visualize the magnitude, but the phase is textual rather than visual. A more visual approach is to use color to represent the phase, such as hue in HSV scale.

Hue-only phase plots

You can combine color with height, but sometimes you’ll just use color alone. That is the approach taken in the book Visual Complex Functions. This is more useful than it may seem at first because the magnitude and phase are tightly coupled for an analytic function. The phase alone uniquely determines the function, up to a constant multiple. The image I posted the other day, the one I thought looked like Paul Klee meets Perry the Platypus, is an example of a phase-only plot.

filter contour plot

 

If I’d added more contour lines the plot would be more informative, but it would look less like a Paul Klee painting and less like a cartoon platypus, so I stuck with the defaults. Mathematica has dozens of color schemes for phase plots. I stumbled on “StarryNightColors” and liked it. I imagine I wouldn’t have seen the connection to Perry the Platypus in a different color scheme.
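
If you’d like to experiment with phase-only plots outside of Mathematica, here’s a minimal Python sketch. The function and plotting window are arbitrary choices of mine, not the filter plotted above.

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import hsv_to_rgb

    def f(z):
        # an arbitrary function to visualize; substitute whatever you like
        return (z**2 - 1)/(z**2 + 1)

    x = np.linspace(-2, 2, 400)
    y = np.linspace(-2, 2, 400)
    w = f(x[None, :] + 1j*y[:, None])

    h = (np.angle(w)/(2*np.pi)) % 1   # hue encodes the phase
    s = np.ones_like(h)               # full saturation
    v = np.ones_like(h)               # constant brightness: a phase-only plot
    plt.imshow(hsv_to_rgb(np.dstack([h, s, v])), origin="lower", extent=[-2, 2, -2, 2])
    plt.show()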

Using hue and brightness

You can visualize magnitude as well as phase if you add another dimension to color. That’s what Daniel Velleman did in a paper that I read recently [1]. He uses hue to represent angle and brightness to represent magnitude. I liked his approach partly on aesthetic grounds. Phase plots using hue only tend to look psychedelic and garish. The varying brightness makes the plots more attractive. I’ll give some examples then include Velleman’s code.

First, here’s a plot of Jacobi’s sn function [2], an elliptic function analogous to sine. I picked it because it has a zero and a pole. Zeros show up as black and poles as white. (The function is elliptic, so it’s periodic horizontally and vertically. Functions like sine are periodic horizontally but not vertically.)

f[z_] := JacobiSN[z I, 1.5]; ComplexPlot[-2, -1, 2, 1, 200, 200]

You can see the poles on the left and right and the zero in the middle. Note that the hues swirl in opposite directions around zeros and poles: ROYGBIV counterclockwise around a zero and clockwise around a pole.

Next, here’s a plot of the 5th Chebyshev polynomial. I chose this because I’ve been thinking about Chebyshev polynomials lately, and because polynomials make plots that fade to white in every direction. (That is, |p(z)| → ∞ as |z| → ∞ for all non-constant polynomials.)

f[z_] := ChebyshevT[5, z]; ComplexPlot[-2, -2, 2, 2, 400, 400]

Finally, here’s a plot of the gamma function. I included this example because you can see the poles as little white dots on the left, and the plot has a nice dark-to-light overall pattern.

f[x_] := Gamma[x]; ComplexPlot[-3.5, -3, 7, 3, 200, 200]

Mathematica code

Here’s the Mathematica code from Velleman’s paper. Note that in the HSV scale, he uses brightness to change both the saturation (S) and value (V). Unfortunately the function f being plotted is a global variable rather than being passed in as an argument to the plotting function.

brightness[x_] := If[x <= 1, 
    Sqrt[x]/2, 
    1 - 8/((x + 3)^2)] 

ComplexColor[z_] := If[z == 0, 
    Hue[0, 1, 0], 
    Hue[Arg[z]/(2 Pi), 
        (1 - (b = brightness[Abs[z]])^4)^0.25, 
        b]] 

ComplexPlot[xmin_, ymin_, xmax_, ymax_, xsteps_, ysteps_] := Block[
    {deltax = N[(xmax - xmin)/xsteps], 
     deltay = N[(ymax - ymin)/ysteps]}, 
    Graphics[
        Raster[ 
            Table[
                f[x + I y], 
                {y, ymin + deltay/2, ymax, deltay}, 
                {x, xmin + deltax/2, xmax, deltax}], 
                {{xmin, ymin}, {xmax, ymax}}, 
                ColorFunction -> ComplexColor
            ]
        ]
    ]

 


Footnotes

[1] Daniel J. Velleman. The Fundamental Theorem of Algebra: A Visual Approach. Mathematical Intelligencer, Volume 37, Number 4, 2015.

[2] I used f[z] = 10 JacobiSN[I z, 1.5]. I multiplied the argument by i because I wanted to rotate the picture 90 degrees. And I multiplied the output by 10 to get a less saturated image.

Why is Kullback-Leibler divergence not a distance?

The Kullback-Leibler divergence between two probability distributions is a measure of how different the two distributions are. It is sometimes called a distance, but it’s not a distance in the usual sense because it’s not symmetric. At first this asymmetry may seem like a bug, but it’s a feature. We’ll explain why it’s useful to measure the difference between two probability distributions in an asymmetric way.

The Kullback-Leibler divergence between two random variables X and Y is defined as

KL(X || Y) = -\int f_X(x) \log \frac{f_Y(x)}{f_X(x)} \, dx

This is pronounced/interpreted several ways:

  • The divergence from Y to X
  • The relative entropy of X with respect to Y
  • How well Y approximates X
  • The information gain going from the prior Y to the posterior X
  • The average surprise in seeing Y when you expected X

A theorem of Gibbs proves that K-L divergence is non-negative. It’s clearly zero if X and Y have the same distribution.

The K-L divergence of two random variables is an expected value, and so it matters which distribution you’re taking the expectation with respect to. That’s why it’s asymmetric.

As an example, consider the probability densities below, one exponential and one gamma with a shape parameter of 2.

exponential and gamma PDFs

The two densities differ mostly on the left end. The exponential distribution believes this region is likely while the gamma does not. This means that an expectation with respect to the exponential distribution will weigh things in this region more heavily. In an information-theoretic sense, an exponential is a better approximation to a gamma than the other way around.

Here’s some Python code to compute the divergences.

    from scipy.integrate import quad
    from scipy.stats import expon, gamma
    from numpy import inf

    def KL(X, Y):
        f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
        return quad(f, 0, inf)

    e = expon
    g = gamma(a = 2)

    print( KL(e, g) )
    print( KL(g, e) )

This returns

    (0.5772156649008394, 1.3799968612282498e-08)
    (0.4227843350984687, 2.7366807708872898e-09)

The first element of each pair is the integral and the second is the error estimate. So apparently both integrals have been computed accurately, and the first is clearly larger. This backs up our expectation that it’s more surprising to see a gamma when expecting an exponential than vice versa.

Although K-L divergence is asymmetric in general, it can be symmetric. For example, suppose X and Y are normal random variables with the same variance but different means. Then it would be equally surprising to see either one when expecting the other. You can verify this in the code above by changing the KL function to integrate over the whole real line

    def KL(X, Y):
        f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
        return quad(f, -inf, inf)

and trying an example.

    from scipy.stats import norm

    n1 = norm(1, 1)
    n2 = norm(2, 1)

    print( KL(n1, n2) )
    print( KL(n2, n1) )

This returns

(0.4999999999999981, 1.2012834963423225e-08)
(0.5000000000000001, 8.106890774205374e-09)

and so both integrals are equal to within the error in the numerical integration.

Chebyshev interpolation

Fitting a polynomial to a function at more points might not produce a better approximation. This is Faber’s theorem, something I wrote about the other day.

If the function you’re interpolating is smooth, then interpolating at more points may or may not improve the fit of the interpolation, depending on where you put the points. The famous example of Runge shows that interpolating

f(x) = 1 / (1 + x²)

at more points can make the fit worse. When interpolating at 16 evenly spaced points, the behavior is wild at the ends of the interval.

Runge example

Here’s the Python code that produced the plot.

    import matplotlib.pyplot as plt
    from scipy import interpolate
    from numpy import linspace

    def cauchy(x):
        return (1 + x**2)**-1

    n = 16
    x = linspace(-5, 5, n) # points to interpolate at

    y = cauchy(x)
    f = interpolate.BarycentricInterpolator(x, y)

    xnew = linspace(-5, 5, 200) # points to plot at
    ynew = f(xnew)
    plt.plot(x, y, 'o', xnew, ynew, '-')
    plt.show()

However, for smooth functions interpolating at more points does improve the fit if you interpolate at the roots of Chebyshev polynomials. As you interpolate at the roots of higher and higher degree Chebyshev polynomials, the interpolants converge to the function being interpolated. The plot below shows how interpolating at the roots of T16, the 16th Chebyshev polynomial, eliminates the bad behavior at the ends.

Runge example at Chebyshev nodes

To make this plot, we replaced x above with the roots of T16, rescaled from the interval [-1, 1] to the interval [-5, 5] to match the example above.

    from numpy import array, cos, pi

    x = [cos(pi*(2*k-1)/(2*n)) for k in range(1, n+1)]
    x = 5*array(x)

What if the function we’re interpolating isn’t smooth? If the function has a step discontinuity, we can see the Gibbs phenomenon, similar to what we saw in the previous post. Here’s the result of interpolating the indicator function of the interval [-1, 1] at 100 Chebyshev points. We get the same “bat ears” as before.

Gibbs phenomena for Chebyshev interpolation
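
Here’s a sketch of how that experiment might go in Python, assuming the Chebyshev points are rescaled to [-5, 5] as before; the details of the original plot may differ.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import interpolate

    n = 100
    k = np.arange(1, n + 1)
    x = 5*np.cos(np.pi*(2*k - 1)/(2*n))   # Chebyshev points rescaled to [-5, 5]
    y = (np.abs(x) <= 1).astype(float)    # indicator function of [-1, 1]

    f = interpolate.BarycentricInterpolator(x, y)
    t = np.linspace(-5, 5, 1000)
    plt.plot(t, f(t))
    plt.show()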

Related: Help with interpolation

Fourier-Bessel series and Gibbs phenomena

Fourier-Bessel series are analogous to Fourier series. And like Fourier series, they converge pointwise near a discontinuity with the same kind of overshoot and undershoot known as the Gibbs phenomenon.

Fourier-Bessel series

Bessel functions come up naturally when working in polar coordinates, just as sines and cosines come up naturally when working in rectangular coordinates. You can think of Bessel functions as a sort of variation on sine waves. Or even more accurately, a variation on sinc functions, where sinc(z) = sin(z)/z. [1]

A Fourier series represents a function as a sum of sines and cosines of different frequencies. To make things a little simpler here, I’ll only consider Fourier sine series so I don’t have to repeatedly say “and cosine.”

f(z) = \sum_{n=1}^\infty c_n \sin(n \pi z)

A Fourier-Bessel function does something similar. It represents a function as a sum of rescaled versions of a particular Bessel function. We’ll use the Bessel J0 here, but you could pick some other Jν.

Fourier series scale the sine and cosine functions by π times integers, i.e. sin(πz), sin(2πz), sin(3πz), etc. Fourier-Bessel series scale by the zeros of the Bessel function: J0(λ1z), J0(λ2z), J0(λ3z), etc., where λn are the zeros of J0. This is analogous to scaling sin(z) by its roots: π, 2π, 3π, etc. So a Fourier-Bessel series for a function f looks like

f(z) = \sum_{n=1}^\infty c_n J_0(\lambda_n z).

The coefficients cn for Fourier-Bessel series can be computed analogously to Fourier coefficients, but with a couple minor complications. First, the basis functions of a Fourier series are orthogonal over [0, 1] without any explicit weight, i.e. with weight 1. And second, the inner product of a basis function with itself doesn’t depend on the frequency. In detail,

\int_0^1 \sin(m \pi z) \, \sin(n \pi z) \, dz = \frac{\delta_{mn}}{2}.

Here δmn equals 1 if m = n and 0 otherwise.

Fourier-Bessel basis functions are orthogonal with a weight z, and the inner product of a basis function with itself depends on the frequency. In detail

\int_0^1 J_0(\lambda_m z) \, J_0(\lambda_n z) \, z\, dz = \frac{\delta_{mn}}{2} J_1(\lambda_n)^2.

So whereas the coefficients for a Fourier sine series are given by

c_n = 2 \int_0^1 f(z)\, \sin(n\pi z) \,dz

the coefficients for a Fourier-Bessel series are given by

c_n = \frac{ 2\int_0^1 f(z)\, J_0(\lambda_n z) \, z \, dz}{ J_1(\lambda_n)^2 }.
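
As a sanity check, not part of the derivation above, here’s a bit of Python that verifies the weighted orthogonality relation numerically.

    from scipy.integrate import quad
    from scipy.special import j0, j1, jn_zeros

    lam = jn_zeros(0, 3)   # first three zeros of J0

    def inner(m, n):
        # weighted inner product of J0(lambda_m z) and J0(lambda_n z) on [0, 1]
        return quad(lambda z: j0(lam[m]*z)*j0(lam[n]*z)*z, 0, 1)[0]

    print( inner(0, 1) )                     # approximately 0
    print( inner(1, 1), j1(lam[1])**2 / 2 )  # these two agree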

Gibbs phenomenon

Fourier and Fourier-Bessel series are examples of orthogonal series, and so by construction they converge in the norm given by their associated inner product. That means that if SN is the Nth partial sum of a Fourier series

\lim_{N\to\infty} \int_0^1 \left( f(z) - S_N(z) \right)^2 \, dz = 0

and the analogous statement for a Fourier-Bessel series is

\lim_{N\to\infty} \int_0^1 \left( f(z) - S_N(z) \right)^2 \, z\, dz = 0.

In short, the series converge in a (weighted) L² norm. But how do the series converge pointwise? A lot of harmonic analysis is devoted to answering this question: what conditions on the function f guarantee what kind of behavior of the partial sums of the series expansion.

If we look at the Fourier series for a step function, the partial sums converge pointwise everywhere except at the step discontinuity. But the way they converge is interesting. You get a sort of “bat ear” phenomenon where the partial sums overshoot the step function at the discontinuity. This is called the Gibbs phenomenon after Josiah Willard Gibbs, who observed the effect in 1899. (Henry Wilbraham observed the same thing earlier, but Gibbs didn’t know that.)

The Gibbs phenomenon is well known for Fourier series. It’s not as well known that the same phenomenon occurs for other orthogonal series, such as Fourier-Bessel series. I’ll give an example of the Gibbs phenomenon for Fourier-Bessel series taken from [2] and give Python code to visualize it.

We take our function f(z) to be 1 on [0, 1/2] and 0 on (1/2, 1]. It works out that

c_n = \frac{1}{\lambda_n} \frac{J_1(\lambda_n / 2)}{ J_1(\lambda_n)^2 }.

Python code and plot

Here’s the plot with 100 terms. Notice how the partial sums overshoot the mark to the left of 1/2 and undershoot to the right of 1/2.

Plot showing Gibbs phenomena for Fourier-Bessel series

Here’s the same plot with 1,000 terms.

Gibbs phenomena for 1000 terms of Fourier-Bessel series

Here’s the Python code that produced the plot.

    import matplotlib.pyplot as plt
    from scipy.special import j0, j1, jn_zeros
    from numpy import linspace

    N = 100 # number of terms in series

    roots = jn_zeros(0, N)
    coeff = [j1(r/2) / (r*j1(r)**2) for r in roots]
    z = linspace(0, 1, 200)

    def partial_sum(z):
        return sum([coeff[i]*j0(roots[i]*z) for i in range(N)])

    plt.plot(z, partial_sum(z))
    plt.xlabel("z")
    plt.ylabel("{}th partial sum".format(N))
    plt.show()

Footnotes

[1] To be precise, as z goes to infinity

J_\nu(z) \sim \sqrt{\frac{2}{\pi z}} \cos\left(z - \frac{1}{2} \nu \pi - \frac{\pi}{4} \right)

and so the Bessel functions are asymptotically proportional to sin(z – φ)/√z for some phase shift φ.

[2] Temple H. Fay and P. Kendrik Kloppers. The Gibbs’ phenomenon for Fourier-Bessel series. International Journal of Mathematical Education in Science and Technology, vol. 34, no. 2, 2003, pp. 199-217.

Animated exponential sum

I’m experimenting with making animated versions of the kinds of images I wrote about in my previous post. Here’s an animated version of the exponential sum of the day for 12/4/17.

Why that date? I wanted to start with something with a fairly small period, and that one looked interesting. I’ll have to do something different for the images that have a much longer period.

Image made in collaboration with Go 3D Now.

Animated gif of exponential sum

 

Database anonymization for testing

How do you create a database for testing that is like your production database? It depends on the way in which you want the test database to be “like” the production one.

Replacing sensitive data

Companies often use an old version of their production database for testing. But what if the production database has sensitive information that software developers and testers should not have access to?

You can’t completely remove customer phone numbers from the database, for example, if your software handles customer phone numbers. You have to replace the sensitive data with modified data. The question becomes how to modify it. Three approaches would be

  1. Use the original data.
  2. Generate completely new artificial data.
  3. Use the real data as a guide to generating new data.

We’ll assume the first option is off the table and consider the pros and cons of the other two options.

For example, suppose you collect customer ages. You could replace customer age with a random two-digit number. That’s fine as far as making sure that forms can display two-digit numbers. But maybe the age values matter. Maybe you want your fictional customers in the test database to have the same age distribution as your real customers. Or maybe you want your fictional customer ages to be correlated with other attributes so that you don’t have 11 year-old retirees or 98 year-old clients who can’t legally purchase alcohol.
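
Here’s a toy sketch contrasting the second and third options, with made-up “real” ages standing in for a production column.

    import numpy as np

    rng = np.random.default_rng(42)

    # stand-in for the real customer ages (made up)
    real_ages = rng.normal(45, 15, size=10_000).clip(18, 95).astype(int)

    # option 2: completely artificial ages (fine for form layout, wrong distribution)
    artificial = rng.integers(10, 100, size=real_ages.size)

    # option 3: resample from the real ages so the test data matches their distribution
    realistic = rng.choice(real_ages, size=real_ages.size, replace=True)

    print( real_ages.mean(), artificial.mean(), realistic.mean() )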

Random vs realistic

There are pros and cons to having a realistic test database. A database filled with randomly generated data is likely to find more bugs, but a realistic database is likely to find more important bugs.

Randomly generated data may contain combinations that have yet to occur in the production data, combinations that will cause an error when they do come up in production. Maybe you’ve never sold your product to someone in Louisiana, and there’s a latent bug that will show up the first time someone from Louisiana places an order. (For example, Louisiana retains vestiges of French law that make it different from all other states.)

On the other hand, randomly generated data may not find the bugs that affect the most customers. You might want the values in your test database to be distributed similarly to the values in real data so that bugs come up in testing with roughly the same frequency as in production. In that case, you probably want the joint distributions to match and not just the unconditional distributions. If you just match the latter, you could run into oddities such as a large number of teenage retirees as mentioned above.

So do you want a random test database or a realistic test database? Maybe both. It depends on your purposes and priorities. You might want to start by testing against a realistic database so that you first find the bugs that are likely to affect the largest number of customers. Then maybe you switch to a randomized database that is more effective at flushing out problems with edge cases.

How to make a realistic test database

So how would you go about creating a realistic test database that protects customer privacy? The answer depends on several factors. First of all, it depends on what aspects of the real data you want to preserve. Maybe verisimilitude is more important for some fields than others. Once you decide what aspects you want your test database to approximate, how well do you need to approximate them? If you want to do valid statistical analysis on the test database, you may need something sophisticated like differential privacy. But if you just want moderately realistic test cases, you can do something much simpler.

Finally, you have to address the privacy-utility trade-off. What kinds of privacy protection are you ethically and legally obligated to provide? For example, is your data considered PHI under HIPAA regulations? Once your privacy obligations are clear, you look for ways to maximize your utility subject to these privacy constraints.

If you’d like help with this process, let’s talk. I can help you determine what your obligations are and how best to meet them while meeting your business objectives.

Recent exponential sums

The exponential sum of the day draws a line between consecutive partial sums of

\sum_{n=0}^N \exp\left( 2\pi i \left( \frac{n}{m} + \frac{n^2}{d} + \frac{n^3}{y} \right ) \right )

where m, d, and y are the current month, day, and two-digit year. The four most recent images show how different these plots can be.

exponential sums

These images are from 10/30/17, 10/31/17, 11/1/17, and 11/2/17.
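
Here’s a short Python sketch of how such a plot can be drawn, using 10/31/17 as an example; this isn’t necessarily how the exponential sum pages are generated. It plots one full period of the partial sums, as discussed below.

    import numpy as np
    import matplotlib.pyplot as plt

    m, d, y = 10, 31, 17   # month, day, two-digit year

    n = np.arange(np.lcm.reduce([m, d, y]) + 1)
    phase = (n % m)/m + (n**2 % d)/d + (n**3 % y)/y
    z = np.cumsum(np.exp(2j*np.pi*phase))   # partial sums

    plt.plot(z.real, z.imag)
    plt.axis("equal")
    plt.show()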

Consecutive dates often produce very different images for a couple reasons. First, consecutive integers are relatively prime. From a number theoretic perspective, 30 and 31 are very different, for example. (This touches on the motivation for p-adic numbers: put a different metric on integers, one based on their prime factorization.)

The other reason consecutive dates produce qualitatively different images is that you might roll over a month or year, such as going from October (10) to November (11). You’ll see a big change when we roll over from 2017 to 2018.

The partial sums are periodic [1] with period lcm(m, d, y). The image for 10/31/17 above has the most points because 10, 31, and 17 are relatively prime. It’s also true that 11, 2, and 17 are relatively prime, but these are smaller numbers.

You could think of the month, day, and year components of the sum as three different gears. The sums repeat when all three gears return to their initial positions. In the image for yesterday, 11/1/17, the middle gear is effectively not there.

[1] The terms of the sum are periodic. The partial sums are periodic if the full sum is zero. So if the partial sums are periodic, the lcm is a period.