Buy one, get one free

[Figure: phase portrait of the Jacobi system with k = 1/2]

The two latest blog posts have elaborated on the fact that when a function satisfies a second order differential equation, once you’ve evaluated the function and its derivative, you get the second derivative for free, or at least for cheap.

Sometimes functions are so closely related that software libraries bundle them together. For example, in Python’s scientific computing library SciPy there is no function for returning just one Jacobi elliptic function. Because it takes about as much work to compute one of the functions sn, cn, dn, and ph as it does to compute all of them, SciPy provides a function ellipj that returns all four at once. If you only want to compute sn, call ellipj and discard all but the first of the four values returned. No loss. But if you need two or more of the functions, the bundling saves compute time.
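
For example, a single call returns all four functions at once. (Note that SciPy parameterizes the functions by m = k² rather than by k.)

    from scipy.special import ellipj

    # sn, cn, dn, and the amplitude ph, all from one call
    sn, cn, dn, ph = ellipj(0.5, 0.25)  # u = 0.5, m = k^2 = 0.25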

Jacobi functions

I haven’t looked at how SciPy computes Jacobi functions, but I imagine the code to compute the various functions overlaps so much that the additional cost of computing all the functions is minimal if you’re going to compute one of them. Jacobi functions are related to theta functions, and theta functions can be computed very efficiently.

I wrote about the interconnections between Jacobi functions in an earlier post. The three functions sn, cn, and dn all satisfy a simple system of differential equations

\begin{eqnarray*} x' &=& yz \\ y' &=& -zx \\ z' &=& -k^2 xy \end{eqnarray*}

with initial conditions x(0) = 0, y(0) = 1, and z(0) = 1. The image at the top of the post plots (sn(t), cn(t), dn(t)) and is taken from that post. The Jacobi functions are analogous to trig functions, and like trig functions they satisfy myriad identities relating the functions to each other.
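
Here’s a minimal sketch of how a plot like the one at the top of the post could be produced: integrate the system numerically and plot the resulting space curve. The time span and tolerance are arbitrary choices.

    from scipy.integrate import solve_ivp
    import numpy as np
    import matplotlib.pyplot as plt

    k = 0.5

    def rhs(t, u):
        x, y, z = u
        return [y*z, -z*x, -k**2 * x * y]

    # integrate with initial conditions x(0) = 0, y(0) = 1, z(0) = 1
    sol = solve_ivp(rhs, [0, 40], [0, 1, 1], dense_output=True, rtol=1e-9)
    t = np.linspace(0, 40, 2000)
    x, y, z = sol.sol(t)

    ax = plt.figure().add_subplot(projection="3d")
    ax.plot(x, y, z)
    plt.show()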

Normal random values

An example of BOGO (buy one, get one free) in probability is generating samples from a normal random variable. The most common approach generates two uniform samples and transforms them into two normal samples. So if you ask the software for one normal random value, it could give you a second one at no extra computational expense. Using this approach, a function that returns a vector of 2n normal samples could run in about the time it would take to request a single sample n times.
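
The transformation described here is the Box-Muller transform: two independent uniform samples in, two independent standard normal samples out. A minimal sketch:

    import numpy as np

    def two_normals(rng):
        # two uniforms in, two independent standard normals out
        u1, u2 = rng.uniform(), rng.uniform()
        r = np.sqrt(-2 * np.log(u1))
        return r * np.cos(2 * np.pi * u2), r * np.sin(2 * np.pi * u2)

    rng = np.random.default_rng()
    print(two_normals(rng))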

Automatic differentiation

A final example that I’ll mention is automatic differentiation. With this technique it is often possible to evaluate a function and its gradient in little more time than it takes to evaluate the function alone. Having an explicit form of the derivative is not only unnecessary, it may be less efficient than letting automatic differentiation compute the derivative.
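
As an illustration of the principle, and not of how production AD libraries work, here is forward-mode automatic differentiation sketched with dual numbers: each arithmetic operation propagates a value and a derivative together, so one pass through the function yields both.

    class Dual:
        """A value paired with its derivative."""
        def __init__(self, val, der=0.0):
            self.val, self.der = val, der
        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + other.val, self.der + other.der)
        __radd__ = __add__
        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # product rule
            return Dual(self.val * other.val,
                        self.der * other.val + self.val * other.der)
        __rmul__ = __mul__

    def value_and_derivative(f, x):
        y = f(Dual(x, 1.0))  # seed the derivative with dx/dx = 1
        return y.val, y.der

    # f(x) = x^3 + 2x, so f(3) = 33 and f'(3) = 29
    print(value_and_derivative(lambda x: x*x*x + 2*x, 3.0))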

Cautionary example

On the flip side, it sometimes seems that you can buy one and get one free when you can’t or shouldn’t. For example, software libraries often have routines to evaluate the error function erf(x) and its complement erfc(x) = 1 – erf(x). In theory you can calculate one from the other, and often in practice too. But if erf(x) is close to 1, 1 – erf(x) might evaluate to zero in floating point arithmetic even though erfc(x) is not zero. For more examples along these lines, see Math functions that seem unnecessary.
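
The cancellation is easy to demonstrate:

    from scipy.special import erf, erfc

    x = 10.0
    print(1 - erf(x))  # 0.0, because erf(10) rounds to 1 in floating point
    print(erfc(x))     # about 2.1e-45, computed directly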

Halley’s variation on Newton’s method

Newton’s method for solving f(x) = 0 converges quickly once it starts to converge. Specifically, it converges quadratically: the error is roughly squared at each step. This means that the number of correct decimal places doubles at each iteration of Newton’s method.

Doubling the number of decimal places is nice, but what if we could triple the number of correct decimals at each step? Edmond Halley, best known for the comet that bears his name, came up with a way to do just that. Newton’s method converges quadratically, but Halley’s method converges cubically.

Newton’s method updates iterations using the rule

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}

Halley’s variation uses the rule

x_{n+1} = x_n - \frac{1}{\dfrac{f'(x_n)}{f(x_n)} - \dfrac{f''(x_n)}{2f'(x_n)}}

Halley’s method makes more progress per iteration, but it also takes more work per iteration, and so in practice it’s not usually an improvement over Newton’s method. But in some cases it could be.

Evaluating efficiency

In root-finding problems, the function f is usually relatively time-consuming to evaluate. We can assume nearly all the work in root-finding goes into evaluating f and its derivatives, and we can ignore the rest of the arithmetic in the method.

Newton’s method requires evaluating the function f and its derivative f ′. Halley’s method requires these function evaluations as well, and also requires evaluating the second derivative f ′′.

Three iterations of Newton’s method make about as much progress toward finding a root as two iterations of Halley’s method. But since Newton’s method requires two function evaluations per iteration and Halley’s method requires three, both methods require six function evaluations to make the same amount of progress. So in general Halley’s method is not an improvement over Newton’s method.

When Halley might be better

Suppose we had a way to quickly compute the second derivative of f at a point from the values of f and its first derivative at that point, rather than by having to evaluate the second derivative explicitly. In that case two iterations of Halley’s method might make as much progress as three iterations of Newton’s method but with less effort, four function evaluations rather than six.

Many of the functions that are used most in applied mathematics satisfy second order differential equations, and so it is common to be able to compute the second derivative from the function and its derivative.

For example, suppose we want to find the zeros of Bessel functions Jν. Bessel functions satisfy Bessel’s differential equation; that’s where Bessel functions get their name.

x^2 y'' + x y' + \left(x^2 - \nu^2 \right) y = 0

This means that once we have evaluated Jν and its first derivative, we can compute the second derivative from these values as follows.

J_{\nu}''(x_n) = \frac{(\nu^2 - x_n^2)J_\nu(x_n) - x_nJ_\nu'(x_n)}{x_n^2}
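
Here’s a sketch of how this might look in Python, using SciPy’s Bessel functions and the differential equation to supply the second derivative. The starting point 2.0 and the fixed number of iterations are arbitrary choices for illustration.

    from scipy.special import jv, jvp

    def bessel_halley(nu, x, iterations=5):
        for _ in range(iterations):
            f  = jv(nu, x)    # J_nu(x)
            fp = jvp(nu, x)   # J_nu'(x)
            # second derivative via Bessel's differential equation
            fpp = ((nu**2 - x**2) * f - x * fp) / x**2
            x = x - 1 / (fp/f - fpp/(2*fp))
        return x

    print(bessel_halley(0, 2.0))  # first zero of J_0, approximately 2.4048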

Maybe we’re not working with a well-studied function like a Bessel function but rather we are numerically solving a differential equation of our own creation and we want to know where the solution is zero. Halley’s method would be a natural fit because our numerical method will give us the values of y and y ′ at each step.


Activation functions and Iverson brackets

Neural network activation functions transform the output of one layer of the neural net into the input for another layer. These functions must be nonlinear because the universal approximation theorem, the theorem that basically says a two-layer neural net can approximate any continuous function, requires nonlinearity.

[Figure: plot of the Heaviside function]

Activation functions often have two-part definitions, defined one way for negative inputs and another way for positive inputs, and so they’re ideal for Iverson notation. For example, the Heaviside function plotted above is defined to be

f(x) = \left\{ \begin{array}{ll} 1 & \mbox{if } x > 0 \\ 0 & \mbox{if } x \leq 0 \end{array} \right.

Kenneth Iverson’s bracket notation, first developed for the APL programming language but adopted more widely, uses brackets around a Boolean expression to indicate the function that is 1 when the expression is true and 0 otherwise. With this notation, the Heaviside function can be written simply as

f(x) = [x > 0]

Iverson notation is fairly common, but not quite so common that I feel like I can use it without explanation. I find it very handy and would like to popularize it. The rest of the post will give more examples.

ReLU

The popular ReLU (rectified linear unit) function is defined as

f(x) = \left\{ \begin{array}{ll} x & \mbox{if } x > 0 \\ 0 & \mbox{if } x \leq 0 \end{array} \right.

and with Iverson bracket notation as

f(x) = x[x > 0]

The ReLU activation function is the identity function multiplied by the Heaviside function. It’s not the best example of the value of bracket notation since it could be written simply as max(0, x). The next example is better.

ELU


The ELU (exponential linear unit) is a variation on the ReLU that, unlike the ReLU, is differentiable at 0.

f(x) = \left\{ \begin{array}{ll} x & \mbox{if } x > 0 \\ e^x - 1 & \mbox{if } x \leq 0 \end{array} \right.

The ELU can be described succinctly in bracket notation.

f(x) = (e^x-1)[x \leq 0] + x[x > 0]

PReLU

[Figure: plot of the PReLU]

The PReLU (parametric rectified linear unit) depends on a small positive parameter a. This parameter must not equal 1, because then the function would be linear and the universal approximation theorem would not apply.

f(x) = \left\{ \begin{array}{ll} x & \mbox{if } x > 0 \\ ax & \mbox{if } x \leq 0 \end{array} \right.

In Iverson notation:

f(x) = ax[x \leq 0] + x[x > 0]
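
Iverson brackets translate directly into NumPy code, since Boolean comparisons behave as 0s and 1s under arithmetic. A sketch of the activations above:

    import numpy as np

    def heaviside(x):
        return (x > 0) * 1  # the Iverson bracket [x > 0]

    def relu(x):
        return x * (x > 0)

    def elu(x):
        # note: np.exp is evaluated everywhere, which is fine for a sketch
        return (np.exp(x) - 1) * (x <= 0) + x * (x > 0)

    def prelu(x, a):
        return a * x * (x <= 0) + x * (x > 0)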


Enumerating monotone Boolean functions

The 9th Dedekind number was recently computed. What is a Dedekind number and why was it a big deal to compute just the 9th one of them?

We need to define a couple terms before we can define Dedekind numbers.

A Boolean function is a function whose inputs are 0’s and 1’s and whose output is 0 or 1.

A function f is monotone if increasing the input cannot decrease the output:

if x ≤ y then f(x) ≤ f(y).

Obviously a monotone Boolean function is a Boolean function that is monotone, but monotone with respect to what order? How are we to define x ≤ y when x and y are sequences of bits?

There are numerous ways one might order the inputs, but the conventional order [1] in this context is to say x ≤ y if every bit in x is less than or equal to the corresponding bit in y. So if the ith bit of x is a 1, then the ith bit of y must be a 1.

A Boolean function is monotone if and only if flipping an input bit from 0 to 1 cannot change the output from 1 to 0.

Enumerating monotone Boolean functions

The nth Dedekind number M(n) is the number of monotone Boolean functions of n variables. We’ll enumerate a few of these. Let a, b, and c be Boolean variables and denote AND by ∧ and OR by ∨. As usual, we assume ∧ is higher precedence than ∨ so that, for example,

x ∨ y ∧ z

means

x ∨ (y ∧ z).

One variable

There are three monotone functions of a single variable a: the constant function 0, the identity function a, and the constant function 1.

  • 0
  • a
  • 1

The only Boolean function of one variable that isn’t monotone is the function that flips a, i.e. f(a) = ¬a.

Two variables

There are six monotone Boolean functions with two variables:

  • 0
  • a
  • b
  • a ∧ b
  • a ∨ b
  • 1

and so M(2) = 6.

We can verify that the six functions above are monotone with the following Python code.

    from itertools import product
    
    f = [None]*6
    f[0] = lambda a, b: 0
    f[1] = lambda a, b: a
    f[2] = lambda a, b: b
    f[3] = lambda a, b: a & b
    f[4] = lambda a, b: a | b
    f[5] = lambda a, b: 1
    
    # compare f at every pair of comparable inputs
    for i in range(6):
        for (a, b) in product((0,1), repeat=2):
            for (x, y) in product((0,1), repeat=2):
                if a <= x and b <= y:
                    assert(f[i](a, b) <= f[i](x, y))

Three variables

There are 20 monotone Boolean functions of three variables:

  • 0
  • a
  • b
  • c
  • a ∧ b
  • b ∧ c
  • a ∧ c
  • a ∨ b
  • b ∨ c
  • a ∨ c
  • a ∧ b ∨ c
  • b ∧ c ∨ a
  • a ∧ c ∨ b
  • a ∧ b ∨ b ∧ c
  • a ∧ c ∨ b ∧ c
  • a ∧ b ∨ a ∧ c
  • a ∧ b ∨ b ∧ c ∨ a ∧ c
  • a ∧ b ∧ c
  • a ∨ b ∨ c
  • 1

and so M(3) = 20.

As before, we can verify that the functions above are monotone with a script.

    g = [None]*20
    g[ 0] = lambda a, b, c: 0
    g[ 1] = lambda a, b, c: a 
    g[ 2] = lambda a, b, c: b
    g[ 3] = lambda a, b, c: c
    g[ 4] = lambda a, b, c: a & b
    g[ 5] = lambda a, b, c: b & c
    g[ 6] = lambda a, b, c: a & c
    g[ 7] = lambda a, b, c: a | b
    g[ 8] = lambda a, b, c: b | c
    g[ 9] = lambda a, b, c: a | c
    g[10] = lambda a, b, c: a & b | c
    g[11] = lambda a, b, c: b & c | a
    g[12] = lambda a, b, c: a & c | b
    g[13] = lambda a, b, c: a & b | b & c
    g[14] = lambda a, b, c: a & c | b & c
    g[15] = lambda a, b, c: a & b | a & c
    g[16] = lambda a, b, c: a & b | b & c | a & c
    g[17] = lambda a, b, c: a & b & c
    g[18] = lambda a, b, c: a | b | c 
    g[19] = lambda a, b, c: 1
    
    for i in range(20):
        for (a, b, c) in product((0,1), repeat=3):
            for (x, y, z) in product((0,1), repeat=3):
                if a <= x and b <= y and c <= z:
                    assert(g[i](a, b, c) <= g[i](x, y, z))

More variables

The concrete approach to enumerating monotone Boolean functions does not scale. There are 168 monotone functions of four variables, 7,581 of five variables, and 7,828,354 of six variables. The Dedekind numbers M(n) grow very quickly. The next post will quantify just how quickly.


[1] This “order” is technically a partial order. If x = (0, 1) and y = (1, 0) then x and y are not comparable; neither is less than or equal to the other.

Plotting a function with a lot of local minima

Stefan Steinerberger defines “an amusing sequence of functions” in [1] by

f_n(x) = \sum_{k=1}^n \frac{|\sin( k\pi x )|}{k}

Here’s a plot of f30(x):

As you can see, fn(x) has a lot of local minima, and the number of local minima increases rapidly with n.

Here’s a naive attempt to produce the plot above using Python.

    from numpy import sin, pi, linspace
    import matplotlib.pyplot as plt

    def f(x, n):
        return sum( abs(sin(k*pi*x))/k  for k in range(1, n+1) )

    # plot f_30 on [0, 1]
    x = linspace(0, 1)
    plt.plot(x, f(x, 30))
    plt.show()

This produces the following.

The code above doesn’t specify how finely to divide the interval [0, 1], and by default linspace will produce 50 evenly spaced points. In the example above we need a lot more points. But how many? This is not a simple question.

You could try more points, adding a third argument to linspace, say 1000. That will produce something that looks more like the first plot, but is the first plot right?
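
For example, using the corrected script above:

    x = linspace(0, 1, 1000)
    plt.plot(x, f(x, 30))
    plt.show()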

The first plot was produced by the Mathematica statements

    f[x_, n_] := Sum[Abs[Sin[Pi k x]/k], {k, 1, n}]
    Plot[f[x, 30], {x, 0, 1}]

Mathematica adaptively determines how many plotting points to use. It tries to detect when a function is changing rapidly in some region and allocate more points there. The function that is the subject of this post makes a good test case for such automated refinement.

Often Mathematica does a good enough job automatically, but sometimes it needs help. The Plot function takes additional parameters like PlotPoints and MaxRecursion that you can use to tell Mathematica to work harder.

It would be an interesting exercise to make a high quality plot of fn for some moderately large n, making sure you’ve captured all the local minima.

[1] Stefan Steinerberger. An Amusing Sequence of Functions. Mathematics Magazine, Vol. 91, No. 4 (October 2018), pp. 262–266

Productive productivity

I skimmed Automate Your Busywork the other day and realized I have already automated most of my busywork. I don’t have a lot of repetitive tasks to do, and I’ve written scripts to streamline most of the repetitive tasks I do have.

The scripts that have been most useful are of zero interest to anyone else because they are very specific to my work. I imagine that’s true of most scripts ever written.

Here are a couple things that have improved my productivity that may be of some general interest.

I found it helpful to remap a few keys so I can have cross platform muscle memory. That is, the same keys do the same thing whether I’m using Mac, Windows, or Linux. (And to some extent Emacs, which is kinda like its own operating system.) And by the same keys, I mean the same key positions, not the same key labels.

I’ve also found it helpful to do as much as practical with plain text files and tools designed to work with plain text files. Plain text files are small, easy to search, and amenable to version control. They are robust and transparent. And they’re interoperable. I’ve found, for example, that even if a task is easier to do in Microsoft Word than in LaTeX, it’s better for my long-term productivity to use LaTeX.

Both of these are rather pedestrian tips, not photogenic or technically impressive, but useful, at least in my experience. And I suspect that the more likely a tip is to catch your attention when surfing the web, the less likely it is to be genuinely useful. Of course your mileage may vary.

Computing Stirling numbers with limited integers

A couple days ago I wrote a post about a probability problem that involved calculating Stirling numbers. There are two kinds of Stirling numbers, creatively called “Stirling numbers of the first kind” and “Stirling numbers of the second kind.” The second kind come up more often in application, and so when authors say “Stirling numbers” without qualification, they mean the second kind. That’s the convention I’ll use here.

The Stirling number S(n, k) can be computed via the following series, though we will take a different approach for reasons explained below.

S(n,k) = \frac{1}{k!}\sum_{i = 0}^k (-1)^i {k \choose i} (k-i)^n

Problem statement

This made me think of the following problem. Suppose you want to calculate Stirling numbers. You want to use integer arithmetic in order to get exact results.

And suppose you can use integers as large as you like, but you have to specify the maximum integer size before you start calculating. Imagine you have to pay $N to work with N-digit integers, so you want to not use a larger value of N than necessary.

Intermediate results

There are a couple problems with using the equation above. A direct implementation would first calculate k! S(n, k), then divide by k! to get S(n, k). This would be costly because it requires calculating a number k! times bigger than the desired final result. (You could refactor the equation so that there is no division by k! at the end, but this gives an expression that includes non-integer terms.)

In the earlier post, I actually needed to calculate k! S(n, k) rather than S(n, k) and so this wasn’t a problem. This brings up the second problem with the equation above. It’s an alternating series, and it’s not clear whether evaluating the series produces intermediate results that are larger than the final result.

Recursion

Stirling numbers can be computed recursively using

S(n+1, k) = k S(n,k) + S(n, k-1)

with the base cases S(n, n) = 1 for n ≥ 0 and S(n, 0) = S(0, n) = 0 for n > 0. This computes Stirling numbers via a sum of smaller Stirling numbers, and so the intermediate results are no larger than the final result.
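
Here is a minimal sketch of the recursion in Python. Memoization keeps the recursion from recomputing the same values repeatedly, and Python’s integers can grow as large as needed.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def stirling2(n, k):
        if n == k:            # S(n, n) = 1, including S(0, 0)
            return 1
        if n == 0 or k == 0:  # S(n, 0) = S(0, n) = 0 for n > 0
            return 0
        return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)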

So now we’re left with the question of how big Stirling numbers can be. Before computing S(n, k) you need an upper bound on how many digits it will have.

Bounding Stirling numbers

Stirling numbers can be bounded in terms of binomial coefficients.

S(n, k) \leq \frac{1}{2} \binom{n}{k}k^{n-k}

This reduces the problem to finding an upper bound on binomial coefficients. I wrote about a simple bound on binomial coefficients here.
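
As a quick sanity check of the bound, here’s a test against the recursive implementation above for small n. (The bound as stated applies for 1 ≤ k ≤ n-1; it is an equality when k = n-1.)

    from math import comb

    for n in range(2, 15):
        for k in range(1, n):
            # multiply through by 2 to stay in integer arithmetic
            assert 2 * stirling2(n, k) <= comb(n, k) * k**(n - k)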

Data file character frequencies

I have a little script that will print the frequency of the most common characters in a file and the number of lines. All numbers are displayed along with their factorizations. It also prints the number of non-ASCII characters.
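
Here’s a sketch of such a script, using sympy’s factorint for the factorizations; it assumes the file fits in memory.

    import sys
    from collections import Counter
    from sympy import factorint

    def fmt(n):
        # render n with its prime factorization, e.g. 72 = 2^3 3^2
        factors = " ".join(f"{p}^{e}" if e > 1 else f"{p}"
                           for p, e in factorint(n).items())
        return f"{n} = {factors}"

    text = open(sys.argv[1], encoding="utf-8").read()
    counts = Counter(text)
    for ch, n in counts.most_common(6):
        if ch != "\n":
            print(repr(ch), fmt(n))
    print("Num lines:", fmt(text.count("\n")))
    non_ascii = sum(n for ch, n in counts.items() if ord(ch) > 127)
    print("Non-ASCII characters:", non_ascii)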

CSV files

These simple statistics are surprisingly useful. For example, when I ran it on a CSV file that I downloaded recently I got the following.

    ,   397907424  =  2^5 3^3 19 24239 
    0   58200944  =  2^4 1699 2141 
    2   52955465  =  5 467 22679 
    1   46413310  =  2 5 23 201797 
    3   34811225  =  5^2 1392449 

    Num lines:  1745208  =  2^3 3^2 24239 

    All ASCII characters

This strongly implies that the CSV file really is a CSV (comma-separated value) file. Sometimes you’ll get a file with a .csv extension but the separator is a tab, a pipe, or some other character.

The number of commas is a multiple of the number of lines. That’s a good sign. Apparently this is a CSV file with 12×19 columns and 1,745,208 rows [1]. If the number of separators is not a multiple of the number of lines, maybe some lines are incomplete. Or maybe your file separator appears inside a quoted string. This is not necessarily a problem, but it means the most naive parsing won’t work.

In the file above, the most common characters, other than commas, are digits, so the file probably contains mostly numeric data.

If your file contains quotation marks, better hope it contains an even number. Even better, an even multiple of the number of lines. If not, you have some troubleshooting to do.

Incidentally, whenever the subject of CSV files comes up, someone will say “Why are you using CSV files?! Don’t you know there are better formats?” My reply is that I’m a consultant and I take data in whatever format I can get it, and that most often means a delimiter-separated text file. That works fine, except, of course, when it doesn’t.

Unicode characters

A file with lots of non-ASCII characters is not a problem. A file with one non-ASCII character very often is a problem.

A single non-ASCII character could be an invisible character that will gum up parsing. This can be maddening to find if you’re relying on visual inspection. But if you know there’s a non-ASCII character where it shouldn’t be, such as in a file of digits and commas, then you can simply delete it.

JSON

If you’re inspecting a JSON file, you’d expect to see lots of braces. Hopefully you have an equal number of open and close braces. But if not, you know where to begin troubleshooting. You should also expect a lot of colons. Knowing the number of braces and colons gives you a clue to the structure of the file.

Troubleshooting and guessing

When a file has no complications, the stats above tell you things you’d know from looking at the first line or two of the file. However, when there are complications, the stats can be useful.

The stats could also be useful in a context where it’s OK to make guesses. For example, you might have a script that guesses the structure of a file and proceeds accordingly. That’s fine when wrong guesses lead to obviously wrong output. It’s hard to imagine, for example, that mistaking an XML file for a CSV file would produce a subtle error.


[1] The number of fields is one more than the number of separators. Or maybe not: there may be a trailing separator after the last field. So in this case there may be 228 or 229 columns.

Query, then deidentify

Suppose you have a database of personally identifiable information (PII) and you want to allow someone else to query the data while protecting the privacy of the individuals represented by the data. There are two approaches:

  1. Deidentify, then query
  2. Query, then deidentify

The first approach is to do whatever is necessary to deidentify the data—remove some fields, truncate or randomize others, etc.—and then pose a query to this redacted data.

The second approach is to query the original data, then do whatever is necessary to deidentify the results.

In graphical terms, you can get from raw data to a deidentified result either by following the green arrows or the blue arrows below. In mathematical terms, this diagram does not commute.

The first approach is most common. A company that owns data (a “covered entity” in HIPAA terms) will deidentify it and license it to another company who then queries it. The second approach is becoming more common, where a company will license access to querying their data.

Pros and cons

Which approach is better? If by better you mean more accurate results, it’s always best to query first then deidentify. The order in which you do things matters, and deidentifying as late as possible preserves information.

The situation is analogous to carrying out a sequence of steps on a calculator. If you want your final result to be accurate to two decimal places, you first carry out all your operations to as much precision as you can, then round the final result. If you round your numbers first, you probably will get less accurate results, maybe even useless results.

However, deidentifying data before querying it is better in some non-mathematical ways. Data scientists want the convenience of working with the data with their tools in their environment. They want to possess (a deidentified version of) the data rather than have access to query the (exact) data. They also want the freedom to run ad hoc queries [1].

There are logistical and legal details to work out in order to license access to query data rather than licensing the data. But it is doable, and companies are doing it.

Why query first

When you deidentify data first, you have to guard against every possible use of the data. But when you deidentify data last, you only have to guard against the actual use of the data.

For example, suppose you are considering creating a new clinic and you would like to know how many patients of a certain type live closer to the location you have in mind than the nearest alternative. A data vendor cannot give you exact locations of patients. If they were to release such data, they’d have to obscure the addresses somehow, such as giving you the first three digits of zip codes rather than full addresses. But if you could ask your query of someone holding the full data, they may tell you exactly what you want to know.

Some queries may pose no privacy risk, and the data holder can return exact results. Or they may need to jitter the result a bit in order to protect privacy, for reasons explained here. But it’s better to jitter an exact result than to jitter your data before computing.

How to query first

The query-first approach requires a trusted party to hold the unredacted data. There are a variety of ways the data holder can license access, from simple to sophisticated.

The simplest approach would be for the data holder to sell reports. Maybe the data holder offers a predetermined set of reports, or maybe they allow requests.

The most sophisticated approach would be to use differential privacy. Clients are allowed to pose any query they wish, and a query manager automatically adds an amount of randomness to the results in proportion to the sensitivity of the query. All this is done automatically according to a mathematical model of privacy with no need for anyone to decide a priori which queries will be allowed.
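
Here’s a minimal sketch of the Laplace mechanism at the heart of differential privacy: the noise scale is the query’s sensitivity divided by the privacy parameter ε. A real query manager would also track a privacy budget across queries; this only shows the core idea.

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
        # more sensitive queries and smaller epsilon both mean more noise
        rng = rng or np.random.default_rng()
        return true_answer + rng.laplace(loc=0, scale=sensitivity/epsilon)

    # a count changes by at most 1 when one person is added or removed,
    # so counting queries have sensitivity 1
    print(laplace_mechanism(12345, sensitivity=1, epsilon=0.1))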

There are approaches conceptually between pre-determined reports and differential privacy, offering more flexibility than the former and being easier to implement than the latter. There’s a lot of room for creativity in this space.


[1] Being able to run ad hoc queries with no privacy budget is certainly simpler, in the same way that an all-you-can-eat buffet is simpler than ordering food à la carte. But it also means the price is higher. Deidentifying an entire data set entails more loss of accuracy than deidentifying a set of queries.

Identifiable to man or machine?

Like the previous post, this post riffs on a photo [1] I stumbled on while looking for something else.

Would it be easier to identify the man in this photo or the man whose photo appeared in the previous post, copied below?

I think it would be easier for a human to recognize the person in the first image. But what about a computer?

We humans identify people most easily by their faces, and especially by their eyes. These features are easier to see in the first photo. But what might we find if we applied some image processing to the two photos? Maybe the green man’s facial features could be exposed by some diligent processing. We see more of the second man’s body. Maybe a computer algorithm could extract more information out of the second image for this reason.

Photographs may, and often do, contain Exif (Exchangeable image file format) metadata, such as the GPS coordinates of the camera at the time the photo was taken. A photo taken with the lens cap on by mistake might contain a good deal of information about the subject even though the photo per se is useless. This information can be useful to the photographer, but it could also pose a privacy risk. Before posting a photo publicly, you might want to strip out the metadata.
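
For example, here’s a sketch using the Pillow library to inspect a photo’s Exif data and save a copy with the metadata left out. The file names are stand-ins.

    from PIL import Image
    from PIL.ExifTags import TAGS

    img = Image.open("photo.jpg")  # hypothetical file
    for tag_id, value in img.getexif().items():
        print(TAGS.get(tag_id, tag_id), value)

    # copy the pixel data into a fresh image, leaving the metadata behind
    clean = Image.new(img.mode, img.size)
    clean.putdata(list(img.getdata()))
    clean.save("photo_clean.jpg")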

As I noted in the previous post, the privacy risk from data depends on context. Suppose the metadata embedded in a photo contains the serial number of the camera. That serial number would not help most people identify a photo(grapher), but it would help someone who had access to a database linking serial numbers to the customers who purchased the cameras.

Was the first photo created by actually projecting the red and green lights onto the subject, or were these added in post production? For that matter, was there actually a man who posed for the photo or was the image synthetically generated? A forensic investigation of the photo might be able to answer these questions.

[1] Photo by Sebastian Mark on Unsplash