Yet another way to define fractional derivatives

Fractional integrals are easier to define than fractional derivatives. And for sufficiently smooth functions, you can use the former to define the latter.

The Riemann-Liouville fractional integral starts from the observation that for positive integer n,

I^n f(x) &\equiv& \int_a^{x} \int_a^{x_1} \cdots \int_a^{x_{n-1}} f(x_n)\,dx_n\, dx_{n-1} \cdots \,dx_1 \\ &=& \frac{1}{(n-1)!} \int_a^x (x-t)^{n-1} f(t)\, dt

This motivates a definition of fractional integrals

I^\alpha f(x) = \frac{1}{\Gamma(\alpha)} \int_a^x (x-t)^{\alpha-1} f(t)\, dt

which is valid for any complex α with positive real part. Derivatives and integrals are inverses for integer degree, and we use this to define fractional derivatives: the derivative of degree n is the integral of degree –n. So if we could define fractional integrals for any degree, we could define a derivative of degree α to be an integral of degree -α.

Unfortunately we can’t do this directly since our definition only converges in the right half-plane. But for (ordinary) differentiable f, we can integrate the Riemann-Liouville definition of fractional integral by parts:

I^\alpha f(x) = \frac{(x-a)^\alpha}{\Gamma(\alpha+1)} f(a) + I^{\alpha+1} f'(x)

We can use the right side of this equation to define the left side when the real part of α is bigger than -1. And if f has two ordinary derivatives, we can repeat this process to define fractional integrals for α with real part bigger than -2. We can repeat this process to define the fractional integrals (and hence fractional derivatives) for any degree we like, provided the function has enough ordinary derivatives.

See previous posts for two other ways of defining fractional derivatives, via Fourier transforms and via the binomial theorem.

Family tree numbering

When you draw a tree of your ancestors, things quickly get out of hand. There are twice as many nodes each time you go back a generation, and so the size of paper you need grows exponentially. Things also get messy because typically you know much more about some lines than others. If you know much about your ancestry, one big tree isn’t going to work.

Ahnentafel numbering system from 1590

There’s a simple solution to this problem, one commonly used in genealogy: assign everyone in the tree a number, starting with yourself as 1. Then follow two simple rules:

  1. The father of person n has number 2n.
  2. The mother of person n has number 2n + 1.

You can tell where someone fits into the tree easily from their number. Men have even numbers, women odd numbers. The number of someone’s child is half their number (rounding down if you get a fraction). For example, person 75 on your tree must be a woman. Her husband would be 74, her child 37, her father 150, etc.

Taking the logarithm base 2 tells you how many generations back someone is. That is, person n is ⌊ log2n ⌋ generations back. Going back to our example of 75, this person would be 6 generations back because log2 75 = 6.2288. (Here ⌊ x ⌋ is the “floor” of x, the largest integer less than x. Using the same notation, the child of n is ⌊ n/2 ⌋.)

Said another way, the people m generations back have numbers 2m through 2m+1 – 1. Your paternal line has numbers equal to powers of 2, and your maternal line has numbers one less than powers of 2.

If you write out a person’s number in binary, you stick a 0 on the end to find their father and a 1 on the end to find their mother. So your paternal grandmother, for example, would have number 101 in binary. Start with your number: 1. Then tack on a zero for your father: 10. Then tack on a 1 for his mother: 101.

In our example of 75 above, this number is 1001011 in binary. Leave off the one on the left, then read from left to right saying “father” every time you see a 0 and “mother” every time you see a 1. So person 75 is your father’s father’s mother’s father’s mother’s mother.

This numbering system goes back to at least 1590. In that year Michaël Eytzinger published the chart in the image above, giving the genealogy of Henry III of France.

Related posts:

Magic hexagon

The following figure is a magic hexagon: the numbers in any straight path through the figure add to 38, even though paths may have length three, four, or five.

I found this in Before Sudoku. The authors attribute it to Madachy’s Mathematical Recreations.

This is essentially the only magic hexagon filled with consecutive integers starting with one. The only others are rotations or reflections of this one, or the trivial case of a single hexagon.

Related posts:

Rigor and Vigor in Mathematics

I just started reading Frequency Analysis, Modulation and Noise by Stanford Goldman. The writing is strikingly elegant and clear. Here is a paragraph from the introduction.

Rigorous mathematics has a rightful place of honor in human thought. However, it has wisely been said that vigor is more important than rigor in the use of mathematics by the average man. In the particular case of this volume, the amount of rigor will be used that is necessary for a thorough understanding of the subject at hand by a radio engineer; but when it appears that rigor will confuse rather than clarify the subject for an engineer, we shall trust in the correctness of the results established by rigorous methods by the pure mathematicians and use them without the background of a rigorous proof.

The back of the book says  “Professor Goldman’s exposition is both mathematically and physically enlightening and it is unusually well written.” So far I agree.

(I found the 1967 Dover paperback reprint of the original 1948 hardback at a used book store. I looked at Dover’s site while writing this and it doesn’t seem to be in print.)

Two meanings of distribution

There are a couple common uses of the term distribution in math. The most familiar is probability distribution, such as a beta distribution, a Poisson distribution, etc. Less familiar but still common is distributions in the sense of generalized functions, like the Dirac delta distribution. Anybody with much exposure to math will have heard of a probability distribution. Generalized functions are common knowledge in some areas of math such as differential equations or harmonic analysis, but mathematicians in other areas, say graph theory, may not have heard of them.

This post briefly answers two questions:

  1. What is a distribution as in a generalized function?
  2. What does it have to do with a probability distribution?

Most of this post will deal with the first question, but we’ll circle back to the second question by the end.

You may have heard that a Dirac delta function δ(x) is an “infinitely concentrated” function or a point mass. Or you may have heard some of the rules for working with it, such as that it is infinite at the origin, zero everywhere else, and integrates to 1. But no function can actually do what the delta function is said to do. Measure theory will let functions take on actual infinite values, but the value of a function at a single point, even if that value is infinite, cannot matter to its integral. Even putting that aside, if you say δ(x) is infinite at 0 and integrates to 1, then how do you make sense of expressions like 2 δ? Is it twice as infinite at 0, whatever that means? Is it twice as zero everywhere else? And what on earth could it mean to take a derivative or Fourier transform of δ(x)?

Generalized functions are a way to define things like the δ distribution rigorously. They let you preserve some of the intuitive/magical properties you want while also giving rules to keep you from getting into trouble. Regarding the paragraph above, the theory will let you integrate, differentiate, and take the Fourier transform of δ(x) but it won’t let you do things like say that 2δ = δ since 2×0 = 0 and 2×∞ = ∞.

Generalized functions are just functions, but not functions of real numbers. They are linear functions that take other functions [1] and return real numbers. The functions they act on are typically called test functions. To reduce the confusion of having different kinds of functions under discussion, linear functions that act on other functions are usually called functionals. A functional is just a function, a linear function from test functions to real numbers, but it helps to give it a different name.

You can write the action of a distribution f on a test function φ as if it were an integral:

f : \varphi \mapsto \int f(x)\, \varphi(x) \, dx

If f is a function, you can take the integral literally. Distributions generalize functions by associating a function with the linear operator that acts on a test function by multiplying by it and integrating.

But distributions include other kinds of linear functionals, in which case the integral expression is not literal. The δ distribution, for example, acts on a test function φ by returning φ(0). And here’s the connection to the intuitive idea of a function infinitely concentrated at 0. If a function integrates to 1, and is very concentrated near 0, then its integral when multiplied by φ is approximately φ(0).  You could make this rigorous and define generalized functions as limits of functions, but that approach is something of a dead end. The theory is much simpler using the linear functional definition.

How does this let you differentiate things like the Dirac delta? In a nutshell, you take what is a theorem for ordinary functions and turn it into a definition for generalized functions. I explain this in more detail here.

So the theory of distributions lets you use your intuition regarding “infinitely concentrated” functions and such. It also lets you carry out formal calculations, such as differentiating or taking the Fourier transform of distributions. But it also keeps you out of trouble. Back to our example above, what does 2δ mean? It’s simply the linear functional that takes a test function φ and returns 2 φ(0).

Now what does all this have to do with probability distributions? You can think of a probability density function as something that exists to be integrated. You find the probability of some event (set) by integrating a probability density over it. You find the expected value of some function by multiplying that function by a probability distribution and integrating. Likewise you could think of distributions in the sense of generalized functions as things that exist to be integrated. They act on test functions by being integrated against them, or by doing things analogous to integration that are more general.

People sometimes get confused because they look at probability densities outside of integrals and try to think of them is probabilities. They’re not. They are things you integrate to get probabilities. A probability density can, for example, be larger than 1, but a probability cannot. Likewise people sometimes get confused when they think of generalized functions on their own. If you give the generalized function something to act on, you’re more likely to be guided into doing the right thing. Distributions, whether probability distributions or generalized functions, act on other things.

Click to learn more about consulting help with signal processing


[1] The space of test functions can vary. The most common choice is infinitely differentiable functions with compact support. But for Fourier analysis, the natural space of test functions consists of infinitely differentiable functions of rapid decay, i.e. functions φ such that xn φ(x) goes to zero as x goes to ±∞ for any positive integer n.

The reason is that the Fourier transform of such a function is another function of the same kind. Test functions of compact support aren’t suited for Fourier analysis because a function with compact support cannot have a Fourier transform with compact support. It’s related to the Heisenberg uncertainty principle: the more concentrated something is in the time domain, the less concentrated it is in the frequency domain. A signal can’t be time-limited and bandlimited.

Typesetting and computing continued fractions


The other day I ran across the following continued fraction for π.

\pi = 3 + \cfrac{1^2}{6+\cfrac{3^2}{6+\cfrac{5^2}{6+\cfrac{7^2}{6+\cdots}}}}

Source: L. J. Lange, An Elegant Continued Fraction for π, The American Mathematical Monthly, Vol. 106, No. 5 (May, 1999), pp. 456-458.

While the continued fraction itself is interesting, I thought I’d use this an example of how to typeset and compute continued fractions.


I imagine there are LaTeX packages that make typesetting continued fractions easier, but the following brute force code worked fine for me:

    \pi = 3 + \cfrac{1^2}{6+\cfrac{3^2}{6+\cfrac{5^3}{6+\cfrac{7^2}{6+\cdots}}}}

This relies on the amsmath package for the \cfrac command.


Continued fractions of the form

\pi = a_0 + \cfrac{b_1}{a_1+\cfrac{b_2}{a_2 +\cfrac{b_3}{a_3+\cfrac{b_4}{a_4+\cdots}}}}

can be computed via the following recurrence. Define A-1 = 1, A0 = a0, B-1 = 0, and B0 = 1. Then for k ≥ 1 define Ak+1 and Bk+1 by

A_{k+1} = a_{k+1}A_k + b_k A_{k-1} \\ B_{k+1} = a_{k+1}B_k + b_k B_{k-1}

Then the nth convergent the continued fraction is Cn = An / Bn.

The following Python code creates the a and b coefficients for the continued fraction for π above then uses a loop that could be used to evaluate any continued fraction.

    N = 20
    a = [3] + ([6]*N)
    b = [(2*k+1)**2 for k in range(0,N)]
    A = [0]*(N+1)
    B = [0]*(N+1)

    A[-1] = 1
    A[ 0] = a[0]
    B[-1] = 0
    B[ 0] = 1

    for n in range(1, N+1):
        A[n] = a[n]*A[n-1] + b[n-1]*A[n-2]
        B[n] = a[n]*B[n-1] + b[n-1]*B[n-2]
        print( n, A[n], B[n], A[n]/B[n] )

Python uses -1 as a shortcut to the last index of a list. I tack A-1 and B-1 on to the end of the A and B arrays to make the Python code match the math notation. This is either clever or a dirty hack, depending on your perspective.

Back to pi

You may notice that these approximations for π are not particularly good. It’s a trade-off for having a simple pattern to the coefficients. The continued fraction for π that has all b‘s equal to 1 has a complicated set of a‘s with no discernible pattern: 3, 7, 15, 1, 292, 1, 1, etc. However, that continued fraction produces very good approximations. If you replace the first three lines of the code above with that below, you’ll see that four iterations produces an approximation to π good to 10 decimal places.

    N = 4
    a = [3, 7, 15, 1, 292]
    b = [1]*N

Timidity about approximating

“Nature does not consist entirely, or even largely, of problems designed by a Grand Examiner to come out neatly in finite terms, and whatever subject we tackle the first need is to overcome timidity about approximating.”

H. and B. S. Jeffreys, Methods of Mathematical Physics, 2nd ed., Cambridge University Press, 1950, p. 8.

Related post: Just an approximation

It all boils down to linear algebra

When I was in college, my view of applied math was something like the following.

Applied math is mostly mathematical physics. Mathematical physics is mostly differential equations. Numerical solution of differential equations boils down to linear algebra. Therefore the heart of applied math is linear algebra.

I still think there’s a lot of truth in the summary above. Linear algebra is very important, and a great deal of applied math does ultimately depend on efficient solutions of large linear systems. The weakest link in the argument may be the first one: there’s a lot more to applied math than mathematical physics. Mathematical physics hasn’t declined, but other areas have grown. Still, areas of applied math outside of mathematical physics and outside of differential equations often depend critically on linear algebra.

I’d certainly recommend that someone interested in applied math become familiar with numerical linear algebra. If you’re going to be an expert in differential equations, or optimization, or many other fields, you need to be at leas familiar with numerical linear algebra if you’re going to compute anything. As Stephen Boyd points out in his convex optimization class, many of the breakthroughs in optimization over the last 20 years have at their core breakthroughs in numerical linear algebra. Improved algorithms have sped up the solution of very large systems more than Moore’s law has.

It may seem questionable to say that linear algebra is at the heart of applied math because it’s linear. What about nonlinear applications, such as nonlinear PDEs? Nonlinear differential equations lead to nonlinear algebraic equations when discretized. But these nonlinear systems are solved via iterations of linear systems, so we’re back to linear algebra.

Click to find out more about consulting for numerical computing


Splitting proofs in two

“Ever since Euclid, mathematical proofs have served a dual purpose: certifying that a statement is true and explaining why it is true. In the future these two epistemological functions may be divorced. In the future, the computer assistant may take care of the certification and leave the mathematician to look for an explanation that humans can understand.”

Dana Mackenzie, “What in the Name of Euclid Is Going On Here?”, Science, 2005


Permutations and tests

Suppose a test asks you to place 10 events in chronological order. Label these events A through J so that chronological order is also alphabetical order.

If a student answers BACDEFGHIJ, then did they make two mistakes or just one? Two events are in the wrong position, but they made one transposition error. The simplest way to grade such a test would be to count the number of events that are in the correct position. Is this the most fair way to grade?

If you decide to count how many transpositions are needed to correct a student’s answer, do you count any transposition or only adjacent transpositions? For example, if someone answered JBCDEFGHIA, then transposing the A and the J is enough to put the results in order. But reversing the first and last event seems like a bigger mistake than reversing the first two events. Counting only adjacent transpositions would penalize this mistake more. You would have to swap the J with each of the eight letters between J and A. But it hardly seems that answering JBCDEFGHIA is eight times worse than answering BACDEFGHIJ.

Maybe counting transpositions is too much work. So we just go back to counting how many events are in the right place. But then suppose someone answers JABCDEFGHI. This is completely wrong since every event is in the wrong position. But the student obviously knows something, since the relative order of nearly all of the events is correct. From one perspective there was only one mistake: J comes last, not first.

What is the worst possible answer? Maybe getting the order exactly backward? If you have an odd number of events, then getting the order backward means one event is in the right place, and so that doesn’t receive the lowest possible score.

This is an interesting problem beyond grading exams. (As for grading exams, I’d suggest simply not using questions of this type on an exam.) In manufacturing, how serious a mistake is it to reverse two consecutive components versus two distant components? You could also ask the same question when comparing DNA sequences or other digital signals. The best way to assign a distance between the actual and desired sequence would depend entirely on context.

Fibonacci formula for pi

Here’s an unusual formula for pi based on the product and least common multiple of the first m Fibonacci numbers.


\pi = \lim_{m\to\infty} \sqrt{\frac{6 \log F_1 \cdots F_m}{\log \mbox{lcm}( F_1, \ldots, F_m )}}

Unlike the formula I wrote about a few days ago relating Fibonacci numbers and pi, this one is not as simple to prove. The numerator inside the root is easy enough to estimate asymptotically, but estimating the denominator depends on the distribution of primes.

Source: Yuri V. Matiyasevich and Richard K. Guy, A new formula for π, American Mathematical Monthly, Vol 93, No. 8 (October 1986), pp. 631-635.


Fibonacci numbers, arctangents, and pi

Here’s an unusual formula for π. Let Fn be the nth Fibonacci number. Then

\pi = 4 \sum_{n=1}^\infty \arctan\left( \frac{1}{F_{2n+1}} \right)

As mysterious as this equation may seem, it’s not hard to prove. The arctangent identity

\arctan\left(\frac{1}{F_{2n+1}}\right) = \arctan\left(\frac{1}{F_{2n}}\right) - \arctan\left(\frac{1}{F_{2n+2}}\right)

shows that the sum telescopes, leaving only the first term, arctan(1) = π/4. To prove the arctangent identity, take the tangent of both sides, use the addition law for tangents, and use the Fibonacci identity

F_{n+1} F_{n-1} - F_n^2 = (-1)^n

See this post for an even more remarkable formula relating Fibonacci numbers and π.

Number of digits in n!

The other day I ran across the fact that 23! has 23 digits. That made me wonder how often n! has n digits.

There can only be a finite number of cases, because n! grows faster than 10n for n > 10, and it’s reasonable to guess that 23 might be the largest case. Turns out it’s not, but it’s close. The only cases where n! has n digits are 1, 22, 23, and 24. Once you’ve found these by brute force, it’s not hard to show that they must be the only ones because of the growth rate of n!.

Is there a convenient way to find the number of digits in n! without having to compute n! itself? Sure. For starters, the number of digits in the base 10 representation of a number x is

⌊ log10 x ⌋ + 1.

where ⌊ z ⌋ is the floor of z, the largest integer less than or equal to z. The log of the factorial function is easier to compute than the factorial itself because it won’t overflow. You’re more likely to find a function to compute the log of the gamma function than the log of factorial, and more likely to find software that uses natural logs than logs base 10. So in Python, for example, you could compute the number of digits with this:

from scipy.special import gammaln
from math import log, floor

def digits_in_factorial(n):
    return floor( gammaln(n+1)/log(10.0) ) + 1

What about a more elementary formula, one that doesn’t use the gamma function? If you use Stirling’s approximation for factorial and take log of that you should at least get a good approximation. Here it is again in Python:

from math import log, floor, pi

def stirling(n):
    return floor( ((n+0.5)*log(n) - n + 0.5*log(2*pi))/log(10) ) + 1

The code above is exact for every n > 2 as far as I’ve tested, up to n = 1,000,000. (Note that one million factorial is an extremely large number. It has 5,565,709 digits. And yet we can easily say something about this number, namely how many digits it has!)

The code may break down somewhere because the error in Stirling’s approximation or the limitations of floating point arithmetic. Stirling’s approximation gets more accurate as n increases, but it’s conceivable that a factorial value could be so close to a power of 10 that the approximation error pushes it from one side of the power of 10 to the other. Maybe that’s not possible and someone could prove that it’s not possible.

You could extend the code above to optionally take another base besides 10.

def digits_in_factorial(n, b=10):
    return floor( gammaln(n+1)/log(b) ) + 1

def stirling(n, b=10):
    return floor( ((n+0.5)*log(n) - n + 0.5*log(2*pi))/log(b) ) + 1

The code using Stirling’s approximation still works for all n > 2, even for b as small as 2. This is slightly surprising since the number of bits in a number is more detailed information than the number of decimal digits.