New Twitter account for functional programming and categories

I’m starting a new Twitter account @FunctorFact for functional programming and category theory.

These two subjects have a lot of overlap, so some content will be strictly about programming, some strictly about math, and some combining ideas from both.

FunctorFact icon

Prime factors, phone numbers, and the normal distribution

Telephone numbers typically have around three distinct prime factors.

The length of a phone number varies by country, but a US phone number is a 10-digit number, and 10-digit numbers are often divisible by three different prime numbers, give or take a couple. Assuming phone numbers are scattered among possible 10-digit numbers in a way that doesn’t bias their number of prime factors, these numbers will often have between 1 and 5 prime factors. If a country has 9- or 11-digit phone numbers, the result is essentially the same.

Let ω(n) be the number of distinct prime factors of n. Then the Erdős–Kac theorem says roughly that ω(n) is distributed like a normal random variable with mean and variance log log n. More precisely, fix two numbers a and b. For a given value of x, count the proportion of positive integers less than x where

(ω(n) − log log n) / √(log log n)

is between a and b. Then in the limit as x goes to infinity, this proportion approaches the probability that a standard normal random variable is between a and b.

So by that heuristic, the number of distinct prime factors of a 10-digit number is approximately normally distributed with mean and variance log log 10^10 ≈ 3.14, and such a distribution is between 1 and 5 around 74% of the time.
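
Here’s a quick check of that probability, assuming SciPy is available; this snippet isn’t part of the original post:

    from math import log, sqrt
    from scipy.stats import norm

    mu = log(log(1e10))       # mean and variance, about 3.14
    sigma = sqrt(mu)          # standard deviation
    # Probability that a normal(mu, sigma) variable falls between 1 and 5
    print(norm.cdf(5, mu, sigma) - norm.cdf(1, mu, sigma))   # about 0.74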

My business phone number, for example, is 8324228646. Obviously this is divisible by 2. In fact it equals 2 × 3² × 462457147, and so it has exactly three distinct prime factors: 2, 3, and 462457147.

Here’s how you could play with this using Python.

    from sympy.ntheory import factorint

    def omega(n):
        # Number of distinct prime factors of n
        return len(factorint(n))

I looked in SymPy and didn’t see an implementation of ω(n) directly, but it does have a function factorint that returns the prime factors of a number, along with their multiplicities, in a dictionary. So ω(n) is just the size of that dictionary.
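
For example, applying this to the phone number above:

    >>> factorint(8324228646)
    {2: 1, 3: 2, 462457147: 1}
    >>> omega(8324228646)
    3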

I took the first 20 phone numbers in my contacts and ran them through omega and got results consistent with what you’d expect from the theory above. One was prime, and none had more than five factors.

Bar chart of number of prime factors in a sample of phone numbers, with heights [1, 3, 5, 8, 3]

Five lemma, ASCII art, and Unicode

A few days ago I wrote about creating ASCII art in Emacs using ditaa. Out of curiosity, I wanted to try making the Five Lemma diagram. [1]

The examples on the ditaa site all have arrows between boxes, but you don’t have to have boxes.

Here’s the ditaa source:

A₀ ---> A₁ ---> A₂ ---> A₃ ---> A₄
|       |       |       |       |            
| f₀    | f₁    | f₂    | f₃    | f₄    
|       |       |       |       |      
v       v       v       v       v      
B₀ ---> B₁ ---> B₂ ---> B₃ ---> B₄

and here’s the image it produces:

Five lemma diagram

It’s not pretty. You could make a nicer image with LaTeX. But as the old saying goes, the remarkable thing about a dancing bear is not that it dances well but that it dances at all.

The trick to getting the subscripts is to use Unicode characters 0x208n for subscript n. As I noted at the bottom of this post, ditaa isn’t strictly limited to ASCII art. You can use Unicode characters as well. You may or may not be able to see the subscripts in the source code above, since they are not part of the most widely supported set of characters.
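
For example, here’s how you could generate those subscript characters from their code points in Python; a quick illustration, not from the original post:

    >>> "".join(chr(0x2080 + n) for n in range(5))
    '₀₁₂₃₄'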

* * *

[1] The Five Lemma is a diagram-chasing result from homological algebra. It lets you infer properties of the middle function f₂ from properties of the other f’s.

Benford’s law, chi-square, and factorials

A while back I wrote about how the leading digits of factorials follow Benford’s law. That is, if you look at just the first digit of a sequence of factorials, they are not evenly distributed. Instead, 1’s are most popular, then 2’s, etc. Specifically, the proportion of factorials starting with n is roughly log10(1 + 1/n).

Someone has proved that the limiting distribution of leading digits of factorials exactly satisfies Benford’s law. But if we didn’t know this, we might use a chi-square statistic to measure how well the empirical results match expectations. As I argued in the previous post, statistical tests don’t apply here, but they can be useful anyway in giving us a way to measure the size of the deviations from theory.

Benford’s law makes a better illustration of the chi-square test than the example of prime remainders because the bins are unevenly sized, which they’re allowed to be in general. In the prime remainder post, they were all the same size.

The original post on leading digits of factorials explains why we compute the leading digits the way we do. Only one detail has changed: the original post used Python 2 and this one uses Python 3. Integer division was the default in Python 2, but in Python 3 we have to use // to explicitly ask for integer division, floating point division being the new default.

Here’s a plot of the distribution of the leading digits for the first 500 factorials.

And here’s code to compute the chi-square statistic:

    from math import factorial, log10

    def leading_digit_int(n):
        while n > 9:
            n = n//10
        return n

    def chisq_stat(O, E):
        return sum( [(o - e)**2/e for (o, e) in zip(O, E)] )

    # Waste the 0th slot to make the code simpler.
    digits = [0]*10

    N = 500
    for i in range(N):
        digits[ leading_digit_int( factorial(i) ) ] += 1

    expected = [ N*log10(1 + 1/n) for n in range(1, 10) ]

    print( chisq_stat(digits[1:], expected) )

This gives a chi-square statistic of 7.693, very near the mean value of 8 for a chi-square distribution with eight degrees of freedom. (There are eight degrees of freedom, not nine, because if we know how many factorials start with the digits 1 through 8, we know how many start with 9.)
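
If you want to attach a probability to that statistic, you could use SciPy’s chi-square distribution functions. This quick check isn’t part of the original code:

    from scipy.stats import chi2

    # Probability of a statistic this large or larger, i.e. a fit this bad
    # or worse; it comes out near 1/2, entirely unremarkable.
    print(chi2.sf(7.693, 8))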

So the chi-square statistic suggests that the deviation from Benford’s law is just what we’d expect from random data following Benford’s law. And as we said before, this suggestion turns out to be correct.


Hypothesis testing and number theory

This post uses a hypothesis test for proportions to look at a couple conjectures in number theory. It is similar to my earlier post on the chi-square test and prime remainders. You could read this as a post on statistics or a post on number theory, depending on which you’re less familiar with.

Using statistical tests on number theory problems is kind of odd. There’s nothing random going on, so in that sense the whole enterprise is unjustified. Nevertheless, statistical tests can be suggestive. They certainly don’t prove theorems, but they can give reason to suspect a theorem is true or false. In that sense, applying statistical tests to number theory isn’t all that different from applying them to more traditional settings.

First we’ll look at the remainders of primes modulo 4. Except for 2, all primes are odd, and so they have remainder either 1 or 3 when divided by 4. Brian Hayes wrote recently that Chebyshev noticed in the 1850s that there seem to be more primes with remainder 3. Is the imbalance larger than one would expect to see from fair coin tosses?

Here’s some Python code to find the proportion of the first million primes (after 2) that have remainder 3 when divided by 4.

    from sympy import prime
    from math import sqrt

    # Count how many of the first million odd primes have remainder 3 mod 4
    n = 1000000
    rem3 = 0
    for i in range(2, n+2):
        if prime(i) % 4 == 3:
            rem3 += 1
    p_hat = rem3/n

This shows that of the first million odd primes, 500,202 are congruent to 3 mod 4. Would it be unusual for a coin to come up heads this many times in a million flips? To find out we’d compute the z-statistic:

z = (p̂ − p) / √(pq/n)

Here p is the proportion under the null hypothesis, q = 1 – p, and n is the sample size. In our case, the null hypothesis is p = 0.5 and n = 1,000,000. [1]

The code

    p = 0.5
    q = 1 - p
    z = (p_hat - p)/sqrt(p*q/n)

shows that z = 0.404, hardly a suspiciously large value. If we were looking at random values, we’d see a z-score this large or larger 34% of the time. So this test doesn’t suggest much in this case.
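
The 34% figure is the upper tail probability of the standard normal distribution, which you could compute directly. A quick check, assuming SciPy:

    from scipy.stats import norm
    print(norm.sf(0.404))   # about 0.34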

* * *

[1] The derivation of the z statistic is fairly quick. If the probability of success is p, then the number of successes out of n trials is binomial(n, p). For large n, this has approximately the same distribution as a normal distribution with the same mean and variance, mean np and variance npq. The proportion of successes then has approximately mean p and standard deviation √(pq/n). Subtracting the mean and dividing by the standard deviation normalizes the distribution to have mean 0 and variance 1. So under the null hypothesis, the z statistic has a standard normal distribution.

ASCII art diagrams in Emacs org-mode

Yesterday I wrote about ASCII art diagrams and gave four reasons you might want to use this ancient approach to creating simple diagrams:

  • It could be quicker than creating a graphical image.
  • You can paste them into plain text documents like source code files.
  • They can be produced programmatically.
  • There is software to turn ASCII art into more polished images.

Today I’ll post a few notes about how to create graphical versions of ASCII diagrams in Emacs with org-mode.

Running code inside org-mode

You can embed and execute source code in org-mode files. I wrote a couple posts about this, one showing how to run Python and R inside org-mode and another showing how to mix languages in org-mode. The latter shows Perl calling R and Python, all in 14 lines of code.

There are currently 39 programming languages that org-mode can call by default. In addition to conventional programming languages like the ones listed above, org-mode also supports ditaa, which treats ASCII art as a specification language for producing graphics.

You can edit code blocks just as you would other text in an org-mode file. But if you’d like to edit a code block in its language’s mode, type C-c ' from inside the code block. If you’re editing Python code, for example, this will open a new window, in Python mode, containing just that code block. If you type C-c ' inside a ditaa code block, Emacs opens a new window in “artist mode,” a mode specifically for editing ASCII art.

You can run code inside org-mode two ways: interactively and for export. With your cursor inside a block of code, type C-c C-c to execute the code and report the results. You can also export an entire org-mode document and have the results of code execution embedded in the final document. This works much the same as “*weave” projects like Sweave, Pweave, and Hweave. But while each of these is specific to a particular programming language (R, Python, and Haskell respectively), org-mode works with dozens of languages, including ditaa.

Running ditaa inside org-mode

You embed ditaa code just like you’d embed any other code. In the first post mentioned above, I gave this example of calling R:

    #+begin_src R
    sqrt(42)
    #+end_src

Here’s the analogous code for ditaa:

    #+BEGIN_SRC ditaa :file foo.png
    +-------+
    | Hello |
    +-------+
    #+END_SRC

The markers to begin and end a source code segment can be upper or lower case. I used lower case in my previous post, but it may be more readable to use upper case so that the markers stand out better from their arguments.

The R example above didn’t use any header arguments, though it could have. With ditaa, you must provide a header argument: the name of the file to save the graphics in.

If you run the ditaa code by navigating inside the code block and running C-c C-c, Emacs will add a couple lines after the code block:

    #+RESULTS:
    [[file:foo.png]]

This is the literal text, what you’d see if you opened the file in another editor. But org-mode uses double brackets for links, so inside Emacs you wouldn’t see the brackets themselves. Instead you’d see a hyperlink with the text file:foo.png. Clicking on that link opens the image.

Hello image produced by ditaa

You can export the org-mode file with the command C-c C-e. This brings up a menu of export options: h for HTML, l for LaTeX, etc. Within each of these are further options. For example, you could type l for LaTeX and then type l again to select export to LaTeX, or type p to have org-mode run LaTeX on the file and produce a PDF. If you know you want a PDF, you could do this all in one command: C-c C-e l p.

You can control whether org-mode exports code or the results of running code (or both or neither). Org-mode exports the results of ditaa code, i.e. graphics files, by default. This makes sense: your PDF file will have a nice image version of your diagram, not the ASCII art used as its source code.
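
The :exports header argument is what controls this. For example, to include both the ditaa source and the resulting image in the exported document, you could write (reusing the block from above):

    #+BEGIN_SRC ditaa :file foo.png :exports both
    +-------+
    | Hello |
    +-------+
    #+END_SRC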

Configuration and troubleshooting

By default, the only programming language you can run inside org-mode is Emacs Lisp. This makes sense. You want to be deliberate about what code you run, but if you don’t want to run Emacs Lisp you’d better not run Emacs!

Inside your Emacs configuration file, you specify what languages org-mode is allowed to run. Here’s an example allowing Python and ditaa:

    
    (org-babel-do-load-languages
      'org-babel-load-languages
      '((python . t)
        (ditaa . t)))

Recent versions of Emacs are supposed to ship with ditaa, but the ditaa jar file was missing from the two computers I experimented with. Emacs complained

    org-babel-execute:ditaa: Could not find ditaa.jar at ...

and so I copied ditaa.jar to the place where Emacs said it was looking for it. That worked, but it’s kind of a kludge because now I had two copies of the jar file. A better solution is to tell Emacs where you already have ditaa.

I like to use the exact same init.el file on every machine I use and so I added a section where I put OS-specific and machine-specific configuration. Here I put a different path to ditaa for each OS.

    ;;----------------------------------------------------------------------
    ;; OS specific

    (cond
        ((string-equal system-type "windows-nt") ; Microsoft Windows
            (progn
                (setq-default ispell-program-name "C:/bin/Aspell/bin/aspell.exe")
                ;; etc.
                (setq org-ditaa-jar-path "c:/bin/ditaa/ditaa.jar")
            )
        )
        ((string-equal system-type "gnu/linux") ; Linux
            (progn
                (setq x-select-enable-clipboard t)
                ;; etc.
                (setq org-ditaa-jar-path "/usr/bin/ditaa")
            )
        )
    )

ASCII art diagrams

“Technology is additive.” — Kevin Kelly

Old technologies never die. Instead, their range of application shrinks. Or maybe it grows when conditions change.

ASCII art, drawing pictures with fixed-width plain text characters, is no longer how many people want to produce diagrams. Just fire up Adobe Illustrator and you get incomparably more resolution of expression.

And yet there are times when ASCII art comes in handy. You can, for example, paste it into source code files. Someone more familiar with Emacs than Illustrator may be able to produce a simple diagram in the former faster than the latter. And it can be relatively easy to programmatically produce a large number of ASCII art diagrams, depending on the nature of the diagrams.

It’s also possible to use ASCII art as a way to specify nicely rendered images. I’ll show how to do that with ditaa below.

Here’s an ASCII version of the conjugate prior diagram I made some time ago:

 

                          +-------------+
                          |             |
                          | Exponential |
                          |             |
                          +-------------+
                                 |
                          lambda |                                                          
                                 v
+-------------+           +-------------+           +-------------+
|             |   tau     |             |   lambda  |             |
|  Lognormal  |---------->|    Gamma    |<----------|   Poisson   |
|             |           |             |---+       |             |
+-------------+           +-------------+   |       +-------------+
      |                          ^   ^      | beta
      |                          |   |      |
      |                          |   +------+
      |                      tau |           
      |                          |           
      |                   +-------------+   
      |        mu         |             |   
      +------------------>|    Normal   |
                          |             |----+
                          +-------------+    | 
                                     ^       | mu
                                     |       |
                                     +-------+

And here’s the image produced by ditaa processing the ASCII diagram above:

Conjugate prior diagram produced by ditaa

Update: See my next post on how to create ASCII art diagrams and their graphic version from ditaa using Emacs org mode.

Update: When I first made the diagram above, I tried using Greek letters, e.g. using β rather than “beta,” but this didn’t work. I thought “OK, I suppose it’s not really ASCII art if you use non-ASCII characters.” But someone told me Unicode characters worked for him, so I tried again when I wrote the follow up post and it worked.

My first attempt, from a Windows laptop, calling ditaa from the command line, did not work. My second attempt, running inside org-mode from a Windows desktop, did work. My third attempt, running from Emacs on Linux also worked.

ditaa diagram using Unicode symbols

Interview with Chris Toomey of Upcase

The other day I spoke to Chris Toomey from thoughtbot. Chris runs Upcase, thoughtbot’s online platform for learning about Rails, test-driven development, clean code, and more. I was curious about his work with Ruby on Rails since I know little about that world. And at a little deeper level, I wanted to get his thoughts on how programming languages are used in practice, static vs dynamic, strongly typed vs weakly typed, etc.

Chris Toomey

JC: Chris, I know you do a lot of work with Ruby on Rails. What do you think of Ruby without Rails? Would you be as interested in Ruby if the Rails framework had been written in some other language?

CT: Let me back up a little bit and give you some of my background. I started out as an engineer and I used VB because it was what I had for the task at hand. Then when I decided to buckle down and become a real developer I chose Python because it seemed like the most engineering-oriented alternative. It seemed less like an enterprise language, more small and nimble. I chose Python over Ruby because of my engineering background. Python seemed more serious, while Ruby seemed more like a hipster language. Ruby sounded frivolous, but I kept hearing good things about it, especially with Rails. So like a lot of people I came to Ruby through Rails. It was the functionality and ease of use that got me hooked, but I do love Ruby as a language, the beauty and expressiveness of it. It reads more like prose than other languages. It’s designed for people rather than machines. But it’s also a very large language and hard to parse because of that. Over time though I’ve seen people abuse the looseness, the freedom in Ruby, and that’s caused me to look at stricter options like Haskell and other functional languages.

JC: I only looked at Ruby briefly, and when I saw the relative number of numerical libraries for Python and Ruby I thought “Well, looks like it’s Python for me.”

It seems like Ruby bears some resemblance to Perl, for better or worse.

CT: Absolutely. Ruby has two spiritual ancestors. One is Perl and the other is Smalltalk. I think both of those are great, and many of the things I love about Ruby come from that lineage. Perl contributed the get-things-done attitude, the looseness and terseness, the freedom to interact at any level of abstraction.

It’s kinda odd. I keep coming back to The Zen of Python. One of the things it says is that explicit is better than implicit, and I really think that’s true. And yet I work in Ruby and Rails where implicit is the name of the game. So I have some cognitive dissonance over that. I love Ruby on Rails, but I’m starting to look at other languages and frameworks to see if something else might fit as well.

JC: Do you have the freedom to choose what language and framework you work in? Do clients just ask for a web site, or do they dictate the technology?

CT: We have a mix. A lot of clients just want a web app, but some, especially large companies, want us to use their technology stack. So while we do a lot of Rails, we also do some Python, Haskell, etc.

JC: Do you do everything soup-to-nuts or do you have some specialization?

CT: We have three roles at thoughtbot: designer, web developer, and mobile developer. The designers might do some JavaScript, but they mostly focus on user experience, testing, and design.

JC: How do you keep everything straight? The most intimidating thing to me about web development is all the diverse tools in play: the language for your logic, JavaScript, CSS, HTML, SQL, etc.

CT: There’s definitely some of that, but we outsource some parts of the stack. We host applications on Heroku, giving them responsibility for platform management. They run on top of AWS so they handle all the scaling issues so we can focus on the code. We’ll deploy to other environments if our client insists, but our preference is to go with Heroku.

Similarly, Rails has a lot of functionality for the database layer, so we don’t write a lot of SQL by hand. We’re all knowledgeable of SQL, but we’re not DBA-level experts. We scale up on that as necessary, but we want to focus on the application.

JC: Shifting gears a little bit, how do you program differently in a dynamic language like Ruby than you would in a stricter language like C++? And is that a good thing?

CT: One thing about Ruby, and dynamic languages in general, is that testing becomes all the more critical. There are a lot of potential runtime errors you have to test for. Whereas with something like Haskell you can program a lot of your logic into the type system. Ruby lets you work more freely, but Haskell leads to more robust applications. Some of our internal software at thoughtbot is written in Haskell.

JC: I was excited about using Haskell, but when I used it on a production project I ran into a lot of frustrations that you wouldn’t anticipate from working with Haskell in the small.

CT: Haskell does seem to have a more aggressive learning curve than other languages. There’s a lot of academia in it, and in a way that’s good. The language hasn’t compromised its vision, and it’s been able to really develop some things thoroughly. But it also has a kind of academic heaviness to it.

There’s a language out there called Elm that’s inspired by Haskell and the whole ML family of languages that compiles down to JavaScript. It presents a friendlier interface to the whole type-driven, functional way of thinking. The developers of the language have put a lot of effort into making it approachable, without having to understand comonads and functors and all that.

JC: My difficulties with Haskell weren’t the theory but things like the lack of tooling and above all the difficulty of package management.

CT: Cabal Hell.

JC: Right.

CT: My understanding is that that’s improved dramatically with new technologies like Stack. We’re scaling up internally on Haskell. That’s the next area we’d like to get into. I’ll be able to say more about that down the road.

* * *

Check out Upcase for training materials on tools like Vim, Git, Rails, and Tmux.

Insertion sort as a fold

I’ve written several posts lately about various algorithms that can be expressed as functional folds.

These have all been numerical algorithms. Insertion sort is an example of a non-numerical algorithm that could be implemented as a fold.

Insertion sort is not the fastest sorting algorithm. It takes O(n²) operations to sort a list of size n while the fastest algorithms take O(n log n) operations. But insertion sort is simple, and as it works its way down a list, the portion it has processed is sorted. If you interrupt it, you have correct results given the input so far. If you interrupt something like a quicksort, you don’t have a sorted list. If you’re receiving a stream of data points and want to sort them as they come, you have to use insertion sort rather than something like quicksort.

The function to fold over a list looks like this:

    f as a = [x | x <- as, x < a] ++ [a] ++ [x | x <- as, x >= a]

Given a sorted list as and a value a, return a new sorted list that has inserted the element a in the proper position. Our function f does this by joining together the list of the elements less than a, the list containing only a, and the list of elements at least as big as a.

Here’s how we could use this to alphabetize the letters in the word “example.”

    foldl f [] "example"

This returns "aeelmpx".

Haskell takes our function f and an empty list of characters [] and returns a new list of characters by folding f over the list of characters making up the string "example".

You can always watch how foldl works by replacing it with scanl to see intermediate results.

    scanl f [] "example"

returns

    ["", "e", "ex", "aex", "aemx", "aempx", "aelmpx", "aeelmpx"]

Prime remainders too evenly distributed

First Brian Hayes wrote an excellent post about the remainders when primes are divided by other primes. Then I wrote a follow-on focusing on just the first part of his post: simulating dice rolls by keeping the remainder when consecutive primes are divided by a fixed prime. He mostly looked at pairs of primes, but I wanted to look at this in more detail. For example, using a sequence of primes larger than 7 and taking their remainders by 7 creates values from 1 to 6.

The results are evenly distributed with some variation, just like dice rolls. In fact, given these results and results from a set of real dice rolls, most people would probably think the former are real because they’re more evenly distributed. A chi-squared goodness of fit test shows that the results are too evenly distributed compared to real dice rolls.

At the end of my previous post, I very briefly discuss what happens when you look at “dice” with more than six sides. Here I’ll go into a little more detail and look at a large number of examples.

In short, you either get a suspiciously good fit or a terrible fit. If you look at the remainder when dividing primes by m, you get values between 1 and m-1. You can’t get a remainder of 0 because primes aren’t divisible by m (or anything else!). If m itself is prime, then you get all the numbers between 1 and m-1, and as we’ll show below you get them in very even proportions. But if m isn’t prime, there are some remainders you can’t get.

The sequence of remainders looks random in the sense of being unpredictable. (Of course it is predictable by the algorithm that generates them, but it’s not predictable in the sense that you could look at the sequence out of context and guess what’s coming next.) The sequence is biased, and that’s the big news. Pairs of consecutive primes have correlated remainders. But I’m just interested in showing a different departure from a uniform distribution, namely that the results are too evenly distributed compared to random sequences.

The table below gives the chi-square statistic and p-value for each of several primes. For each prime p, we take remainders mod p of the next million primes after p and compute the chi-square goodness of fit statistic with p-2 degrees of freedom. (Why p-2? There are p-1 different remainders, and the chi-square test for k possibilities has k-1 degrees of freedom.)

The p-value column gives the probability of seeing a fit this good or better from uniform random data. (The p in p-value is unrelated to our use of p to denote a prime. It’s an unfortunate convention of statistics that everything is denoted p.) After the first few primes, the p-values are extremely small, indicating that such an even distribution of values would be astonishing from random data. A sketch of code that could generate the table follows it.

|-------+------------+------------|
| Prime | Chi-square | p-value    |
|-------+------------+------------|
|     3 |     0.0585 |   2.88e-02 |
|     5 |     0.0660 |   5.32e-04 |
|     7 |     0.0186 |   1.32e-07 |
|    11 |     0.2468 |   2.15e-07 |
|    13 |     0.3934 |   6.79e-08 |
|    17 |     0.5633 |   7.64e-10 |
|    19 |     1.3127 |   3.45e-08 |
|    23 |     1.1351 |   2.93e-11 |
|    29 |     1.9740 |   3.80e-12 |
|    31 |     2.0052 |   3.11e-13 |
|    37 |     2.5586 |   3.92e-15 |
|    41 |     3.1821 |   9.78e-16 |
|    43 |     4.4765 |   5.17e-14 |
|    47 |     3.7142 |   9.97e-18 |
|    53 |     3.7043 |   3.80e-21 |
|    59 |     7.0134 |   2.43e-17 |
|    61 |     5.1461 |   6.45e-22 |
|    67 |     7.1037 |   5.38e-21 |
|    71 |     7.6626 |   6.13e-22 |
|    73 |     7.5545 |   4.11e-23 |
|    79 |     8.0275 |   3.40e-25 |
|    83 |    12.1233 |   9.92e-21 |
|    89 |    11.4111 |   2.71e-24 |
|    97 |    12.4057 |   2.06e-26 |
|   101 |    11.8201 |   3.82e-29 |
|   103 |    14.4733 |   3.69e-26 |
|   107 |    13.8520 |   9.24e-29 |
|   109 |    16.7674 |   8.56e-26 |
|   113 |    15.0897 |   1.20e-29 |
|   127 |    16.4376 |   6.69e-34 |
|   131 |    19.2023 |   6.80e-32 |
|   137 |    19.1728 |   1.81e-34 |
|   139 |    22.2992 |   1.82e-31 |
|   149 |    22.8107 |   6.67e-35 |
|   151 |    22.8993 |   1.29e-35 |
|   157 |    30.1726 |   2.60e-30 |
|   163 |    26.5702 |   3.43e-36 |
|   167 |    28.9628 |   3.49e-35 |
|   173 |    31.5647 |   7.78e-35 |
|   179 |    33.3494 |   2.46e-35 |
|   181 |    36.3610 |   2.47e-33 |
|   191 |    29.1131 |   1.68e-44 |
|   193 |    29.9492 |   2.55e-44 |
|   197 |    34.2279 |   3.49e-41 |
|   199 |    36.7055 |   1.79e-39 |
|   211 |    41.0392 |   8.42e-40 |
|   223 |    39.6699 |   1.73e-45 |
|   227 |    42.3420 |   2.26e-44 |
|   229 |    37.1896 |   2.02e-50 |
|   233 |    45.0111 |   4.50e-44 |
|   239 |    43.8145 |   2.27e-47 |
|   241 |    51.3011 |   1.69e-41 |
|   251 |    47.8670 |   6.28e-48 |
|   257 |    44.4022 |   1.54e-53 |
|   263 |    51.5905 |   7.50e-49 |
|   269 |    59.8398 |   3.92e-44 |
|   271 |    59.6326 |   6.02e-45 |
|   277 |    52.2383 |   2.80e-53 |
|   281 |    52.4748 |   1.63e-54 |
|   283 |    64.4001 |   2.86e-45 |
|   293 |    59.7095 |   2.59e-52 |
|   307 |    65.2644 |   1.64e-52 |
|   311 |    63.1488 |   1.26e-55 |
|   313 |    68.6085 |   7.07e-52 |
|   317 |    63.4099 |   1.72e-57 |
|   331 |    66.3142 |   7.20e-60 |
|   337 |    70.2918 |   1.38e-58 |
|   347 |    71.3334 |   3.83e-61 |
|   349 |    75.8101 |   3.38e-58 |
|   353 |    74.7747 |   2.33e-60 |
|   359 |    80.8957 |   1.35e-57 |
|   367 |    88.7827 |   1.63e-54 |
|   373 |    92.5027 |   7.32e-54 |
|   379 |    86.4056 |   5.67e-60 |
|   383 |    74.2349 |   3.13e-71 |
|   389 |   101.7328 |   9.20e-53 |
|   397 |    86.9403 |   1.96e-65 |
|   401 |    90.3736 |   3.90e-64 |
|   409 |    92.3426 |   2.93e-65 |
|   419 |    95.9756 |   8.42e-66 |
|   421 |    91.1197 |   3.95e-70 |
|   431 |   100.3389 |   1.79e-66 |
|   433 |    95.7909 |   1.77e-70 |
|   439 |    96.2274 |   4.09e-72 |
|   443 |   103.6848 |   6.96e-68 |
|   449 |   105.2126 |   1.07e-68 |
|   457 |   111.9310 |   1.49e-66 |
|   461 |   106.1544 |   7.96e-72 |
|   463 |   116.3193 |   1.74e-65 |
|   467 |   116.2824 |   1.02e-66 |
|   479 |   104.2246 |   3.92e-79 |
|   487 |   116.4034 |   9.12e-73 |
|   491 |   127.2121 |   6.69e-67 |
|   499 |   130.9234 |   5.90e-67 |
|   503 |   118.4955 |   2.60e-76 |
|   509 |   130.9212 |   6.91e-70 |
|   521 |   118.6699 |   6.61e-82 |
|   523 |   135.4400 |   3.43e-71 |
|   541 |   135.9210 |   3.13e-76 |
|   547 |   120.0327 |   2.41e-89 |
|-------+------------+------------|
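
Here’s a sketch of code that could reproduce a row of the table, assuming SymPy and SciPy; the post doesn’t include the code that generated it:

    from sympy import nextprime
    from scipy.stats import chi2

    def table_row(p, N=1000000):
        # Remainders mod p of the next N primes after p
        counts = [0]*p
        q = p
        for _ in range(N):
            q = nextprime(q)
            counts[q % p] += 1
        # Remainder 0 never occurs, so drop that bin
        expected = N/(p - 1)
        stat = sum((c - expected)**2/expected for c in counts[1:])
        # p-value: probability of a fit this good or better
        return stat, chi2.cdf(stat, p - 2)

    print(table_row(3))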

Computing higher moments with a fold

Folds in functional programming are often introduced as a way to find the sum or product of items in a list. In this case the fold state has the same type as the list items. But more generally the fold state could have a different type, and this allows more interesting applications of folds. Previous posts look at using folds to update conjugate Bayesian models and numerically solve differential equations.

This post uses a fold to compute mean, variance, skewness, and kurtosis. See this earlier post for an object-oriented approach. The code below seems cryptic out of context. The object-oriented post gives references for where these algorithms are developed. The important point for this post is that we can compute mean, variance, skewness, and kurtosis all in one pass through the data even though textbook definitions appear to require at least two passes. It’s also worth noting that the functional version is less than half as much code as the object-oriented version.

(Algorithms that work in one pass through a stream of data, updating for each new input, are sometimes called “online” algorithms. This is unfortunate now that “online” has come to mean something else.)

The Haskell function moments below returns the number of samples and the mean, but does not directly return variance, skewness and kurtosis. Instead it returns moments from which these statistics can easily be calculated using the mvsk function.

    -- State: sample count n, mean m1, and running sums m2, m3, m4 of the
    -- 2nd, 3rd, and 4th powers of deviations from the mean
    moments (n, m1, m2, m3, m4) x = (n', m1', m2', m3', m4')
        where
            n' = n + 1
            delta = x - m1
            delta_n = delta / n'
            delta_n2 = delta_n**2
            t = delta*delta_n*n
            m1' = m1 + delta_n
            m4' = m4 + t*delta_n2*(n'*n' - 3*n' + 3) + 6*delta_n2*m2 - 4*delta_n*m3
            m3' = m3 + t*delta_n*(n' - 2) - 3*delta_n*m2
            m2' = m2 + t

    mvsk (n, m1, m2, m3, m4) = (m1, m2/(n-1.0), (sqrt n)*m3/m2**1.5, n*m4/m2**2 - 3.0)                         

Here’s an example of how you would use this Haskell code to compute statistics for the list [2, 30, 51, 72]:

    ghci> mvsk $ foldl moments (0,0,0,0,0) [2, 30, 51, 72]
    (38.75, 894.25, -0.1685, -1.2912)

The foldl applies moments first to its initial value, the 5-tuple of zeros. Then it iterates over the data, taking data points one at a time and visiting each point only once, returning a new state from moments each time. Another way to say this is that after processing each data point, moments returns the 5-tuple that it would have returned if the data had consisted only of the values up to that point.
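
As a sanity check, these numbers agree with what NumPy and SciPy compute directly, provided you match conventions: sample variance with ddof=1, and SciPy’s default biased skewness and excess kurtosis. This check isn’t from the original post:

    import numpy as np
    from scipy.stats import skew, kurtosis

    x = [2, 30, 51, 72]
    print(np.mean(x), np.var(x, ddof=1), skew(x), kurtosis(x))
    # 38.75 894.25 -0.1685... -1.2912...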

For a non-numerical example of folds, see my post on sorting.

Chi-square goodness of fit test example with primes

chi squared

Yesterday Brian Hayes wrote a post about the distribution of primes. He showed how you could take the remainder when primes are divided by 7 and produce something that looks like rolls of six-sided dice. Here we apply the chi-square goodness of fit test to show that the rolls are too evenly distributed to mimic randomness. This post does not assume you’ve seen the chi-square test before, so it serves as an introduction to this goodness of fit test.

In Brian Hayes’ post, he looks at the remainder when consecutive primes are divided by 7, starting with 11. Why 11? Because it’s the smallest prime bigger than 7. Since no prime is divisible by any other prime, all the primes after 7 will have a remainder between 1 and 6 inclusive when divided by 7. So the results are analogous to rolling six-sided dice.

The following Python code looks at prime remainders and (pseudo)random rolls of dice and computes the chi-square statistic for both.

First, we import some functions we’ll need.

    from sympy import prime
    from random import random
    from math import ceil

The function prime takes an argument n and returns the nth prime. The function random produces a pseudorandom number between 0 and 1. The ceiling function ceil rounds its argument up to an integer. We’ll use it to convert the output of random into dice rolls.

In this example we’ll use six-sided dice, but you could change num_sides to simulate other kinds of dice. With six-sided dice, we divide by 7, and we start our primes with the fifth prime, 11.

    num_sides = 6
    modulus = num_sides + 1

    # Find the index of the smallest prime bigger than the modulus
    index = 1
    while prime(index) <= modulus:
        index += 1

We’re going to take a million samples and count how many times we see 1, 2, …, 6. We’ll keep track of our results in an array of length 7, wasting a little bit of space since the 0th slot will always be 0. (Because the remainder when dividing a prime by a smaller number is always positive.)

    # Number of samples
    N = 1000000
    
    observed_primes = [0]*modulus
    observed_random = [0]*modulus

Next we “roll” our dice two ways, using prime remainders and using a pseudorandom number generator.

    for i in range(index, N+index):
        m = prime(i) % modulus
        observed_primes[m] += 1
        m = int(ceil(random()*num_sides))
        observed_random[m] += 1

The chi-square goodness of fit test depends on the observed number of events in each cell and the expected number. We expect 1/6th of the rolls to land in cell 1, 2, …, 6 for both the primes and the random numbers. But in a general application of the chi-square test, you could have a different expected number of observations in each cell.

    # Again waste the 0th slot so expected lines up with the observed counts
    expected = [N/num_sides for i in range(modulus)]

The chi-square test statistic sums (O − E)²/E over all cells, where O stands for “observed” and E stands for “expected.”

    def chisq_stat(O, E):
        return sum( [(o - e)**2/e for (o, e) in zip(O, E)] )

Finally, we compute the chi-square statistic for both methods.

    ch = chisq_stat(observed_primes[1:], expected[1:])
    print(ch)

    ch = chisq_stat(observed_random[1:], expected[1:])
    print(ch)

Note that we chop off the first element of the observed and expected lists to get rid of the 0th element that we didn’t use.

When I ran this I got 0.01865 for the prime method and 5.0243 for the random method. Your results for the prime method should be the same, though you might have a different result for the random method.

Now, how do we interpret these results? Since we have six possible outcomes, our test statistic has a chi-square distribution with five degrees of freedom. It’s one less than the number of possibilities because the total counts have to sum to N; if you know how many times 1, 2, 3, 4, and 5 came up, you can calculate how many times 6 came up.

A chi-square distribution with ν degrees of freedom has expected value ν. In our case, we expect a value around 5, and so the chi-square value of 5.0243 is unremarkable. But the value of 0.01865 is remarkably small. A large chi-square statistic would indicate a poor fit, the observed numbers being suspiciously far from their expected values. But a small chi-square value suggests the fit is suspiciously good, closer to the expected values than we’d expect of a random process.

We can be precise about how common or unusual a chi-square statistic is by computing the probability that a sample from the chi-square distribution would be larger or smaller. The cdf gives the probability of seeing a value this small or smaller, i.e. a fit this good or better. The sf gives the probability of seeing a value this large or larger, i.e. a fit this bad or worse. (The scipy library uses sf for “survival function,” another name for the ccdf, the complementary cumulative distribution function.)

    from scipy.stats import chi2
    print(chi2.cdf(ch, num_sides-1), chi2.sf(ch, num_sides-1))

This says that for the random rolls, there’s about a 59% chance of seeing a better fit and a 41% chance of seeing a worse fit. Unremarkable.

But it says there’s only a 2.5 in a million chance of seeing a better fit than we get with prime numbers. The fit is suspiciously good. In a sense this is not surprising: prime numbers are not random! And yet in another sense it is surprising since there’s a heuristic that says primes act like random numbers unless there’s a good reason why in some context they don’t. This departure from randomness is the subject of research published just this year.

If you look at dice with 4 or 12 sides, you get a suspiciously good fit, but not as suspicious as with 6 sides. But with 8- or 20-sided dice you get a very bad fit, so bad that its probability underflows to 0. This is because the corresponding moduli, 9 and 21, are composite, which means some of the cells in our chi-square test will have no observations. (Suppose m has a proper factor a. Then if a prime p were congruent to a mod m, p would have to be divisible by a.)
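
For example, here’s a quick way to see which remainders actually occur mod 9. This illustration isn’t in the original post:

    from sympy import prime
    # Remainders mod 9 of the primes from 5 on; multiples of 3 never appear
    print(sorted({prime(i) % 9 for i in range(3, 1000)}))
    # [1, 2, 4, 5, 7, 8]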

Update: See the next post for a systematic look at different moduli.

You don’t have to use “dice” that correspond to regular solids. You could consider 10-sided “dice,” for example. For such numbers it may be easier to think of spinners than dice: a spinner divided into 10 equal arc segments.

Related post: Probability that a number is prime

Grateful for failures

I’ve been thinking lately about different things I’ve tried that didn’t work out and how grateful I am that they did not.

The first one that comes to mind is my academic career. If I’d been more successful with grants and publications as a postdoc, it would have been harder to decide to leave academia. I’m glad I left when I did.

When I was in high school I was a fairly good musician. At one point I decided that if I made the all-state band I would major in music. Thank God I didn’t make it.

I’ve looked back at projects that I hoped to get, and then realized it’s a good thing that they didn’t come through.

In each of these examples, I’ve been forced to turn away from something I was moderately good at to pursue something that’s a better fit for me.

I wonder what failure I’ll be grateful for next.

 

Ten spectral graph theory posts

Erdős-Rényi graph

Here are 10 blog posts I wrote earlier this year about spectral graph theory, studying graphs via the eigenvalues of matrices associated with the graphs.