Category theory for data science: cautious optimism

I’m cautiously optimistic about applications of category theory. I believe category theory has the potential to be useful, but I’m skeptical of most claims to have found useful applications. Category theory has found useful application, especially behind the scenes, but a lot of supposed applications remind me of a line from Colin McLarty:

[Jean-Pierre] Serre created a series of concise elegant tools which Grothendieck and coworkers simplified into thousands of pages of category theory.

To be fair, Grothendieck put these recast tools to good use, but most mere mortals would stand a better chance of being able to use Serre’s versions.

My interest in category theory waxes and wanes, and just as it was at its thinnest crescent phase I ran across CQL, the categorical query language. I haven’t had time to look very far into it, but it seems promising. The site’s modest prose, relative to the revolutionary rhetoric of some category theory enthusiasts, gives me more confidence that the authors may be on to something useful.

Related post: Categorical (Data Analysis) vs (Categorical Data) Analysis

Software to factor integers

In my previous post, I showed how changing one bit of a semiprime (i.e. the product of two primes) creates an integer that can be factored much faster. I started writing that post using Python with SymPy, but moved to Mathematica because factoring took too long.

SymPy vs Mathematica

When I’m working in Python, SymPy lets me stay in Python. I’ll often use SymPy for a task that Mathematica could do better just so I can stay in one environment. But sometimes efficiency is a problem.

SymPy is written in pure Python, for better and for worse. When it comes to factoring large integers, it’s for worse. I tried factoring a 140-bit integer with SymPy, and killed the process after over an hour. Mathematica factored the same integer in 1/3 of a second.
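
To make the comparison concrete, here’s roughly what the SymPy side of that experiment looks like. This is a sketch rather than the exact code I ran; the prime sizes and timings are only illustrative.

    from time import perf_counter
    from sympy import randprime, factorint

    # Two random 70-bit primes give a roughly 140-bit semiprime.
    p = randprime(2**69, 2**70)
    q = randprime(2**69, 2**70)
    n = p * q

    start = perf_counter()
    factorint(n)              # pure Python; expect this to take a long time
    print(perf_counter() - start, "seconds")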

Mathematica vs PARI/GP

The previous post factors 200-bit semiprimes. The first example, N = pq where

p = 1078376712338123201911958185123
q = 1126171711601272883728179081277

took 99.94 seconds to factor using Mathematica. I then factored a random sample of 13 products of 100-bit primes, and they took an average of 99.1 seconds each.

Using PARI/GP, the value of N above took 11.4 seconds to factor. I then generated a sample of 10 products of 100-bit primes, and on average they took 10.4 seconds to factor using PARI/GP.

So in these examples, Mathematica is several orders of magnitude faster than SymPy, and PARI/GP is one order of magnitude faster than Mathematica.

It could be that the PARI/GP algorithms are relatively better at factoring semiprimes. To compare the efficiency of PARI/GP and Mathematica on non-semiprimes, I repeated the exercise in the previous post, flipping each bit of N one at a time and factoring.

This took 240.3 seconds with PARI/GP. The same computation in Mathematica took 994.5 seconds. So in this example PARI/GP is about 4 times faster, whereas for semiprimes it was about 10 times faster.

Python and PARI

There is a Python interface to PARI called cypari2. It should offer the convenience of working in Python with the efficiency of PARI. Unfortunately, the installation failed on my computer. I think SageMath interfaces Python to PARI but I haven’t tried it.
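
For what it’s worth, here is what cypari2 usage is supposed to look like, going by the project’s documentation. I can’t vouch for it, since the installation failed on my machine.

    import cypari2

    pari = cypari2.Pari()

    p = 1078376712338123201911958185123
    q = 1126171711601272883728179081277

    # factor returns a two-column matrix of prime factors and exponents
    print(pari.factor(p*q))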

Related posts

Making public keys factorable with Rowhammer

The security of RSA encryption depends on the fact that the product of two large primes is difficult to factor. So if p and q are large primes, say 2048 bits each, then you can publish n = pq with little fear that someone can factor n to recover p and q.

But if you can change n by a tiny amount, you may make it much easier to factor. The Rowhammer attack does this by causing DRAM memory to flip bits. Note that we’re not talking about breaking someone’s encryption in the usual sense. We’re talking about secretly changing their encryption to a system we can break.

To illustrate on a small scale what a difference changing one bit can make, let p = 251 and q = 643.  Then n = pq = 161393. If we flip the last bit of n we get m = 161392. Although n is hard to factor by hand because it has no small factors, m is easy to factor, and in fact

161392 = 2⁴ × 7 × 11 × 131.
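
You can verify this with a couple lines of Python:

    from sympy import factorint

    print(factorint(161393))   # {251: 1, 643: 1}
    print(factorint(161392))   # {2: 4, 7: 1, 11: 1, 131: 1}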

For a larger example, I generated two 100-bit random primes in Mathematica

p = 1078376712338123201911958185123
q = 1126171711601272883728179081277

and was able to have it factor n = pq in about 100 seconds. But Mathematica was able to factor n-1 in a third of a second.

So far we have looked at flipping the least significant bit. But Rowhammer isn’t that precise. It might flip some other bit.

If you flip any bit of a product of two large primes, you’re likely to get an easier factoring problem, though the difficulty depends on the number you start with and which bit you flip. To illustrate this, I flipped each of the bits one at a time and measured how long it took to factor the result.
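
The experiment itself was done in Mathematica, but a sketch of the logic in Python looks like this. (SymPy would be far too slow for 200-bit numbers, so treat this as a template you’d port to a faster factoring routine.)

    from time import perf_counter
    from sympy import factorint

    p = 1078376712338123201911958185123
    q = 1126171711601272883728179081277
    n = p*q

    times = []
    for i in range(n.bit_length()):
        m = n ^ (1 << i)                 # flip bit i of n
        start = perf_counter()
        factorint(m)
        times.append(perf_counter() - start)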

The median time to factor n with one bit flipped was 0.4 seconds. Here’s a plot of the factoring times as a function of which bit was flipped.

[Plot: factoring times for corrupted product of primes]

The plot shows about 80% of the data. Twenty percent of the time the value was above 11 seconds, and the maximum value was 74 seconds. So in every case flipping one bit made the factorization easier, usually quite a lot easier, but only a little easier in the worst case.

To verify that the results above were typical, I did a second experiment. This time I generated a sequence of pairs of random 100-bit primes. I factored their product, then factored the product with a randomly chosen bit flipped. Here are the factoring times in seconds.

    |----------+---------|
    | Original | Flipped |
    |----------+---------|
    |  117.563 |   3.828 |
    |  108.672 |   4.875 |
    |   99.641 |   0.422 |
    |  103.031 |   0.000 |
    |   99.188 |   0.000 |
    |  102.453 |   0.234 |
    |   79.594 |   0.094 |
    |   91.031 |   0.875 |
    |   64.313 |   0.000 |
    |   95.719 |   6.500 |
    |  131.125 |   0.375 |
    |   97.219 |   0.000 |
    |   98.828 |   0.203 |
    |----------+---------|

By the way, we started this post by talking about maliciously flipping a bit. The same thing can happen without a planned attack if memory is hit by a cosmic ray.

Related posts

Bounds on the nth prime

The nth prime is approximately n log n.

For more precise estimates, there are numerous upper and lower bounds for the nth prime, each tighter over some intervals than others. Here I want to point out upper and lower bounds from a dissertation by Christian Axler on page viii.

First, define

\begin{align*} f(n, k) &= \log n + \log \log n - 1 + \frac{\log \log n - 2}{\log n} \\ &\qquad - \frac{(\log\log n)^2 - 6\log\log n + k}{2\log^2 n} \end{align*}

Then for sufficiently large n, the nth prime number, pn, is bounded above and below by

n f(n, 11.847) < p_n < n f(n, 10.273)

The lower bound holds for n ≥ 2, and the upper bound holds for n ≥ 8,009,824.

For example, suppose we want to estimate the 10 millionth prime. The exact value is 179,424,673. The bounds we get from the equations above are 179,408,030 and 179,438,842.
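
Here’s the formula above typed into Python, if you’d like to compute the bounds yourself. The numbers it prints may differ in the last few digits from the figures quoted here, depending on rounding.

    from math import log

    def f(n, k):
        L = log(n)
        LL = log(L)
        return L + LL - 1 + (LL - 2)/L - (LL**2 - 6*LL + k)/(2*L**2)

    n = 10**7
    print(n*f(n, 11.847), n*f(n, 10.273))   # brackets p_n = 179,424,673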

The width of the bracket bounding pn is 0.787 n / log²n. In our example, the difference between the upper and lower bounds is 30,812.

This width, relative to pn, decreases something like O(1/log³ n). In our example, the width of the bounding interval is 0.017% of pn.

Converting between nines and sigmas

Nines and sigmas are two ways to measure quality. You’ll hear something has four or five nines of reliability or that some failure is a five sigma event. What do these mean, and how do you convert between them?

Definitions

If a system has five nines of availability, that means the probability of the system being up is 99.999%. Equivalently, the probability of it being down is 0.00001.

\underbrace{\mbox{99.999}}_{\mbox{{\normalsize five 9's}}}\%

In general, n nines of availability means the probability of failure is 10⁻ⁿ.

If a system has s sigmas of reliability, that means the probability of failure is the same as the probability of a Gaussian random variable being s standard deviations above its mean [1].

Conversion formulas

Let Φ be the cumulative density function for a standard normal, i.e. a Gaussian random variable with mean zero and standard deviation 1. Then s sigmas corresponds to n nines, where

n = −log₁₀(Φ(−s))

and

s = −Φ⁻¹(10⁻ⁿ).

We’ll give approximate formulas in just a second that don’t involve Φ but just use functions on a basic calculator.
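
If you do want the exact conversions in code, here’s a sketch using SciPy. The survival function norm.sf(s) is the one-tailed probability Φ(−s), and isf is its inverse.

    from math import log10
    from scipy.stats import norm

    def sigmas_to_nines(s):
        return -log10(norm.sf(s))    # norm.sf(s) = P(Z > s)

    def nines_to_sigmas(n):
        return norm.isf(10**-n)      # solves P(Z > s) = 10^-n for s

    print(sigmas_to_nines(5))        # about 6.5 nines
    print(nines_to_sigmas(5))        # about 4.3 sigmas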

Here’s a plot showing the relationship between nines and sigmas.

The plot looks a lot like a quadratic, and in fact if we take the square root we get a plot that’s visually indistinguishable from a straight line. This leads to very good approximations

n = (0.47 + 0.42 s)²

and

s = 2.37 √n – 1.12.

The approximation is so good that it’s hard to see the difference between it and the exact value in a plot.
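
In code, the approximations amount to this (a sketch; the constants are the fitted values above):

    from math import sqrt

    def sigmas_to_nines_approx(s):
        return (0.47 + 0.42*s)**2

    def nines_to_sigmas_approx(n):
        return 2.37*sqrt(n) - 1.12

    print(sigmas_to_nines_approx(5))   # 6.60, vs. 6.54 exact
    print(nines_to_sigmas_approx(5))   # 4.18, vs. about 4.26 exact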

These approximations are more than adequate since nines and sigmas are crude measures, not accurate to more than one significant figure [2].

Related posts

[1] Here I’m considering the one-tailed case, the probability of being so many standard deviations above the mean. You could consider the two-tailed version, where you look at the probability of being so many standard deviations above or below the mean. The two-tailed probability is simply twice the one-tailed probability by symmetry.

[2] As I’ve written elsewhere, I’m skeptical of the implicit normal distribution assumption, particularly for rare events. The normal distribution is often a good modeling assumption in the middle, but not so often in the tails. Going out as far as six sigmas is dubious, and so the plot above covers as much range as is practical and then some.

Maybe it’s just hard

If someone tells you repeatedly that something isn’t hard, maybe it’s just hard.

Monads

A post by Gilad Bracha got me thinking about this. He says

Last time I looked, the Haskell wiki listed 29 tutorials on [monads]. … Could it just be that people just have a hard time understanding monads? If so, what are the prospects of mass adoption?

Monads are not that difficult, but when you have hundreds of tutorials (besides the ones on the Wiki mentioned above) telling you that monads are not hard, maybe they’re hard, at least hard enough for enough people that they’ll never see mass adoption.

I imagine that if something really were much easier once you looked at it the right way, there wouldn’t be that many publications saying so. One publication, or at most a small number, would become popular. For example, Mendeleev noticed that the properties of chemical elements were easier to think about if you arranged the elements in a particular tabular form, and everyone agreed, and Mendeleev’s approach won.

The fact that there are hundreds of monad tutorials suggests that many people have independently had an epiphany about monads and wanted to blog about it. I don’t disparage this at all. Many of my blog posts come about this same way: I’ll understand something and write about it while the inspiration is fresh, for my own future reference as well as for the benefit of others.

I also don’t want to disparage monads. They can be very useful. For example, I know of someone who saved his company a lot of money by consolidating a menagerie of inconsistent projects and imposing a monadic framework to make everyone play nicely together. However—and I think this is telling—he was careful to not use the word “monad.” As I’ve written about before, category theory concepts like monads are often most useful behind the scenes.

Regular expressions

This post is the background for my previous post on regular expressions. I didn’t want to say regular expressions aren’t hard; a lot of people clearly do find them hard. Instead, I wanted to focus on why they’re hard, and give my perspective on perceived reasons and more fundamental reasons.

Unicode

Unicode is similar to regular expressions in that some people find it trivial and some find it maddeningly difficult. Both are right. Unicode can be trivial. The vast majority of pages on the internet use Unicode, specifically UTF-8 encoding. You’re reading Unicode right now. What’s so hard about that?

At its core, Unicode simply assigns numbers to characters, just as ASCII did. In a sense, Unicode is trivial. But it attempts to capture all human writing systems, and writing systems are complicated. As with regular expressions, it’s the peripheral issues that bring the complexity. Unicode is subtle because human language is subtle.

Why are regular expressions difficult?

Regular expressions are challenging, but not for the reasons commonly given.

Non-reasons

Here are some reasons given for the difficulty of regular expressions that I don’t agree with.

Cryptic syntax

I think complaints about cryptic syntax miss the mark. Some people say that Greek is hard to learn because it uses a different alphabet. If that were the only difficulty, you could easily learn Greek in a couple days. No, Greek is difficult for English speakers to learn because it is a very different language than English. The differences go much deeper than the alphabet, and in fact the alphabets are not entirely different.

The basic symbol choices for regular expressions — . to match any character, ? to denote that something is optional, etc. — were arbitrary, but any choice would be. As it is, the chosen symbols are sometimes mnemonic, or at least consistent with notation conventions in other areas.

Density

Regular expressions are dense. This makes them hard to read, but not in proportion to the information they carry. Certainly 100 characters of regular expression syntax is harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code. But if a typical pattern described by 100 characters of regular expression were expanded into English prose or C code, the result would be hard to read as well, not because it is dense but because it is verbose.

Crafting expressions

The heart of using regular expressions is looking at a pattern and crafting a regular expression to match that pattern. I don’t think this is difficult, especially when you have some error tolerance. Very often in applications of regular expressions it’s OK to have a few false positives, as long as a human scans the output. For example, see this post on looking for ICD-10 codes.

I suspect that many people who think that writing regular expressions is difficult actually find some peripheral issue difficult, not the core activity of describing patterns per se.

Reasons

Now for what I believe are the real reasons why regular expressions are hard.

Overloaded syntax and context

Regular expressions use a small set of symbols, and so some of these symbols do double duty. For example, symbols take on different meanings inside and outside of character classes. (See point #4 here.) Extensions to the basic syntax are worse. People wanting to add new features to regular expressions — look-behind, comments, named matches, etc. — had to come up with syntax that wouldn’t conflict with earlier usage, which meant strange combinations of symbols that would have previously been illegal.

Dialects

If you use regular expressions in multiple programming languages, you’ll run into numerous slight variations. Can I write \d for a digit, or do I need to write [0-9]? If I want to group a subexpression, do I need to put a backslash in front of the parentheses? Can I write non-capturing groups?

These variations are difficult to remember, not because they’re completely different, but because they’re so similar. It reminds me of my French teacher saying “Does literature have a double t in English and one t in French, or the other way around? I can never remember.”

Use

It’s difficult to remember the variations on expression syntax in various programming languages, but I find it even more difficult to remember how to use the expressions. If you want to replace all instances of some regular expression with a string, the way to do that could be completely different in two languages, even if the languages use the exact same dialect of regular expressions.
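
For example, here’s a substitution in Python. The pattern would carry over to most other languages nearly unchanged, but the function call around it wouldn’t. (The example is mine, purely for illustration.)

    import re

    text = "Call 555-1234 or 555-9876 for details."

    # Same pattern in any dialect; the surrounding API is what varies.
    print(re.sub(r"\d{3}-\d{4}", "[number]", text))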

Resources

Here are notes on regular expressions I’ve written over the years, largely for my own reference.

Feller-Tornier constant

Here’s kind of an unusual question: What is the density of integers that have an even number of prime factors with an exponent greater than 1?

To define the density, you take the proportion up to an integer N then take the limit as N goes to infinity.

It’s not obvious that the limit should exist. But if it does exist, you might guess that it’s 1/2; it’s a question involving whether something is even or odd, and so even and odd occurrences might balance each other out.

But the result, known as the Feller-Tornier constant, is bigger than 1/2. This seems more reasonable when you consider that zero is an even number and a lot of numbers may not have any prime factors with exponent bigger than 1.

Here’s a little Python code to compute the ratio for N = 10⁶.

    from sympy.ntheory import factorint

    def f(n):
        # True if n has an even number of prime factors with exponent > 1
        exponents = factorint(n).values()
        count = len([e for e in exponents if e > 1])
        return count % 2 == 0

    c = 0
    N = 1000000
    for n in range(1, N+1):
        if f(n):
            c += 1
    print(c/N)

This gives a ratio of 0.661344. The limiting value is 0.661317049….

Computing the Feller-Tornier constant

The code above gives an estimate for the Feller-Tornier constant, but how accurate is it? Is there a more efficient way to estimate it that doesn’t require factoring a lot of numbers?

The constant is given analytically by

c = \frac{1}{2} + \frac{1}{2}\prod_{n=1}^\infty \left(1 - \frac{2}{p_n^2} \right )

where pn is the nth prime.

Here’s some Python code to estimate the Feller-Tornier constant using the expression above.

    from sympy import primerange

    N = 1000000
    product = 1
    for p in primerange(2, N):
        product *= 1 - 2*p**-2   # one factor (1 - 2/p^2) per prime p < N
    print(0.5*(1 + product))

Note that we used primerange rather than prime(n). If you need to generate a range of prime numbers, the former is much more efficient.

The code gives us an estimate of 0.66131707, which is good to seven decimal places. Furthermore, we know this is an upper bound on the exact value because we’ve left out terms in the infinite product that are less than 1.

One could calculate a lower bound as well and thus know an interval that must contain the true value. The lower bound is left as an exercise to the reader.

Translating Robert Burns

Last year Adam Roberts had some fun with Finnegans Wake [1], seeing how little he could edit it and turn it into something that sounded like Return of the Jedi. I wrote a blog post where I quantified the difference between the original and the parody using Levenshtein distance, basically how many edits it takes to go from one to the other.

This morning I wanted to post an example of a more likely use of Levenshtein distance. I’m going to look at the final verse of To a Louse by Robert Burns, and compute the distance between the original Scots version and a translation into more standard English.

Here’s the original:

O wad some Pow’r the giftie gie us
To see oursels as ithers see us!
It wad frae mony a blunder free us,
An’ foolish notion:
What airs in dress an’ gait wad lea’e us,
An’ ev’n devotion!

And here’s the translation:

Oh, would some Power give us the gift
To see ourselves as others see us!
It would from many a blunder free us,
And foolish notion:
What airs in dress and gait would leave us,
And even devotion!

The edit distance between the two versions of the verse is 34. The original has 186 characters, so the translation is about 18% different from the original.
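
If you’d like to try this on your own text, here’s a minimal Levenshtein distance function in Python. The exact number you get depends on how you treat line breaks, apostrophes, and so on.

    def levenshtein(a, b):
        # Classic dynamic programming edit distance:
        # prev[j] holds the distance between a[:i-1] and b[:j].
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,        # deletion
                                curr[j-1] + 1,      # insertion
                                prev[j-1] + cost))  # substitution
            prev = curr
        return prev[-1]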

Hirschberg’s sequence alignment algorithm shows how to line up each version with the other.

From Scots to the translation:

    O|| wa|||d some Pow'|r ||||||||the giftie gie us
    To see oursel||s as i|thers see us!
    It wa|||d frae|| mo|ny a blunder free us,
    An'| foolish notion:
    What airs in dress an'| gait wa|||d lea'|e us,
    An'| ev'|n devotion!

And from the translation back to Scots:

    Oh, w|ould some Pow|er give us the gift|||||||||
    To see ourselves as |others see us!
    It w|ould fr||om m|any a blunder free us,
    An|d foolish notion:
    What airs in dress an|d gait w|ould lea|ve us,
    An|d ev|en devotion!

I first heard this poem as a child, and I think about it fairly often. Part of my job as a consultant is to show companies how I as an outsider see their projects. I need the same input, so I turn to advisors to free me from blunders and foolish notions.

Related posts

[1] Of all books I have never read and have no intention of reading, Finnegans Wake is probably the one I’ve referred to the most. It’s so ridiculously difficult to read that it makes good raw material for humorous posts.

Protecting privacy while keeping detailed date information

A common attempt to protect privacy is to truncate dates to just the year. For example, the Safe Harbor provision of the HIPAA Privacy Rule says to remove “all elements of dates (except year) for dates that are directly related to an individual …” This restriction exists because dates of service can be used to identify people as explained here.

Unfortunately, truncating dates to just the year ruins the utility of some data. For example, suppose you have a database of millions of individuals and you’d like to know how effective an ad campaign was. If all you have is dates to the resolution of years, you can hardly answer such questions. You could tell if sales of an item were up from one year to the next, but you couldn’t see, for example, what happened to sales in the weeks following the campaign.

With differential privacy you can answer such questions without compromising individual privacy. In this case you would retain accurate date information, but not allow direct access to this information. Instead, you’d allow queries based on date information.

Differential privacy would add just enough randomness to the results to prevent the kind of privacy attacks described here, but should give you sufficiently accurate answers to be useful. We said there were millions of individuals in the database, and the amount of noise added to protect privacy goes down as the size of the database goes up [1].
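
Differential privacy doesn’t prescribe a single mechanism, but the classic example is the Laplace mechanism: add Laplace noise with scale set by the query’s sensitivity. Here’s a toy sketch for a counting query, just to make the idea concrete.

    import numpy as np

    rng = np.random.default_rng()

    def private_count(true_count, epsilon):
        # A counting query has sensitivity 1: adding or removing one person
        # changes the count by at most 1. Laplace noise with scale 1/epsilon
        # gives epsilon-differential privacy for this single query.
        return true_count + rng.laplace(scale=1/epsilon)

    # Noise on the order of 1/epsilon barely matters against a count
    # in the millions, which is why large databases give useful answers.
    print(private_count(2_500_000, epsilon=0.1))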

There’s no harm in retaining full dates, provided you trust the service that is managing differential privacy, because these dates cannot be retrieved directly. This may be hard to understand if you’re used to traditional deidentification methods that redact rows in a database. Differential privacy doesn’t let you retrieve rows at all, so it’s not necessary to deidentify the rows per se. Instead, differential privacy lets you pose queries, and adds as much noise as necessary to keep the results of the queries from threatening privacy. As I show here, differential privacy is much more cautious about protecting individual date information than truncating to years or even a range of years.

Traditional deidentification methods focus on inputs. Differential privacy focuses on outputs. Data scientists sometimes resist differential privacy because they are accustomed to thinking about the inputs, focusing on what they believe they need to do their job. It may take a little while for them to understand that differential privacy provides a way to accomplish what they’re after, though it does require them to change how they work. This may feel restrictive at first, and it is, but it’s not unnecessarily restrictive. And it opens up new possibilities, such as being able to retain detailed date information.

***

[1] The amount of noise added depends on the sensitivity of the query. If you were asking questions about a common treatment in a database of millions, the necessary amount of noise to add to query results should be small. If you were asking questions about an orphan drug, the amount of noise would be larger. In any case, the amount of noise added is not arbitrary, but calibrated to match the sensitivity of the query.