File character counts

Once in a while I need to know what characters are in a file and how often each appears. One reason I might do this is to look for statistical anomalies. Another reason might be to see whether a file has any characters it’s not supposed to have, which is often the case.

A few days ago Fatih Karakurt left an elegant solution to this problem in a comment:

    fold -w1 file | sort | uniq -c

The fold function breaks the content of a file in to lines 80 characters long by default, but you can specify the line width with the -w option. Setting that to 1 makes each character its own line. Then sort prepares the input for uniq, and the -c option causes uniq to display counts.

This works on ASCII files but not Unicode files. For a Unicode file, you might do something like the following Python code.

import collections

count = collections.Counter()
file = open("myfile", "r", encoding="utf8")
for line in file.readlines():
    for c in line.strip("\n"):
        count[ord(c)] += 1

for p in sorted(list(count)):
    print(chr(p), hex(p), count[p])

Survivalist vi

A few days ago I wrote about computational survivalists, people who prepare to be able to work on computers with only software that is available everywhere. Of course nothing is available everywhere, and so each person interprets “everywhere” to mean computers they anticipate using.

If you need to edit a text file on a Windows computer, you can count on Notepad being there. And if you know how to use Windows, you know how to use Notepad. You may not know advanced features (does Notepad even have advanced features?) but you can get around.

On a Unix-like computer, you can count on vi being there. It was written about four decades ago and ships with every Unix-like system. This post will explain the bare minimum to use vi in a pinch.

Navigating

Arrow keys should work as expected. If for whatever reason your keyboard has no arrow keys or the keys don’t work, you can use the four home row keys h, j, k, and l as arrow keys. Use trial and error until you figure out which way each one moves.

vi has two modes, command mode and insert mode, and it opens in command mode by default. Typing an h, for example, does not insert an h into your text but instead moves the cursor to the left.

Editing

To edit a file, you can enter insert mode by typing i. Now the characters you type will be inserted into your file. What if you want to delete a character? You can delete too, but it’s still called “insert” mode. The arrow keys still navigate, but typing an h, for example, will insert an h into your file. To go back to command mode to navigate more, type the Escape key.

Exiting

If you’re not already in command mode, type ESC to enter command mode.

If you haven’t made any changes, you can type :q and Enter from command mode to exit. If you have made changes, you can use :wq to save the file and exit, and :q! to exit without saving.

Alternatives to Escape

If your keyboard doesn’t have an escape key, as on some Macs, you can use Control-[ as an escape key. (There’s a story behind why this works.)

Similar posts

For daily tips on using Unix, follow @UnixToolTip on Twitter.

UnixToolTip twitter icon

What is a privacy budget?

The idea behind differential privacy is that it doesn’t make much difference whether your data is in a data set or not. How much difference your participation makes is made precise in terms of probability statements. The exact definition doesn’t for this post, but it matters that there is an exact definition.

Someone designing a differentially private system sets an upper limit on the amount of difference anyone’s participation can make. That’s the privacy budget. The system will allow someone to ask one question that uses the whole privacy budget, or a series of questions whose total impact is no more than that one question.

If you think of a privacy budget in terms of money, maybe your privacy budget is $1.00. You could ask a single $1 question if you’d like, but you couldn’t ask any more questions after that. Or you could ask one $0.30 question and seven $0.10 questions.

Some metaphors are dangerous, but the idea of comparing cumulative privacy impact to a financial budget is a good one. You have a total amount you can spend, and you can chose how you spend it.

The only problem with privacy budgets is that they tend to be overly cautious because they’re based on worst-case estimates. There are several ways to mitigate this. A simple way to stretch privacy budgets is to cache query results. If you ask a question twice, you get the same answer both times, and you’re only charged once.

(Recall that differential privacy adds a little random noise to query results to protect privacy. If you could ask the same question over and over, you could average your answers, reducing the level of added noise, and so a differentially private system will rightly charge you repeatedly for repeated queries. But if the system adds noise once and remembers the result, there’s no harm in giving you back that same answer as often as you ask the question.)

A more technical way to get more from a privacy budget is to use Rényi differential privacy (RDP) rather than the original ε-differential privacy. The former simplifies privacy budget accounting due to simple composition rules, and makes privacy budgets stretch further by leaning away from worst-case analysis a bit and leaning toward average-case analysis. RDP depends on a tuning parameter that includes ε-differential privacy, so one can control how much RDP acts like ε-differential privacy by adjusting that parameter.

There are other ways to stretch privacy budgets as well. The net effect is that when querying a large database, you can often ask all the questions like, and get sufficiently accurate answers, without worrying about privacy budget.

Related posts

Queueing theory and regular expressions

people waiting in line

Queueing theory is the study of waiting in line. That may not sound very interesting, but the subject is full of surprises. For example, when a server is near capacity, adding a second server can cut backlog not just in half but by an order of magnitude or more. More on that here.

In this post, I’m not going to look at the content of queueing theory but just the name. When seeing the word queueing, many people assume it is misspelled and want to replace it with queuing. As I write this, my spell checker is recommending that change. But the large majority of people in the field call their subject queueing theory.

Queueing has five consecutive vowels. Are there any other English words with so many vowels in a row? To find out, I did an experiment like the classic examples from Kernighan and Pike demonstrating regular expressions by searching a dictionary for words with various patterns.

The first obstacle I ran into is that apparently Ubuntu doesn’t come with a dictionary file. However this post explains how to install one.

The second obstacle was that queueing wasn’t in the dictionary. I first searched the American English dictionary, and it didn’t include this particular word. But then I searched the British English dictionary and found it.

Now, to find all words with five consecutive vowels, I ran this:

    egrep '[aeiou]{5}' british-english

and got back only one result: queueing. So at least in this dictionary of 101,825 words, the only word with five (or more) consecutive vowels is queueing.

Out of curiosity I checked how many words have at least four consecutive vowels. The command

    grep -E "[aeiou]{4}" british-english | wc -l

returns 40, and the corresponding command with the file american-english returns 39. So both dictionaries contain 39 words with exactly four vowels in a row and no more. But are they they same 39 words? In fact they are. And one of the words is queuing, and so apparently that is an acceptable spelling in both American and British English. But as a technical term in mathematics, the accepted spelling is queueing.

Some of the 39 words are plurals or possessives of other words. For example, the list begins with Hawaiian, Hawaiian’s, and Hawaiians. (Unix sorts capital letters before lower case letters.) After removing plurals and possessives, we have 22 words with four consecutive vowels:

  • Hawaiian
  • Iroquoian
  • Kauai
  • Kilauea
  • Louie
  • Montesquieu
  • Quaoar
  • Rouault
  • aqueous
  • gooier
  • gooiest
  • obsequious
  • obsequiously
  • obsequiousness
  • onomatopoeia
  • pharmacopoeia
  • plateaued
  • plateauing
  • queue
  • queued
  • queuing
  • sequoia

It’s interesting that three of the words are related to Hawaii: Kuaui is an Hawaiian island and Kulauea is a Hawaiian volcano.

I’d never heard of Quaoar before. It’s a Kuiper belt object about half the size of Pluto.

Related posts

Hyperexponential and hypoexponential distributions

There are a couple different ways to combine random variables into a new random variable: means and mixtures. To take the mean of X and Y you average their values. To take the mixture of X and Y you average their densities. The former makes the tails thinner. The latter makes the tails thicker. When X and Y are exponential random variables, the mean has a hypoexponential distribution and the mixture has a hyperexponential distribution.

Hypoexponential distributions

Suppose X and Y are exponentially distributed with mean μ. Then their sum X + Y has a gamma distribution with shape 2 and scale μ. The sum has mean 2μ and variance 2μ². The coefficient of variation, the ratio of the standard deviation to the mean, is 1/√2. The hypoexponential distribution is so-called because its coefficient of variation is less than 1, whereas an exponential distribution has coefficient of variation 1 because the mean and standard deviation are always the same.

The means of X and Y don’t have to be the same. When they’re different, the sum does not have a gamma distribution, and so hypoexponential distributions are more general than gamma distributions.

A hypoexponential random variable can also be the sum of more than two exponentials. If it is the sum of n exponentials, each with the same mean, then the coefficient of variation is 1/√n. In general, the coefficient of variation for a hypoexponential distribution is the coefficient of variation of the means [1].

In the introduction we talked about means rather than sums, but it makes no difference to the coefficient of variation because going from sum to mean divides the mean and the standard deviation by the same amount.

Hyperexponential distributions

Hyperexponential random variables are constructed as a mixture of exponentials rather than an average. Instead selecting a value from X and a value from Y, we select a value from X or a value from Y. Given a mixture probability p, we choose a sample from X with probability p and a value from Y with probability 1 – p.

The density function for a mixture is a an average of the densities of the two components. So if X and Y have density functions fX and fY, then the mixture has density

p fX + (1 – p) fY

If you have more than two random variables, the distribution of their mixture is a convex combination of their individual densities. The coefficients in the convex combination are the probabilities of selecting each random variable.

If X and Y are exponential with means μX and μY, and we have a mixture that selects X with probability p, then then mean of the mixture is the mixture of the means

μ = p μX + (1 – p) μY

which you might expect, but the variance

σ² = p μX ² + (1 – p) μY ² + p(1 – p)(μX – μY

is not quite analogous because of the extra p(1 – p)(μX – μY)² term at the end. If μX = μY this last term drops out and the coefficient of variation is 1: mixing two identically distributed random variables doesn’t do anything to the distribution. But when the means are different, the coefficient of variation is greater than 1 because of the extra term in the variance of the mixture.

Example

Suppose μX = 2 and μY = 8. Then the average of X and Y has mean 5, and so does an equal mixture of X and Y.

The average of X and Y has standard deviation √17, and so the coefficient of variation is √17/5 = 0.8246.

An exponential distribution with mean 5 would have standard deviation 5, and so the coefficient of variation 1.

An equal mixture of X and Y has standard deviation √43, and so the coefficient of variation is √43/5 = 1.3114.

Related posts

[1] If exponential random variables Xi have means μi, then the coefficient of variation of their sum (or average) is

√(μ1² + μ2² + … + μn²) / (μ1 + μ2 + … + μn)

Primes that don’t look like primes

Primes usually look odd. They’re literally odd [1], but they typically don’t look like they have a pattern, because a pattern would often imply a way to factor the number.

However, 12345678910987654321 is prime!

I posted this on Twitter, and someone with the handle lagomoof replied that the analogous numbers are true for bases 2, 3, 4, 6, 9, 10, 16, and 40. So, for example, the hexadecimal number 123456789abcdef10fedcba987654321 would be prime. The following Python code proves this is true.

    from sympy import isprime

    for b in range(2, 41):
        n = 0
        for i in range(0, b):
            n += (i+1)*b**i
        for i in range(b+1, 2*b):
            n += (2*b-i)*b**i
        print(b, isprime(n))

By the way, some have suggested that these numbers are palindromes. They’re not quite because of the 10 in the middle. You can show that the only prime palindrome with an even number of digits is 11. This is an example of how a pattern in the digits leads to a way to factor the number.

Related posts

[1] “All primes are odd except 2, which is the oddest of all.” — Concrete Mathematics

Computational survivalist

Some programmers and systems engineers try to do everything they can with basic command line tools on the grounds that someday they may be in an environment where that’s all they have. I think of this as a sort of computational survivalism.

I’m not much of a computational survivalist, but I’ve come to appreciate such a perspective. It’s an efficiency/robustness trade-off, and in general I’ve come to appreciate the robustness side of such trade-offs more over time. It especially makes sense for consultants who find themselves working on someone else’s computer with no ability to install software. I’m rarely in that position, but that’s kinda where I am on one project.

Example

I’m working on a project where all my work has to be done on the client’s laptop, and the laptop is locked down for security. I can’t install anything. I can request to have software installed, but it takes a long time to get approval. It’s a Windows box, and I requested a set of ports of basic Unix utilities at the beginning of the project, not knowing what I might need them for. That has turned out to be a fortunate choice on several occasions.

For example, today I needed to count how many times certain characters appear in a large text file. My first instinct was to write a Python script, but I don’t have Python. My next idea was to use grep -c, but that would count the number of lines containing a given character, not the number of occurrences of the character per se.

I did a quick search and found a Stack Overflow question “How can I use the UNIX shell to count the number of times a letter appears in a text file?” On the nose! The top answer said to use grep -o and pipe it to wc -l.

The -o option tells grep to output the regex matches, one per line. So counting the number of lines with wc -l gives the number of matches.

Computational minimalism

Computational minimalism is a variation on computational survivalism. Computational minimalists limit themselves to a small set of tools, maybe the same set of tools as computational survivalist, but for different reasons.

I’m more sympathetic to minimalism than survivalism. You can be more productive by learning to use a small set of tools well than by hacking away with a large set of tools you hardly know how to use. I use a lot of different applications, but not as many as I once used.

Related posts

What exactly is a day?

How many days are in a year? 365.

How many times does the earth rotate on its axis in a year? 366.

If you think a day is the time it takes for earth to rotate once around its axis, you’re approximately right, but off by about four minutes.

What we typically mean by “day” is the time it takes for earth to return to the same position relative to the sun, i.e. the time from one noon to the next. But because the earth is orbiting the sun as well as rotating on its axis, it has to complete a little more than one rotation around its axis to bring the sun back to the same position.

Since there are approximately 360 days in a year, the earth has to rotate about 361 degrees to bring the sun back into the same position. This extra rotation takes 1/360th of a day, or about 4 minutes. (We’ll be more precise below.)

Here’s another way to see that the number of rotations has to be more than the number days in a year. To make the numbers simple, assume there are 360 days in a year. Suppose you’re looking up at the noon sun one day. Next suppose it is exactly 180 days later and the earth had rotated 180 times. Because the earth is on the opposite side of the sun, you would be on the opposite side of the sun too. For it to be noon again 180 days later, the earth would have to have made 180.5 rotations, making 361 rotations in a 360-day year.

Sidereal days

Imagine an observer looking straight down at the Earth’s north pole from several light years away. Imagine also an arrow from the center of the Earth to a fixed point on the equator. The time it takes for the observer to see that arrow return to same angle is a sidereal day. The time it takes for that arrow to go from pointing at the sun to pointing at the sun again is a solar day, which is about four minutes longer [1].

If you assume a year has 365.25 (solar) days, but also assume Earth is in a circular orbit around the sun, you’ll calculate a sidereal day to be about 3 minutes and 57 seconds shorter than a solar day. Taking the elliptical shape of Earth’s orbit into account takes about a second off that amount.

Related posts

[1] Technically this is an apparent solar day, a solar day from the perspective of an observer. The length of an apparent solar day varies through the year, but we won’t go into that here.

Harmonographs

In the previous post, I said that Lissajous curves are the result of plotting a curve whose x and y coordinates come from (undamped) harmonic oscillators. If we add a little bit of dampening, multiplying our cosine terms by negative exponentials, the resulting curve is called a harmonograph.

Here’s a bit of Mathematica code to draw harmonographs.

harmonograph[f1_, f2_, ph1_, ph2_, d1_, d2_, T_] := 
    ParametricPlot[{
        Cos[f1 t + ph1] Exp[-d1 t], 
        Cos[f2 t + ph2] Exp[-d2 t]
    }, {t, 0, T}]

For example,

harmonograph[3, 4, 0, Sqrt[2], 0.01, 0.02, 100]

is a slight variation on one of the curves from the previous post, adding dampening terms exp(-0.01t) and exp(-0.01t). The result is the following.

Harmonograph

You can get a wide variety of graphs by playing with the parameters. A couple years ago I wrote about an attempt to reproduce the background of the image in the poster for Alfred Hitchcock’s movie Vertigo. The curve is another example of a harmonograph.

Vertigo harmonograph

Related posts

Lissajous curves and knots

Suppose that over time the x and y coordinates of a point are both given by a harmonic oscillator, i.e.

x(t) = cos(nx t + φx)
y(t) = cos(ny t + φy)

Then the resulting path is called a Lissajous curve.

If you add a z coordinate also given by harmonic oscillator

z(t) = cos(nz t + φz)

then the result is a Lissajous knot. This means that if we project a Lissajous knot onto a coordinate plane, we get a Lissajous curve.

For example, here is a Lissajous knot. See the code at the bottom for parameter details.

Lissajous knot

Here are its projections onto the xy plane,

xy-plane projection

the yz plane,

yz plane projection

and the xz plane.

xz plane projection

If you’ve ever seen an oscilloscope, you may have seen Lissajous curves on its screen.

In order to get a knot, we require that the frequencies, the n‘s, are pairwise relatively prime. We also require that the curve not intersect itself, which imposes a constraint on the phases, the φ’s, as well.

Last week I used torus knots as an example of a family of knots. The Lissajous knots are a distinct family of knots: no torus knot is also a Lissajous knot.

Here’s the Mathematica code to produce the images above.

x[t_] := Cos[3 t]
y[t_] := Cos[4 t + Sqrt[2]]
z[t_] := Cos[5 t + Sqrt[3]]

ParametricPlot3D[{x[t], y[t], z[t]}, {t, 0, 2 Pi}]

ParametricPlot[{x[t], y[t]}, {t, 0, 2 Pi}]
ParametricPlot[{y[t], z[t]}, {t, 0, 2 Pi}]
ParametricPlot[{x[t], z[t]}, {t, 0, 2 Pi}]