Small-scale automation

Saving keystrokes is overrated, but maintaining concentration is underrated.

This post is going to look at automating small tasks in order to maintain concentration, not to save time.

If a script lets you easily carry out some ancillary task without taking your concentration off your main task, that’s a big win. Maybe the script only saves you five seconds, but it could save you from losing a train of thought.

If your goal in writing a script is to preserve concentration, that script has to be effortless to run. It’s worth taking a few minutes to search for a script that is going to save you an hour. But if the purpose of a script is to preserve your state of flow, having to search for it defeats the purpose.

Remembering what you’ve written

I’ve often said to myself “I’ve had to do this task several times. I should automate it!” Good idea, except I couldn’t quickly find the code later when I needed it.

I’ve heard many people say you should automate repetitive tasks; I’ve never heard anyone discuss the problem of remembering what you’ve automated and how to invoke it. Maybe there’s less discussion of the memory problem because the solution is personal. It depends, for instance, on what tools you use and what you find memorable.

One suggestion would be to periodically review what you’ve written, say once a month [1]. Maybe you’ve put useful aliases in your shell configuration file. Skimming that config file occasionally could help you remember what’s there. If you have a directory where you keep custom scripts, it could help to browse that directory once in a while. It helps if there aren’t too many places you need to look, which leads to the next section.

Tool priorities

It would also help to minimize the number of tools you use, or at least the number of tools you customize.

And even with a very minimal tool set, it helps to have a primary emphasis on one of those tools. For example, maybe your work environment consists mostly of a shell, a programming language, and an editor. When it’s not obvious which tool to pick, are you going to write a shell script, a program, or an editor extension? By picking one tool as your default, you get better at that tool, accumulate more sample code for that tool, and have fewer contexts to explore when you’re looking for something you’ve written.

***

[1] A long time ago I heard someone say he reads documentation every Friday afternoon. I did that for a while and recommend it. Maybe set aside a few minutes each Friday afternoon to review and tweak config files. If you don’t get through everything, pick up next week where you left off.

Number of bits in a particular integer

When I think of bit twiddling, I think of C. So I was surprised to read Paul Khuong saying he thinks of Common Lisp (“CL”).

As always when working with bits, I first doodled in SLIME/SBCL: CL’s bit manipulation functions are more expressive than C’s, and a REPL helps exploration.

I would not have thought of Common Lisp being more expressive for bit manipulation than C, though in hindsight perhaps I should have. Common Lisp is a huge language, and a lot of thought went into it. It’s a good bet that if CL supports something it supports it well.

One of the functions Khuong uses is integer-length. I looked it up in Guy Steele’s book. Here’s what he says about the function.

This function performs the computation

⌈log₂(if integer < 0 then −integer else integer + 1)⌉

… if integer is non-negative, then its value can be represented in unsigned binary form in a field whose width is no smaller than (integer-length integer). …

Steele also describes how the function works for negative arguments and why this is useful. I’ve cut these parts out because they’re not my focus here.

I was curious how you’d implement integer-length in C, and so I turned to Hacker’s Delight. This book doesn’t directly implement a counterpart to integer-length, but it does implement the function nlz (number of leading zeros), and in fact implements it many times. Hacker’s Delight points out that for a 32-bit unsigned integer x,

⌊log₂(x)⌋ = 31 − nlz(x)

and

⌈log₂(x)⌉ = 32 − nlz(x − 1).

So nlz(x) corresponds to 32 − (integer-length x).
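
To make the correspondence concrete, here is a short Python sketch (Python rather than C just for brevity; the function names integer_length and nlz32 are mine, not from either book). Python’s int.bit_length plays the role of integer-length for non-negative integers, and nlz falls out of the relation above.

    import math

    def integer_length(n):
        # For non-negative n this matches Common Lisp's integer-length:
        # the number of bits needed to write n in unsigned binary.
        return n.bit_length()

    def nlz32(x):
        # Number of leading zeros of x viewed as a 32-bit unsigned integer,
        # assuming 0 <= x < 2**32.
        return 32 - x.bit_length()

    x = 8675309
    assert nlz32(x) == 32 - integer_length(x)
    assert math.floor(math.log2(x)) == 31 - nlz32(x)
    assert math.ceil(math.log2(x)) == 32 - nlz32(x - 1)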

Hacker’s Delight implements nlz at least 10 times. I say “at least” because it’s unclear whether a variation of sample code discussed in commentary remarks counts as a separate implementation.

Why so many implementations? Typically when you’re doing bit manipulation, you’re concerned about efficiency. Hacker’s Delight gives a variety of implementations, each of which may have advantages in different hardware. For example, one implementation is recommended in the case that your environment has a microcode implementation of popcount. The book also gives ways to compute nlz that involve casting an integer to a floating point number. The advisability of such a technique will be platform-dependent.

If you’re looking for C implementations of integer-length you can find a few on Sean Anderson’s Bit Twiddling Hacks page.

Proof of optimization

Suppose you hire me to solve an optimization problem for you. You want me to find the value of x that minimizes f(x). I go off and work on finding the best value of x. I report back what I found, and you might say “Thanks. That’s a good value of x. But how do I know there’s not an even better value?”

In general this is a hard question to answer. If x were a single number, maybe I could produce a plot of f and show that my x is where f takes on its smallest value. But usually x is a vector, maybe a thousand-dimensional vector. I’m not very good at graphing functions in a thousand dimensions, so this approach isn’t going to work.

I may be able to back up my result by defending the process used to produce it. For example, maybe you ask me for the shortest path through the 254 counties in Texas and I come back with the following tour:

If you ask whether this is optimal, I’ll have to admit that I’m not certain that it is, but I am certain that it is close. The tour was produced using Mathematica, which in turn uses Bill Cook’s Concorde algorithm, which is known to produce near-optimal results. If the tour above isn’t the shortest, the shortest tour isn’t much shorter.

But if the optimization problem is convex then you may be able to certify the result, in a way analogous to how you can certify primes. In general a certificate gives you a way to verify a solution with much less effort than it took to find the solution.

You can prove things by moving back and forth between your original (“primal”) problem and its dual, or between a variation of your original problem and its dual. You may be able to certify that no solution to the constraints of the original problem exists, or certify that a proposed solution is or isn’t optimal. In every case, demonstrating one solution to the dual problem proves something about all solutions to the primal problem.
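
Here is a toy illustration of the idea in Python; the particular linear program is made up for the example. By weak duality, any feasible point for the dual problem gives a lower bound on every feasible value of the primal objective, so exhibiting a dual point whose value matches your candidate solution certifies that no better solution exists.

    import numpy as np

    # Toy primal problem: minimize c.x subject to A x >= b, x >= 0.
    c = np.array([3.0, 2.0])
    A = np.array([[1.0, 1.0]])
    b = np.array([4.0])

    # Candidate solution I claim is optimal.
    x = np.array([0.0, 4.0])
    assert np.all(A @ x >= b) and np.all(x >= 0)    # primal feasible, value 8

    # Dual problem: maximize b.y subject to A^T y <= c, y >= 0.
    y = np.array([2.0])
    assert np.all(A.T @ y <= c) and np.all(y >= 0)  # dual feasible, value 8

    # Weak duality: b.y <= c.x for every primal feasible x,
    # so equal values certify that x is optimal.
    assert c @ x == b @ y == 8.0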

Self-documenting software

The electricity went out for a few hours recently, and because the power was out, the internet was out. I was trying to do a little work on my laptop, but I couldn’t do what I intended to do because I needed a network connection to access some documentation. I keep offline documentation for just this situation, but the information I needed wasn’t in my local files. Or maybe it was there, but I gave up too soon.

This made me think of the Emacs slogan that it is a self-documenting editor. It’s also a very old editor, with roots going back to the 1970s. Originally the phrase “self-documenting” contrasted with software that only had paper documentation. Now it’s common for software to have online documentation, but most software still isn’t self-documenting in the way that Emacs is. The documentation for Emacs is extensive, well-written, and thoroughly integrated with the editor.

Most of the software I use has local documentation, but the documentation is more difficult to use than doing a web search. Maybe the local documentation would be easier to use if I invested more time learning how to use it, but this investment has to be repeated for each application; every application has its own documentation system.

The best approach may be to commit to a small number of tools and learn how each one’s documentation works. I’ve done the former but wish I’d put more work into the latter sooner.

Years ago I had gotten to the point that I was using a menagerie of different software applications, none of which I knew well. Following the advice to use the best tool for the job led to too many tools for a wide variety of jobs. I determined that sometimes using a sub-optimal tool would be optimal overall if it allowed me to switch tools less often.

What I didn’t do at the time, but I’d recommend now, is also to dig into each tool’s documentation system. A web search will always be faster in the moment than learning how to use an arcane help system. (More on this here.) But in the long run, becoming fluent in the local help systems of your most important applications is more efficient, and leads to serendipitous discoveries. It also helps you preserve a state of flow by reducing context switches.

Photo by Valeriy Khan on Unsplash

Pratt Primality Certificates

The previous post implicitly asserted that J = 8675309 is a prime number. Suppose you wanted proof that this number is prime.

You could get some evidence that J is probably prime by demonstrating that

2^(J-1) = 1 mod J.

You could do this in Python by running the following [1].

    >>> J = 8675309
    >>> assert( pow(2, J-1, J) == 1 )

This shows J is a probable prime to base 2.

If you want more evidence, you could also show J is a probable prime to base 3.

    >>> assert( pow(3, J-1, J) == 1 )

But no matter how many bases you try, you won’t have proof that J is prime, only evidence. There are pseudoprimes: (rare) composite numbers that satisfy the necessary-but-not-quite-sufficient conditions of Fermat’s primality test.
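
For example, 341 = 11 × 31 is composite, yet it passes the base 2 test above; it is the smallest Fermat pseudoprime to base 2. (Base 3 happens to expose it, but Carmichael numbers fool the test for every base coprime to them.)

    >>> assert( pow(2, 340, 341) == 1 )   # passes, yet 341 = 11 * 31
    >>> assert( pow(3, 340, 341) != 1 )   # base 3 reveals it's composite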

Primality certificates

A primality certificate is a proof that a number is prime. To be practical, a certificate must be persuasive and efficient to compute.

We could show that J is not divisible by any integer less than √J. That would actually be practical because J is not that large.

    >>> for n in range(2, int(J**0.5)+1):
    ...     assert(J % n > 0)

But we’d like to use J to illustrate a method that scales to much larger numbers than J.

Pratt certificates

Pratt primality certificates are based on a theorem by Lucas [2] that says a number n is prime if there exists a number a such that two conditions hold:

a^(n-1) = 1 mod n,

and

a^((n-1)/p) ≠ 1 mod n

for all primes p that divide n-1.

How do you find a? See this post.
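
Here is a small Python sketch of the Lucas check itself; the function name lucas_certifies is mine. Given a claimed witness a and the prime factors of n − 1, it verifies both conditions.

    def lucas_certifies(n, a, prime_factors):
        # Lucas' theorem: n is prime if a^(n-1) = 1 mod n and
        # a^((n-1)/p) != 1 mod n for every prime p dividing n - 1.
        # Assumes prime_factors really are the prime factors of n - 1.
        if pow(a, n - 1, n) != 1:
            return False
        return all(pow(a, (n - 1)//p, n) != 1 for p in prime_factors)

    assert lucas_certifies(8675309, 2, [2, 2168827])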

Example

To find a Pratt certificate for J, we have to factor J-1. I assert that

J-1 = 8675308 = 4 × 2168827

and that 2168827 is prime. The first condition of Lucas’ theorem, 2^(J-1) = 1 mod J, was verified above; here’s verification of the second condition:

    >>> assert( pow(2, (J-1)//2, J) != 1 )
    >>> assert( pow(2, (J-1)//2168827, J) != 1 )

What’s that you say? You’re not convinced that 2168827 is prime? Well then, we’ll come up with a Pratt certificate for 2168827.

Pratt certificates are generally recursive. To prove that p is prime, we have to factor p-1, then prove all the claimed prime factors of p-1 are prime, etc. The recursion ends when it gets down to some set of accepted primes.

Now I assert that

    2168827 - 1 = 2168826 = 2 × 3 × 11 × 17 × 1933

and that all these numbers are prime. I’ll assume you’re OK with that, except you’re skeptical that 1933 is prime.

The following code is proof that 2168827 is prime, assuming 1933 is prime.

    >>> m = 2168827
    >>> assert( pow(3, m-1, m) == 1 )
    >>> for p in [2, 3, 11, 17, 1933]:
    ...     assert( pow(3, (m-1)//p, m) != 1 )

Finally, we’ll prove that 1933 is prime.

You can verify that

    1933 - 1 = 1932 = 2² × 3 × 7 × 23

and I assume you’re convinced that each of these factors is prime.

    >>> m = 1933
    >>> assert( pow(5, m-1, m) == 1 )
    >>> for p in [2, 3, 7, 23]:
    ...     assert( pow(5, (m-1)//p, m) != 1 )

Pratt certificates can be written in a compact form that verification software can read. Here I’ve made the process more chatty just to illustrate the parts.

Update: Here’s a visualization of the process above, drawing arrows from each prime p to the prime factors of p-1.

In this post we’ve agreed to recognize 2, 3, 7, 11, 17, and 23 as primes. But you only have to assume 2 is prime. That would make a software implementation more elegant, but it would make the example tedious for human consumption.
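
Here’s a rough sketch of what a compact certificate and a verifier might look like. The tuple format (n, witness, certificates for the prime factors of n − 1) is my own invention, not a standard one, and, as in the example above, a handful of small primes are accepted without proof.

    KNOWN_PRIMES = {2, 3, 7, 11, 17, 23}     # accepted without certificates

    def verify_pratt(cert):
        n, a, factor_certs = cert
        if n in KNOWN_PRIMES:
            return True
        factors = [c[0] for c in factor_certs]
        m = n - 1
        for p in factors:                    # check that the listed factors
            while m % p == 0:                # really account for all of n - 1
                m //= p
        if m != 1:
            return False
        if pow(a, n - 1, n) != 1:            # first Lucas condition
            return False
        if any(pow(a, (n - 1)//p, n) == 1 for p in factors):
            return False                     # second Lucas condition
        return all(verify_pratt(c) for c in factor_certs)

    def base(p):                             # certificate for an accepted prime
        return (p, None, [])

    cert_1933 = (1933, 5, [base(2), base(3), base(7), base(23)])
    cert_2168827 = (2168827, 3,
                    [base(2), base(3), base(11), base(17), cert_1933])
    assert verify_pratt((8675309, 2, [base(2), cert_2168827]))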

Efficiency

A primality certificate does not have to be efficient to produce, though of course that would be nice. It has to be efficient to verify. You could imagine that the prime number salesman has more compute power available than the prime number customer. In the example above, I used a computer to generate the Pratt certificate, but it wouldn’t be unreasonable to verify the certificate by hand.

The brute force certificate above, trying all divisors up to √p, obviously takes √p calculations to verify. A Pratt certificate, however, takes about

4 log₂ p

calculations. So verifying a 10-digit prime requires on the order of 100 calculations rather than on the order of 100,000 calculations.
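
To put rough numbers on that comparison for a 10-digit value:

    >>> import math
    >>> p = 10**10                   # a 10-digit number, just for scale
    >>> math.isqrt(p)                # work for the trial-division certificate
    100000
    >>> round(4 * math.log2(p))      # rough work for a Pratt certificate
    133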

Atkin-Goldwasser-Kilian-Morain certificates

Producing Pratt certificates for very large numbers is difficult. Other certificate methods, like Atkin-Goldwasser-Kilian-Morain certificates, scale up better. Atkin-Goldwasser-Kilian-Morain certificates are more complicated to describe because they involve elliptic curves.

Just as Pratt took a characterization of primes by Lucas and turned it into a practical certification method, Atkin and Morain took a characterization of primes by Goldwasser and Kilian, one involving elliptic curves, and turned it into an efficient certification method.

These certificates have the same recursive nature as Pratt certificates: proving that a number is prime requires proving that another (smaller) number is prime.

Update: More on elliptic curve primality proofs.

***

[1] This is a more efficient calculation than it seems. It can be done quickly using fast exponentiation. Note that it is not necessary to compute 2^(J-1) per se; we can carry out every intermediate calculation mod J.
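
A minimal sketch of the idea behind Python’s three-argument pow: square-and-multiply, reducing mod J after every step so intermediate results never grow large. The function name modexp is mine.

    def modexp(a, e, n):
        # Compute a**e mod n by repeated squaring,
        # reducing mod n at every step.
        result = 1
        a %= n
        while e > 0:
            if e & 1:
                result = (result * a) % n
            a = (a * a) % n
            e >>= 1
        return result

    assert modexp(2, 8675308, 8675309) == pow(2, 8675308, 8675309) == 1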

[2] Lucas was French, and so the final s in his name is silent.

The Orange Book

I was spelunking around in Unicode and saw that there’s an emoji for orange book, U+1F4D9.

Orange Book emoji Linux

As is often the case, the emoji renders differently in different contexts. The image above is from my Linux desktop and the image below is from my Macbook. I tried creating an image on my Windows box, but it didn’t have a glyph for this character.

Orange book emoji Apple

When I saw “orange book” my first thought was the US Department of Defense publication Trusted Computer System Evaluation Criteria (TCSEC), aka DoDD 5200.28-STD, commonly known in security circles as the Orange Book. For a split second I thought “Is this book so famous that there’s an emoji for it?!” But of course that’s not the case.

There are emoji for several colors of books, and there are several books in the DOD “Rainbow Series” publications, such as the Green Book for password management. (There’s also an emoji for green book: U+1F4D7.)

TCSEC

So what is The Orange Book? Here’s a photo of the cover.

Department of Defense Trusted Computer System Evaluation Criteria

Here’s how Bruce Schneier describes the book in Applied Cryptography.

The NCSC publishes the infamous “Orange Book.” Its actual title is the Department of Defense Trusted Computer System Evaluation Criteria, but that’s a mouthful to say and the book has an orange cover. The Orange Book attempts to define security requirements, gives computer manufacturers an objective way to measure the security of their systems, and guides them as to what to build into their secure products. It focuses on computer security and doesn’t really say a lot about cryptography.

The Orange Book is now deprecated in favor of The Common Criteria for Information Technology Security Evaluation, ISO/IEC 15408.

Locally invertible floating point functions

Is the function x ↦ x + 2 invertible? Sure, its inverse is the function x ↦ x − 2.

Is the Python function

    def f(x): return x + 2

invertible? Not always.

You might reasonably think the function

    def g(x): return x - 2

is the inverse of f, and it is for many values of x. But try this:

    >>> x = 2**53 - 1.0
    >>> g(f(x)) - x
    -1.0

The composition of f and g does not give us x back because of the limited length of a floating point significand. See Anatomy of a floating point number.

The function f as a function between floating point numbers is locally invertible. That is, it is invertible on a subset of its domain.

Now let’s look at the function

    def f(x): return x*x

Is this function invertible? There is a function, namely sqrt, that serves as an inverse to f for many values of x, but not all x. The function sqrt is a partial function because, although it is ostensibly a function on floating point numbers, it crashes for negative inputs. The function’s actual domain is smaller than its nominal domain.
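
A quick illustration of both points. (Whether a square root routine crashes or returns a NaN for negative input depends on the implementation; Python’s math.sqrt raises an exception.)

    >>> from math import sqrt
    >>> sqrt(f(-3.0))    # f(x) = x*x from above; inverted only up to sign
    3.0
    >>> sqrt(-1.0)       # the nominal domain is larger than the actual one
    Traceback (most recent call last):
      ...
    ValueError: math domain error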

Locally invertible functions are an inevitable part of programming, and are awkward to reason about. But there are tools that help. For example, inverse semigroups.

According to nLab

An inverse semigroup is a semigroup S such that for every element s ∈ S, there exists a unique “inverse” s* ∈ S such that s s* s = s and s* s s* = s*.

The canonical example of an inverse semigroup, and in some sense the only example, is the following, also from nLab.

For any set X, let I(X) be the set of all partial bijections on X, i.e. bijections between subsets of X. The composite of partial bijections is their composite as relations (or as partial functions).

This is the only example in the sense that the Wagner-Preston theorem says every inverse semigroup is isomorphic to a subsemigroup of a semigroup of this form.

In our case, the set X is the set of representable floating point numbers, and locally invertible functions are functions which are invertible, but only when restricted to a subset of X.

Repunits: primes and passwords

A repunit is a number whose base 10 representation consists entirely of 1s. The number consisting of n 1s is denoted Rn.

Repunit primes

A repunit prime is, unsurprisingly, a repunit number which is prime. The most obvious example is R2 = 11. Until recently the repunit numbers confirmed to be prime were Rn for n = 2, 19, 23, 317, 1031. Now the case for n = 49081 has been confirmed.

R49081 = (10^49081 − 1)/9, a string of 49,081 ones.
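
Repunits are easy to compute, and the smaller known cases are quick to check. Here’s a sketch assuming SymPy is available; for numbers this large its isprime is a (very reliable) probable-prime test.

    >>> from sympy import isprime
    >>> def repunit(n): return (10**n - 1) // 9
    >>> repunit(5)
    11111
    >>> [n for n in range(2, 330) if isprime(repunit(n))]
    [2, 19, 23, 317]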

Here is the announcement. The date posted at the top of the page is from March this year, but I believe the announcement is new. Maybe the author edited an old page and didn’t update the shown date.

Repunit passwords

Incidentally, I noticed a lot of repunits when I wrote about bad passwords a few days ago. That post explored a list of commonly used but broken passwords. This is the list of passwords that password cracking software will try first. The numbers Rn are part of the list for the following values of n:

1–45, 47–49, 51, 53–54, 57–60, 62, 67, 70, 72, 77, 82, 84, 147

So 46 is the smallest value of n such that Rn is not on the list. I would not recommend using R46 as a password, but surprisingly there are a lot of worse choices.

The bad password file is sorted in terms of popularity, and you might expect repunits to appear in the file in order, i.e. shorter sequences first. That is sorta true overall. But you can see streaks in the plot below showing multiple runs where longer passwords are more common than shorter passwords.

Exploring bad passwords

If your password is in the file rockyou.txt then it’s a bad password. Password cracking software will find it instantly. (Use long, randomly generated passwords; staying off the list of worst passwords is necessary but not sufficient for security.)

The rockyou.txt file currently contains 14,344,394 bad passwords. I poked around in the file and this post reports some things I found.

To make things more interesting, I made myself a rule that I could only use command line utilities.

Pure numeric passwords

I was curious how many of these passwords consisted only of digits so I ran the following.

    grep -P '^\d+$' rockyou.txt | wc -l

This says 2,346,744 of the passwords only contain digits, about 1 in 6.

Digit distribution

I made a file of digits appearing in the passwords

    grep -o -P '\d' rockyou.txt > digits

and looked at the frequency of digits.

    for i in 0 1 2 3 4 5 6 7 8 9
    do 
        grep -c $i digits
    done

This is what I got:

    5740291
    6734380
    5237479
    3767584
    3391342
    3355180
    3118364
    3100596
    3567258
    3855490

The digits are distributed more evenly than I would have expected. 1’s are more common than other digits, but only about twice as common as the least common digits.

Longest bad passwords

How long is the longest bad password? The command

    wc -L rockyou.txt

shows that one line in the file is 285 characters long. What is this password? The command

    grep -P '.{285}' rockyou.txt

shows that it’s some HTML code. Nice try whoever thought of that, but you’ve been pwned.

A similar search for all-digit passwords shows that the longest numeric passwords are 255 digits long. One of these is a string of 255 zeros.

Dictionary words

A common bit of advice is to not choose passwords that can be found in a dictionary. That’s good advice as far as it goes, but it doesn’t go very far.

I used the comm utility to see how many bad passwords are not in the dictionary by running

    comm -23 sorted dict | wc -l

and the answer was 14,310,684. Nearly all the bad passwords are not in a dictionary!

(Here sorted is a sorted version of the rockyou.txt file; I believe the file is initially sorted by popularity, worst passwords first. The comm utility complained that my system dictionary isn’t sorted, which I found odd, but I sorted it to make comm happy and dict is the sorted file.)

Curiously, the command

    comm -13 sorted dict | wc -l

shows there are 70,624 words in the dictionary (specifically, the american-english file on my Linux box) that are not on the bad password list.

Smallest ‘good’ numeric password

What is the smallest number not in the list of pure numeric passwords? The following command strips leading zeros from purely numeric passwords, sorts the results as numbers, removes duplicates, and stores the results in a file called nums.

    grep -P '^\d+$' rockyou.txt | sed 's/^0\+//' | sort -n | uniq > nums

The file nums begins with a blank line. I removed this with sed.

    sed -i 1d nums

Next I used awk to print instances where the line number does not match the line in the file nums.

    awk '{if (NR-$0 < 0) print $0 }' nums | less

The first number this prints is 61. This means that the first line is 1, the second line is 2, and so on, but the 60th line is 61. That means 60 is missing. The file rockyou.txt does not contain 60. You can verify this: the command

    grep '^60$' rockyou.txt

returns nothing. 60 is the smallest number not in the bad password file. There are passwords that contain ’60’ as a substring, but just 60 as a complete password is not in the file.

Repeat shell command replacing a word

Suppose you’ve typed a long command and you need to rerun it with a small modification. Say you need to replace foo with bar. Bash will let you do this with ^foo^bar^. And although you’re supposed to put the final caret on the end, it will let you get by without it.

    $ echo foo
    foo
    $ ^foo^bar
    bar

Great. Now suppose foo appears twice in a command.

    $ echo foo foo
    foo foo
    $ ^foo^bar
    bar foo

Surprise! Only the first foo gets replaced.

The way to fix this is to use !:gs/foo/bar/ to do a global replacement, i.e. to replace all instances of foo with bar. If you really only want to replace the first instance of foo then use the same command without the g.

    $ echo foo foo
    foo foo
    $ !:gs/foo/bar/
    bar bar
    $ !:s/foo/bar/
    bar foo

You can leave off the final slash and everything will still work.

What about zsh?

The zsh shell works a lot like bash, so much so that I forget that I’m using zsh until something unexpected happens. “Oh yeah, I’m on my Mac [1]. This is zsh and not bash.”

So how do the examples above run under zsh? The simple substitution ^foo^bar to replace the first instance of foo with bar works just the same.

However, zsh offers another possibility, one that isn’t supported in bash. You can append :G to the substitution command with carets.

    % echo foo foo
    foo foo
    % ^foo^bar^:G
    bar bar

Here the final caret is critical. Without it you replace the first instance of foo with bar:G.

Event designators

These features are well documented, but it’s hard to find the documentation if you don’t know the buzzword to search for. If you try searching on “string replacement” and things like that, you’re likely to find things other than what you’re looking for.

For bash, the magic phrase is “event designator.” If you type man bash and scroll down to the section entitled Event Designators you’ll find everything mentioned here and more. On zsh the magic phrase is “history expansion.”

[1] I use bash on Linux and zsh on Mac. When in Rome, do as the Romans do. Though I do remap a few keys in order to use muscle memory across platforms.