Ease of learning vs relearning

Much more is written about how easy or hard some technology is to learn than about how hard it is to relearn. Maybe this is because people are more eager to write about something while the excitement or frustration of their first encounter is fresh.

Advocates of difficult-to-learn technologies say that tools should be optimized for experienced users, that ease of learning is over-rated because you only learn a tool once but use it for much longer. That makes sense if you use a tool continuously. If you use a tool occasionally, however, you might learn it once and relearn it many times.

The ease of relearning a technology should be emphasized more. As you’re learning a programming language, for example, it may be difficult to imagine forgetting it and needing to relearn it down the road. But you might ask yourself

If I put this down for a couple years and then have to come back to it, what language would I wish I’d written it in?

A while back I debated relearning Perl for the kind of text munging projects that Perl was designed for. But not only would I have to relearn Perl once, I’d have to relearn it every time I revisit the code. Perl does not stick in my head without constant use. Awk, on the other hand, is small and simple, and has a lot of the benefits of Perl. You can learn the basics of Awk in a day, and so if you have to, you can relearn it in a day.

Something easy to learn is also easy to relearn.

However, the converse isn’t necessarily true. Some things may be hard to learn but easy to pick back up. For example, I found LaTeX hard to learn but easy to relearn after not using it for several years. A lot of other tools seem almost as hard to relearn every time I pick them up. I think part of what made LaTeX easy to pick back up was its internal consistency. It’s a little quirky, but it has conceptual integrity.

Conceptual integrity

I’ve used Mathematica off and on ever since it came out. Sometimes I’d go for years without using it, but it has always been easy to pick back up. Mathematica is easy to return to because its syntax is consistent and predictable. Mathematica has conceptual integrity. I find R much harder to use because the inconsistent syntax fades from my memory between uses.

Conceptual integrity comes from strong leadership, even a “benevolent dictator.” Donald Knuth shaped TeX and Stephen Wolfram shaped Mathematica. R has been more of an egalitarian effort, and it shows.

The “Tidyverse” of libraries on top of R is more consistent than the base language, due to Hadley Wickham doing so much of the work himself. In fact, the Tidyverse was initially called the “Hadleyverse,” though Hadley didn’t like that name.

Accelerating an alternating series

The most direct way of computing the sum of an alternating series, simply computing the partial sums until the terms get small enough, may not be the most efficient. Euler figured this out in the 18th century.

For our demo we’ll evaluate the Struve function defined by the series

H_\nu(z) = (z/2)^{\nu + 1} \sum_{k=0}^\infty (-1)^k \frac{(z/2)^{2k}}{\Gamma\left(k + 3/2\right ) \, \Gamma\left(k + \nu + 3/2 \right )}

Note that the terms in the series are relatively expensive to evaluate since each requires evaluating a gamma function. Euler’s acceleration method will be inexpensive relative to computing the terms it takes as input.

Here’s what we get by evaluating the first six partial sums for H_{1.2}(3.4):

2.34748
0.67236
1.19572
1.10378
1.11414
1.11332

So if we were to stop here, we’d report 1.11332 as the value of H_{1.2}(3.4).
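If you’d like to check these numbers, here’s a minimal Python sketch of the partial sums, using the standard library’s math.gamma:

    from math import gamma

    def struve_partial_sums(nu, z, n):
        # Partial sums of the series above for H_nu(z).
        prefactor = (z/2)**(nu + 1)
        total = 0.0
        sums = []
        for k in range(n):
            total += (-1)**k * (z/2)**(2*k) / (gamma(k + 1.5) * gamma(k + nu + 1.5))
            sums.append(prefactor * total)
        return sums

    for s in struve_partial_sums(1.2, 3.4, 6):
        print(f"{s:.5f}")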

Next let’s see what we’d get using Euler’s transformation for alternating series. I’ll use full precision in my calculations internally but only display four digits, to save horizontal space that we’ll need shortly.

2.3474 1.5099 
0.6723 0.9340 
1.1957 1.1497 
1.1037 1.1089  
1.1141 1.1137
1.1133 

Now we repeat the process, taking averages of consecutive terms in each column to produce the next column.

2.3474 1.5099 1.2219 1.1319 1.1087 1.1058
0.6723 0.9340 1.0418 1.0856 1.1029 
1.1957 1.1497 1.1293 1.1203 
1.1037 1.1089 1.1113 
1.1141 1.1137 
1.1133 

The terms across the top are the Euler approximations to the series. The final term 1.1058 is the Euler approximation based on all six terms. And it is worse than what we get from simply taking partial sums!

The exact value is 1.11337… and so the sixth partial sum was accurate to four decimal places, but our clever method was only good to one decimal place.

What went wrong?! Euler’s method is designed to speed up the convergence of slowly converging alternating series. But our series converges pretty quickly because it has two terms in the denominator that grow like factorials. When you apply acceleration when you don’t need to, you can make things worse. Since our series was already converging quickly, we would have done better to use Aitken acceleration, the topic of the next post.

But Euler’s method works well when it’s needed. For example, let’s look at the slowly converging series

π = 4 – 4/3 + 4/5 – 4/7 + 4/9 – …

Then we get the following array.

    4.0000 3.3333 3.2000 3.1619 3.1492 3.1445
2.6666 3.0666 3.1238 3.1365 3.1399 
3.4666 3.1809 3.1492 3.1434
2.8952 3.1174 3.1376 
3.3396 3.1578
2.9760

The sequence of partial sums is along the left side, and the sequence of Euler approximations is across the top row. Neither gives a great approximation to π, but the approximation error using Euler’s acceleration method is 57 times smaller than simply using the partial sums.
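Here’s a short Python sketch of the averaging scheme applied to this series. It reproduces the top row above, except that the table truncates digits while Python rounds them, so the last entry prints as 3.1446.

    def euler_transform(partial_sums):
        # Average consecutive entries repeatedly; the first entry of each
        # column gives the successive Euler approximations (the top row above).
        col = list(partial_sums)
        top = [col[0]]
        while len(col) > 1:
            col = [(a + b) / 2 for a, b in zip(col, col[1:])]
            top.append(col[0])
        return top

    # Partial sums of 4 - 4/3 + 4/5 - 4/7 + ...
    sums, total = [], 0.0
    for k in range(6):
        total += (-1)**k * 4 / (2*k + 1)
        sums.append(total)

    print(" ".join(f"{x:.4f}" for x in euler_transform(sums)))
    # 4.0000 3.3333 3.2000 3.1619 3.1492 3.1446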

Journalistic stunt with Emacs

Emacs has been called a text editor with ambitions of being an operating system, and some people semi-seriously refer to it as their operating system. Emacs does not want to be an operating system per se, but it is certainly ambitious. It can be a shell, a web browser, an email client, a calculator, a Lisp interpreter, etc. It’s possible to work all day and never leave Emacs. It would be an interesting experiment to do just that.

Journalist experiment

Journalists occasionally impose some restriction on themselves and write about the experience. For example, Kashmir Hill did an experiment earlier this year, blocking the Big Five tech companies—Amazon, Facebook, Google, Microsoft, and Apple—for a week each, then finally all in the same week, and wrote a series about her experience. It would be interesting for someone to work only from Emacs for a week and write about it.

Living exclusively inside Emacs would be hard. Emacs applications require effort to discover and learn how to use, and different people find different applications worth learning. Someone doing everything in Emacs for the sake of a story would have to use some features they would not otherwise find worthwhile.

Why stay inside Emacs?

The point of using a calculator, for example, inside Emacs is that it lets you stay in your primary work environment. You don’t have to open a new application to do a quick calculation. Also, since everything is text-based, everything can be navigated and edited the same way.

You may have run into a situation using Windows where some text can be copied, such as text inside an edit box, but other text cannot, such as text displayed on a dialog box. That doesn’t happen in Emacs since everything is editable text. Consistency and interoperability sometimes make it worthwhile to do things inside Emacs that could be done more easily in another application.

Finally, everything in Emacs is programmable. Something that is awkward to use manually might still be valuable since it can be automated.

Examples from recent posts

My previous post was about various ways to compute hash functions. I could have added Emacs to the list. Here’s how you could compute the SHA256 hash of “hello world” using Emacs Lisp:

    (secure-hash 'sha256 "hello world")

You could, for example, type the code above in the middle of a document and type Control-x Control-e to evaluate it as a Lisp expression.

I also wrote about software to factor integers recently, and you could do this in Emacs as well. You could pull up the Emacs calculator and type prfac(161393) for example and it would return a list of the prime factors: [251, 643].

Neither of these functions is best of breed. The secure-hash function only supports the most popular hash functions, unlike openssl. And prfac will fail on large inputs, unlike PARI/GP. Emacs is ambitious, but not that ambitious. It doesn’t aim to replace specialized software, but to provide a convenient way to carry out common tasks.

Notes on computing hash functions

A secure hash function maps a file to a string of bits in a way that is hard to reverse. Ideally such a function has three properties:

  1. pre-image resistance
  2. collision resistance
  3. second pre-image resistance

Pre-image resistance means that starting from the hash value, it is very difficult to infer what led to that output; it essentially requires a brute force attack, trying many inputs until something hashes to the given value.

Collision resistance means it’s extremely unlikely that two files would map to the same hash value, either by accident or by deliberate attack.

Second pre-image resistance is like collision resistance except one file is fixed. A second pre-image attack is harder than a collision attack because the attacker can only vary one file.

This post explains how to compute hash functions from the Linux command line, from Windows, from Python, and from Mathematica.

Files vs strings

Hash functions are often applied to files. If a web site makes a file available for download, and publishes a hash value, you can compute the hash value yourself after downloading the file to make sure they match. A checksum could let you know if a bit was accidentally flipped in transit, but it’s easy to deliberately tamper with a file without changing its checksum. A secure hash function, however, makes such tampering infeasible.

You can think of a file as a string or a string as a file, but the distinction between files and strings may matter in practice. When you save a string to a file, you might implicitly add a newline character to the end, causing the string and its corresponding file to have different hash values. The problem is easy to resolve if you’re aware of it.

Another gotcha is that text encoding matters. You cannot hash text per se; you hash the binary representation of that text. Different representations will lead to different hash values. In the examples below, only Python makes this explicit.
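Here’s a quick Python illustration of both gotchas before we get to the command line tools: the trailing newline and the choice of encoding each change the digest.

    import hashlib

    s = "hello world"

    # A trailing newline changes the hash (compare the two openssl examples below).
    print(hashlib.sha256(s.encode("utf-8")).hexdigest())           # b94d...cde9
    print(hashlib.sha256((s + "\n").encode("utf-8")).hexdigest())  # a948...a447

    # So does the text encoding: the UTF-16 bytes differ from the UTF-8 bytes.
    print(hashlib.sha256(s.encode("utf-16")).hexdigest())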

openssl digest

One way to compute hash values is using openssl. You can give it a file as an argument, or pipe a string to it.

Here’s an example creating a file f and computing its SHA256 hash.

    $ echo "hello world" > f
    $ openssl dgst -sha256 f
    SHA256(f)= a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447

We get the same hash value if we pipe the string “hello world” to openssl.

    $ echo "hello world" | openssl dgst -sha256
    a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447

However, echo silently added a newline at the end of our string. To get the hash of “hello world” without this newline, use the -n option.

    $ echo -n "hello world" | openssl dgst -sha256
    b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9

To see the list of hash functions openssl supports, run openssl list --digest-commands. Here’s what I got, though the output could vary with version.

    $ openssl list --digest-commands
    blake2b512 blake2s256 gost     md4
    md5        mdc2       rmd160   sha1
    sha224     sha256     sha3-224 sha3-256
    sha3-384   sha3-512   sha384   sha512
    sha512-224 sha512-256 shake128 shake256
    sm3

A la carte commands

If you’re interested in multiple hash functions, openssl has the advantage of handling various hashing algorithms uniformly. But if you’re interested in a particular hash function, it may have its own command line utility, such as sha256sum or md5sum. These utilities are not named consistently, however. For example, the utility to compute BLAKE2 hashes is b2sum.

hashalot

The hashalot utility is designed for hashing passphrases. As you type in a string, the characters are not displayed, and the input is hashed without a trailing newline character.

Here’s what I get when I type “hello world” at the passphrase prompt below.

    $ hashalot -x sha256
    Enter passphrase:
    b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9

The -x option tells hashalot to output hexadecimal rather than binary.

Note that this produces the same output as

    echo -n "hello world" | openssl dgst -sha256

According to the documentation,

    Supported values for HASHTYPE:
    ripemd160 rmd160 rmd160compat sha256 sha384 sha512

Python hashlib

Python’s hashlib library supports several hashing algorithms. And unlike the examples above, it makes the encoding of the input and output explicit.

    import hashlib
    print(hashlib.sha256("hello world".encode('utf-8')).hexdigest())

This produces b94d…cde9 as in the examples above.

hashlib has two attributes that let you know which algorithms are available. The algorithms_available attribute is the set of hashing algorithms available in your particular instance, and the algorithms_guaranteed attribute is the set of algorithms guaranteed to be available anywhere the library is installed.

Here’s what I got on my computer.

    >>> a = hashlib.algorithms_available
    >>> g = hashlib.algorithms_guaranteed
    >>> assert(g.issubset(a))
    >>> g
    {'sha1', 'sha512', 'sha3_224', 'shake_256', 
     'sha3_256', 'sha256', 'shake_128', 'sha224', 
     'md5', 'sha384', 'blake2s', 'sha3_512', 
     'blake2b', 'sha3_384'}
    >>> a.difference(g)
    {'md5-sha1', 'mdc2', 'sha3-384', 'ripemd160',
     'blake2s256', 'md4', 'sha3-224', 'whirlpool',
     'sha512-256', 'blake2b512', 'sha512-224', 'sm3',
     'shake128', 'shake256', 'sha3-512', 'sha3-256'}

Hashing on Windows

Windows has a utility fciv whose name stands for “file checksum integrity verifier”. It only supports the broken hashes MD5 and SHA1 [1].

PowerShell has a function Get-FileHash that uses SHA256 by default, but also supports SHA1, SHA384, SHA512, and MD5.

Hashing with Mathematica

Here’s our running example, this time in Mathematica.

    Hash["hello world", "SHA256", "HexString"]

This returns b94d…cde9 as above. Other hash algorithms supported by Mathematica: Adler32, CRC32, MD2, MD3, MD4, MD5, RIPEMD160, RIPEMD160SHA256, SHA, SHA256, SHA256SHA256, SHA384, SHA512, SHA3-224, SHA3-256, SHA3-384, SHA3-512, Keccak224, Keccak256, Keccak384, Keccak512, Expression.

Names above that concatenate two names are the composition of the two functions. RIPEMD160SHA256 is included because of its use in Bitcoin. Here “SHA” is SHA-1. “Expression” is a non-secure 64-bit hash used internally by Mathematica.

Mathematica also supports several output formats besides hexadecimal: Integer, DecimalString, HexStringLittleEndian, Base36String, Base64Encoding, and ByteArray.

[1] It’s possible to produce MD5 collisions quickly. MD5 remains commonly used, and is fine as a checksum, though it cannot be considered a secure hash function any more.

Google researchers were able to produce SHA1 collisions, but it took over 6,000 CPU years distributed across many machines.

Software to factor integers

In my previous post, I showed how changing one bit of a semiprime (i.e. the product of two primes) creates an integer that can be factored much faster. I started writing that post using Python with SymPy, but moved to Mathematica because factoring took too long.

SymPy vs Mathematica

When I’m working in Python, SymPy lets me stay in Python. I’ll often use SymPy for a task that Mathematica could do better just so I can stay in one environment. But sometimes efficiency is a problem.

SymPy is written in pure Python, for better and for worse. When it comes to factoring large integers, it’s for worse. I tried factoring a 140-bit integer with SymPy, and killed the process after more than an hour. Mathematica factored the same integer in 1/3 of a second.
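To give an idea of the code involved, here’s a toy-sized SymPy sketch with much smaller primes so that it finishes quickly; the 140-bit semiprime above is far beyond this.

    from time import perf_counter
    from sympy import randprime, factorint

    # Factor the product of two random 30-bit primes. A toy-sized example;
    # SymPy is far too slow for the 140-bit semiprime mentioned above.
    p = randprime(2**29, 2**30)
    q = randprime(2**29, 2**30)
    N = p * q

    start = perf_counter()
    factors = factorint(N)   # returns a dict {prime: exponent}
    print(N, factors, f"{perf_counter() - start:.3f} s")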

Mathematica vs PARI/GP

The previous post factors 200-bit semiprimes. The first example, N = pq where

p = 1078376712338123201911958185123
q = 1126171711601272883728179081277

took 99.94 seconds to factor using Mathematica. I factored a random sample of 13 products of 100-bit primes, and they took an average of 99.1 seconds to factor.

Using PARI/GP, the value of N above took 11.4 seconds to factor. I then generated a sample of 10 products of 100-bit primes, and on average they took 10.4 seconds to factor using PARI/GP.

So in these examples, Mathematica is several orders of magnitude faster than SymPy, and PARI/GP is one order of magnitude faster than Mathematica.

It could be that the PARI/GP algorithms are relatively better at factoring semiprimes. To compare the efficiency of PARI/GP and Mathematica on non-semiprimes, I repeated the exercise in the previous post, flipping each bit of N one at a time and factoring.

This took 240.3 seconds with PARI/GP. The same code in Mathematica took 994.5 seconds. So in this example PARI/GP is about 4 times faster, whereas for semiprimes it was 10 times faster.
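The bit-flipping exercise itself is easy to express. Here’s a sketch in Python with SymPy, which would be far slower than PARI/GP or Mathematica and so is only practical for much smaller values of N:

    from sympy import factorint

    def factor_bit_flips(N):
        # Flip each bit of N in turn and factor the result.
        results = {}
        for i in range(N.bit_length()):
            M = N ^ (1 << i)
            results[i] = factorint(M)
        return results

    # Example with a small semiprime: factor_bit_flips(161393)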

Python and PARI

There is a Python interface to PARI called cypari2. It should offer the convenience of working in Python with the efficiency of PARI. Unfortunately, the installation failed on my computer. I think SageMath interfaces Python to PARI but I haven’t tried it.

Why are regular expressions difficult?

Regular expressions are challenging, but not for the reasons commonly given.

Non-reasons

Here are some reasons given for the difficulty of regular expressions that I don’t agree with.

Cryptic syntax

I think complaints about cryptic syntax miss the mark. Some people say that Greek is hard to learn because it uses a different alphabet. If that were the only difficulty, you could easily learn Greek in a couple days. No, Greek is difficult for English speakers to learn because it is a very different language than English. The differences go much deeper than the alphabet, and in fact the alphabets are not entirely different.

The basic symbol choices for regular expressions — . to match any character, ? to denote that something is optional, etc. — were arbitrary, but any choice would be. As it is, the chosen symbols are sometimes mnemonic, or at least consistent with notation conventions in other areas.

Density

Regular expressions are dense. This makes them hard to read, but not in proportion to the information they carry. Certainly 100 characters of regular expression syntax is harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code. But if a typical pattern described by 100 characters of regular expression were expanded into English prose or C code, the result would be hard to read as well, not because it is dense but because it is verbose.

Crafting expressions

The heart of using regular expressions is looking at a pattern and crafting a regular expression to match that pattern. I don’t think this is difficult, especially when you have some error tolerance. Very often in applications of regular expressions it’s OK to have a few false positives, as long as a human scans the output. For example, see this post on looking for ICD-10 codes.
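For instance, a deliberately rough pattern for ICD-10-style codes (a letter, a digit, an alphanumeric character, then optionally a dot and a few more characters) takes a minute to write and will admit some false positives, which is fine if a human reviews the matches. A sketch in Python:

    import re

    # A deliberately loose pattern for ICD-10-style codes. It will match some
    # non-codes, which is acceptable when a human scans the output.
    icd10ish = re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")

    text = "Diagnoses included E11.9 and I10; see section A.2 for details."
    print(icd10ish.findall(text))   # ['E11.9', 'I10']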

I suspect that many people who think that writing regular expressions is difficult actually find some peripheral issue difficult, not the core activity of describing patterns per se.

Reasons

Now for what I believe are the real reasons why regular expressions are difficult.

Overloaded syntax and context

Regular expressions use a small set of symbols, and so some of these symbols do double duty. For example, symbols take on different meanings inside and outside of character classes. (See point #4 here.) Extensions to the basic syntax are worse. People wanting to add new features to regular expressions (look-behind, comments, named matches, etc.) had to come up with syntax that wouldn’t conflict with earlier usage, which meant strange combinations of symbols that would have previously been illegal.

Dialects

If you use regular expressions in multiple programming languages, you’ll run into numerous slight variations. Can I write \d for a digit, or do I need to write [0-9]? If I want to group a subexpression, do I need to put a backslash in front of the parentheses? Can I write non-capturing groups?
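For the record, here’s how one dialect, the one in Python’s re module, answers those questions; other dialects answer them differently.

    import re

    # In Python's re dialect: \d matches a digit, bare parentheses group and
    # capture, and (?:...) is a non-capturing group.
    m = re.search(r"(\d{4})-(?:\d{2})", "date: 2019-05")
    print(m.group(0))   # '2019-05'
    print(m.group(1))   # '2019'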

These variations are difficult to remember, not because they’re completely different, but because they’re so similar. It reminds me of my French teacher saying “Does literature have a double t in English and one t in French, or the other way around? I can never remember.”

Use

It’s difficult to remember the variations on expression syntax in various programming languages, but I find it even more difficult to remember how to use the expressions. If you want to replace all instances of some regular expression with a string, the way to do that could be completely different in two languages, even if the languages use the exact same dialect of regular expressions.
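For example, in Python a regex replacement goes through a function in the re module rather than a method on the string; other languages put the same operation in entirely different places.

    import re

    # Replace every run of digits with '#'. The pattern syntax may be identical
    # in another language even though the replacement API is completely different.
    print(re.sub(r"\d+", "#", "room 101, floor 3"))   # 'room #, floor #'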

Resources

Here are notes on regular expressions I’ve written over the years, largely for my own reference.

Protecting privacy while keeping detailed date information

A common attempt to protect privacy is to truncate dates to just the year. For example, the Safe Harbor provision of the HIPAA Privacy Rule says to remove “all elements of dates (except year) for dates that are directly related to an individual …” This restriction exists because dates of service can be used to identify people as explained here.

Unfortunately, truncating dates to just the year ruins the utility of some data. For example, suppose you have a database of millions of individuals and you’d like to know how effective an ad campaign was. If all you have is dates to the resolution of years, you can hardly answer such questions. You could tell if sales of an item were up from one year to the next, but you couldn’t see, for example, what happened to sales in the weeks following the campaign.

With differential privacy you can answer such questions without compromising individual privacy. In this case you would retain accurate date information, but not allow direct access to this information. Instead, you’d allow queries based on date information.

Differential privacy would add just enough randomness to the results to prevent the kind of privacy attacks described here, but should give you sufficiently accurate answers to be useful. We said there were millions of individuals in the database, and the amount of noise added to protect privacy goes down as the size of the database goes up [1].
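As a toy sketch of the idea, not any particular product’s implementation, here is the Laplace mechanism applied to a counting query. The noise scale depends on the query’s sensitivity and the privacy budget, not on the size of the database, so large counts are barely affected in relative terms.

    import numpy as np

    rng = np.random.default_rng()

    def private_count(true_count, epsilon=0.1, sensitivity=1.0):
        # Laplace mechanism: noise scale = sensitivity / epsilon.
        # For a counting query the sensitivity is 1, since adding or removing
        # one person changes the count by at most 1.
        return true_count + rng.laplace(scale=sensitivity / epsilon)

    # Noise on the order of 10 hardly matters for a count in the millions.
    print(private_count(2_500_000))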

There’s no harm in retaining full dates, provided you trust the service that is managing differential privacy, because these dates cannot be retrieved directly. This may be hard to understand if you’re used to traditional deidentification methods that redact rows in a database. Differential privacy doesn’t let you retrieve rows at all, so it’s not necessary to deidentify the rows per se. Instead, differential privacy lets you pose queries, and adds as much noise as necessary to keep the results of the queries from threatening privacy. As I show here, differential privacy is much more cautious about protecting individual date information than truncating to years or even a range of years.

Traditional deidentification methods focus on inputs. Differential privacy focuses on outputs. Data scientists sometimes resist differential privacy because they are accustomed to thinking about the inputs, focusing on what they believe they need to do their job. It may take a little while for them to understand that differential privacy provides a way to accomplish what they’re after, though it does require them to change how they work. This may feel restrictive at first, and it is, but it’s not unnecessarily restrictive. And it opens up new possibilities, such as being able to retain detailed date information.

***

[1] The amount of noise added depends on the sensitivity of the query. If you were asking questions about a common treatment in a database of millions, the necessary amount of noise to add to query results should be small. If you were asking questions about an orphan drug, the amount of noise would be larger. In any case, the amount of noise added is not arbitrary, but calibrated to match the sensitivity of the query.

Random sampling from a file

I recently learned about the Linux command line utility shuf from browsing The Art of Command Line. This could be useful for random sampling.

Given just a file name, shuf randomly permutes the lines of the file.

With the option -n you can specify how many lines to return. So it’s doing sampling without replacement. For example,

    shuf -n 10 foo.txt

would select 10 lines from foo.txt.

Actually, it would select at most 10 lines. You can’t select 10 lines without replacement from a file with fewer than 10 lines. If you ask for an impossible number of lines, the -n option is ignored.

You can also sample with replacement using the -r option. In that case you can select more lines than are in the file since lines may be reused. For example, you could run

    shuf -r -n 10 foo.txt

to select 10 lines drawn with replacement from foo.txt, regardless of how many lines foo.txt has. For example, when I ran the command above on a file containing

    alpha
    beta
    gamma

I got the output

    beta
    gamma
    gamma
    beta
    alpha
    alpha
    gamma
    gamma
    beta

I don’t know how shuf seeds its random generator. Maybe from the system time. But if you run it twice you will get different results. Probably.
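If you’d rather do the same thing from a script, the Python standard library has both modes. A rough equivalent, assuming a file named foo.txt:

    import random

    with open("foo.txt") as f:
        lines = f.read().splitlines()

    # Without replacement, like shuf -n 10 (capped at the number of lines).
    print(random.sample(lines, k=min(10, len(lines))))

    # With replacement, like shuf -r -n 10.
    print(random.choices(lines, k=10))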

Cosmic rays flipping bits

A cosmic ray striking computer memory at just the right time can flip a bit, turning a 0 into a 1 or vice versa. While I knew that cosmic ray bit flips were a theoretical possibility, I didn’t know until recently that there had been documented instances on the ground [1].

Radiolab did an episode on the case of a cosmic bit flip changing the vote tally in a Belgian election in 2003. The error was caught because one candidate got more votes than was logically possible. A recount showed that the person in question got 4096 more votes in the first count than in the second count. The difference of exactly 2^12 votes was a clue that there had been a bit flip. All the other counts remained unchanged when they reran the tally.

It’s interesting that the cosmic ray-induced error was discovered presumably because the software quality was high. All software is subject to cosmic bit flipping, but most of it is so buggy that you couldn’t rule out other sources of error.

Cosmic bit flipping is becoming more common because processors have become smaller and more energy efficient: the less energy it takes for a program to set a bit intentionally, the less energy it takes for radiation to set a bit accidentally.

Related post: Six sigma events

[1] Spacecraft are especially susceptible to bit flipping from cosmic rays because they are out from under the radiation shield we enjoy on Earth’s surface.