Algorithmic wizardry

Last week I wrote a short commentary on James Hague’s blog post Organization skills beat algorithmic wizardry. This week that post got more traffic than my server could handle. I believe it struck a chord with experienced software developers who know that the challenges they face now are not like the challenges they prepared for in school.

Although I completely agree that “algorithmic wizardry” is over-rated in general, my personal experience has been a little different. My role on projects has frequently been to supply a little bit of algorithmic wizardry. I’ve often been asked to look into a program that is taking too long to run and been able to speed it up by an order of magnitude or two by improving a numerical algorithm. (See an example here.)

James Hague says that “rarely is there some … algorithm that casts a looming shadow over everything else.” I believe he is right, though I’ve been called into projects precisely on those rare occasions when an algorithm does cast a shadow over everything else.

Unix-like shells on Windows

This post gives some notes on ways to create a Unix-like command line experience on Windows, without using a virtual machine like VMWare or a quasi-virtual machine like Cygwin.

Finding Windows ports of Unix utilities is easy. The harder part is finding a shell that behaves as expected. (Of course “as expected” depends on your expectations!)

There have been many projects to port Unix utilities to Windows, particularly GnuWin32 and Gow. Some of the command shells I’ve tried are:

  • Cmd
  • PowerShell
  • Eshell
  • Bash
  • Clink

I’d recommend the combination of Gow and Clink for most people. If you’re an Emacs power user you might like Eshell.

Cmd

The built-in command line on Windows is cmd. It’s sometimes called the “DOS prompt” though that’s misleading. DOS died two decades ago and the cmd shell has improved quite a bit since then.

cmd has some features you might not expect, such as pushd and popd. However, I don’t believe it has anything analogous to dirs to let you see the directory stack.

PowerShell

PowerShell is a very sophisticated scripting environment, but the interactive shell itself (e.g. command editing functionality) is basically cmd. (I haven’t kept up with PowerShell and that may have changed.) This means that writing a PowerShell script is completely different from writing a batch file, but the experience of navigating the command line is essentially the same as cmd.

Eshell

You can run shells inside Emacs. By default, M-x shell brings up a cmd prompt inside an Emacs buffer. You can also use Emacs’ own shell with the command M-x eshell.

Eshell is a shell implemented in Emacs Lisp. Using Eshell is very similar across platforms. On a fresh Windows machine, with nothing like Gow installed, Eshell provides some of the most common Unix utilities. You can use the which command to see whether you’re using a native executable or Emacs Lisp code. For example, if you type which ls into Eshell, you get the response

    eshell/ls is a compiled Lisp function in `em-ls.el'

The primary benefit of Eshell is that provides integration with Emacs. As the documentation says

Eshell is not a replacement for system shells such as bash or zsh. Use Eshell when you want to move text between Emacs and external processes …

Eshell does not provide some of the command editing features you might expect from bash. But the reason for this is clear: if you’re inside Emacs, you’d want to use the full power of Emacs editing, not the stripped-down editing features of a command line. For example, you cannot use ^foo^bar to replace foo with bar in the previous command. Instead, you could retrieve the previous command and edit it just as you edit any other line inside Emacs.

In bash you can use !^ to recall the first argument of the previous command and !$!$ using $_ instead. Many of the other bash shortcuts that begin with ! work as expected: !foo, !!, !-3, etc. Directory navigation commands like cd -, pushd, and popd work as in bash.

Bash

Gow comes with a bash shell, a Windows command line program that creates a bash-like environment. I haven’t had much experience with it, but it seems to be a faithful bash implementation with few compromises for Windows, for better and for worse. For example, it doesn’t understand backslashes as directory separators.

There are other implementations of bash on Windows, but I either haven’t tried them (e.g. win-bash) or have had bad experience with them (e.g. Cygwin).

Clink

Clink is not a shell per se but an extension to cmd. It adds the functionality of the Gnu readline library to the Windows command line and so you can use all the Emacs-like editing commands that you can with bash: Control-a to move to the beginning of a line, Control-k to delete the rest of a line, etc.

Clink also gives you Windows-like behavior that Windows itself doesn’t provide, such as being able to paste text onto the command line with Control-v.

I’ve heard that Clink will work with PowerShell, but I was not able to make it work.

The command editing and history shortcuts beginning with ! mentioned above all work with Clink, as do substitutions like ^foo^bar.

Conclusion

In my opinion, the combination of Gow and Clink gives a good compromise between a Windows and Unix work environment. And if you’re running Windows, a compromise is probably what you want. Otherwise, simply run a (possibly virtual) Linux machine. Attempts to make Windows too Unix-like run down an uncanny valley where it’s easy to waste a lot of time.

QR Codes and Percolation

Percolation theory looks at problems such as the probability of being able to traverse some region with random obstacles. It is motivated by problems such as modeling the flow of a fluid in a porous medium.

Here’s a percolation problem for QR codes: What is the probability that there is a path from one side of a QR code to the opposite side? How far across a QR code would you expect to be able to go? For example, the QR code below was generated from my contact information. It’s not possible to go from one side to the other, and the red line shows what I believe is the deepest path into the code from a side.

This could make an interesting programming exercise. A simple version would be to start with a file of bits representing a particular QR code and find the deepest path into the corresponding image.

The next step up would be to generate simplified QR codes, requiring certain bits to be set, such as the patterns in three of the four corners that allow a QR reader to orient itself.

The next step in sophistication would be to implement the actual QR encoding algorithm, including its error correction encoding, then use this to encode random data.

(Because of the error correction used by QR codes, you could scan the image above and your QR reader would ignore the red path. It would even work if a fairly large portion of the image were missing because the error correction introduces a lot of redundancy.)

Graphemes

Here’s something amusing I ran across in the glossary of Programming Perl:

grapheme A graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. Grapheme, or more fully, a grapheme cluster string is a single user-visible character, which in turn may be several characters (codepoints) long. For example … a “ȫ” is a single grapheme but one, two, or even three characters, depending on normalization.

In case the character ȫ doesn’t display correctly for you, here it is:

Unicode character U_022B

First, graphene has little to do with grapheme, but it’s geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word graphene comes from graphite, the “lead” in pencils. The origin of grapheme has nothing to do with graphene but was an analogy to phoneme.)

Second, the example shows how complicated the details of Unicode can get. The Perl code below expands on the details of the comment about ways to represent ȫ.

This demonstrates that the character . in regular expressions matches any single character, but \X matches any single grapheme. (Well, almost. The character . usually matches any character except a newline, though this can be modified via optional switches. But \X matches any grapheme including newline characters.)

   
# U+0226, o with diaeresis and macron 
my $a = "\x{22B}"; 

# U+00F6 U+0304, (o with diaeresis) + macron 
my $b = "\x{F6}\x{304}";    
     
# o U+0308 U+0304, o + diaeresis + macron   
my $c = "o\x{308}\x{304}"; 

my @versions = ($a, $b, $c);

# All versions display the same.
say @versions;

# The versions have length 1, 2, and 3.
# Only $a contains one character and so matches .
say map {length $_ if /^.$/} @versions;

# All versions consist of one grapheme.
say map {length $_ if /^\X$/} @versions;

Perl regex twitter account

I’ve started a new Twitter account @PerlRegex for Perl regular expressions. My original account, @RegexTip, is for regular expressions in general and doesn’t go into much detail regarding any particular implementation. @PerlRegex goes into the specifics of regular expressions in Perl.

Why specifically Perl regular expressions? Because Perl has the most powerful support for regular expressions (strictly speaking, “pattern matching.”) Other languages offer “Perl compatible” regular expressions, though the degree of compatibility varies and is always less than complete.

I imagine more people have ruled England than have mastered the whole of the Perl language. But it’s possible to use Perl for regular expression processing without learning too much of the wider language.

PerlRegex icon

More data, less accuracy

Statistical methods should do better with more data. That’s essentially what the technical term “consistency” means. But with improper numerical techniques, the the numerical error can increase with more data, overshadowing the decreasing statistical error.

There are three ways Bayesian posterior probability calculations can degrade with more data:

  1. Polynomial approximation
  2. Missing the spike
  3. Underflow

Elementary numerical integration algorithms, such as Gaussian quadrature, are based on polynomial approximations. The method aims to exactly integrate a polynomial that approximates the integrand. But likelihood functions are not approximately polynomial, and they become less like polynomials when they contain more data. They become more like a normal density, asymptotically flat in the tails, something no polynomial can do. With better integration techniques, the integration accuracy will improve with more data rather than degrade.

With more data, the posterior distribution becomes more concentrated. This means that a naive approach to integration might entirely miss the part of the integrand where nearly all the mass is concentrated. You need to make sure your integration method is putting its effort where the action is. Fortunately, it’s easy to estimate where the mode should be.

The third problem is that software calculating the likelihood function can underflow with even a moderate amount of data. The usual solution is to work with the logarithm of the likelihood function, but with numerical integration the solution isn’t quite that simple. You need to integrate the likelihood function itself, not its logarithm. I describe how to deal with this situation in Avoiding underflow in Bayesian computations.

* * *

If you’d like help with statistical computation, let’s talk.

Mental crypto footnotes

I spoke with Manuel Blum this afternoon about his password scheme described here. This post is a few footnotes based on that conversation.

When I mentioned that some people had reacted to the original post saying the scheme was too hard, Blum said that he has taught the scheme to a couple children, 6 and 9 years old, who can use it.

He also said that many people have asked for his slide summarizing the method and asked if I could post it.  You can save the image below to get the full-sized slide.

This slide and my blog post both use a 3-digit password for illustration, though obviously a 3-digit password would be easy to guess by brute force. I asked Blum how long a password using his scheme would need to be so that no one with a laptop would be able to break it. He said that 12 digits should be enough. Note that this assumes the attacker has access to many of your passwords created using the scheme, which would be highly unlikely.

 

Time exchange rate

At some point in the past, computer time was more valuable than human time. The balance changed long ago. While everyone agrees that human time is more costly than computer time, it’s hard to appreciate just how much more costly.

You can rent time on a virtual machine for around $0.05 per CPU-hour. You could pay more or less depending on on-demand vs reserved, Linux vs Windows, etc.

Suppose the total cost of hiring someone — salary, benefits, office space, equipment, insurance liability, etc. — is twice their wage. This implies that a minimum wage worker in the US costs as much as 300 CPUs.

This also implies that programmer time is three orders of magnitude more costly than CPU time. It’s hard to imagine such a difference. If you think, for example, that it’s worth minutes of programmer time to save hours of CPU time, you’re grossly under-valuing programmer time. It’s worth seconds of programmer time to save hours of CPU time.

Open source dissertation

Three cheers for Brent Yorgey! He’s finishing up his dissertation, and he’s posting drafts online, including a GitHub repo of the source.

Cheer 1: He’s not being secretive, fearing that someone will scoop his results. There have been a few instances of one academic scooping another’s research, but these are rare and probably not worth worrying about. Besides, a public GitHub repo is a pretty good way to prove your priority.

Cheer 2: Rather than being afraid someone will find an error, he’s inviting a world-wide audience to look for errors.

Cheer 3: He’s writing a dissertation that someone might actually want to read! That’s not the fastest route to a degree. It’s even actively discouraged in some circles. But it’s generous and great experience.

 

 

 

Software development becoming less mature?

Michael Fogus posted on Twitter this morning

Computing: the only industry that becomes less mature as more time passes.

The immaturity of computing is used to excuse every ignorance. There’s an enormous body of existing wisdom but we don’t care.

I don’t know whether computing is becoming less mature, though it may very well be on average, even if individual developers become more mature.

One reason is that computing is a growing profession, so people are entering the field faster than they are leaving. That lowers average maturity.

Another reason is chronological snobbery, alluded to in Fogus’s second tweet. Chronological snobbery is pervasive in contemporary culture, but especially in computing. Tremendous hardware advances give the illusion that software development has advanced more than it has. What could I possibly learn from someone who programmed back when computers were 100x slower? Maybe a lot.

Related posts:

Bringing bash and PowerShell a little closer together

I recently ran across PSReadLine, a project that makes the PowerShell console act more like a bash shell. I’ve just started using it, but it seems promising. I’m switching between Linux and Windows frequently these days and it’s nice to have a little more in common between the two.

I’d rather write a PowerShell script than a bash script, but I’d rather use the bash console interactively. The PowerShell console is essentially the old cmd.exe console. (I haven’t kept up with PowerShell in a while, so maybe there have been some improvements, but it’s my impression that the scripting language has moved forward and the console has not.) PSReadLine adds some bash-like console conveniences such as Emacs-like editing at the command prompt.

Update: Thanks to Will for pointing out Clink in the comments. Clink sounds like it may be even better than PSReadLine.

PowerShell logo

Haskell is to math as Perl is to English?

Fortran superficially looks like mathematics. Its name comes from “FORmula TRANslation,” and the language does provide a fairly straight-forward way to turn formulas into code. But the similarity to mathematics ends there at the lowest level of abstraction.

Haskell, on the other hand, is more deeply mathematical. Fortran resembles math in the small, but Haskell resembles math in the large. Haskell has mathematical structures at a higher level of abstraction than just formulas. As a recent article put it

While Fortran provides a comfortable translation of mathematical formulas, Haskell code begins to resemble mathematics itself.

On its surface, Perl is one of the least English-like programming languages. It is often criticized as looking like “line noise” because of its heavy use of operators. By contrast, Python has been called “executable pseudocode” because the source is easier to read, particularly for novice programmers. And yet at a deeper level, Perl is more English-like than other programming languages such as Python.

Larry Wall explains in Natural Language Principles in Perl that he designed Perl to incorporate features of spoken language. For example, Perl has analogs of articles and pronouns. (Larry Wall studied both linguistics and computer science in college.) Opinions differ on how well his experiment with incorporating natural language features into programming languages has worked out, but it was an interesting idea.

Related posts:

Where combinator names come from

Today I found out where the one-letter names of some functions in combinatory logic come from. I’d seen these before (for example, in To Mock a Mockingbird) but I had no idea what inspired the names.

These functions — I, K, S, T, and Z — are known as the Schönfinkel combinators, and their names are somewhat mnemonic in German. (Only somewhat. Don’t get your hopes up.)

Definition Name Name origin
λx. x I Identitätsfunktion (identity function)
λx,y. x K Konstanzfunktion (constant function)
λx,y,z. xz(yz) S Verschmelzungsfunktion (amalgamation function)
λx,y,z. xzy T Vertauschungsfunktion (exchange function)
λx,y,z. x(yz) Z Zusammensetzungsfunktion (composition function)

Source: Practical Foundations of Mathematics, footnote on page 89. Available online here.

If you’re not familiar with the notation in the function definitions, see this introduction to lambda calculus.