Learn one sed command

Posted on 19 April 2011 by John

You may have seen sed programs even if you didn’t know that’s what they were. In online discussions it’s common to hear someone say

s/foo/bar/

as a shorthand to mean “replace foo with bar.” The line s/foo/bar/ is a complete sed program to do such a replacement.

sed comes with every Unix-like operating system and is available for Windows here. It has a range of features for editing files, but sed is worth using even if you only know how to do one thing with it:

sed "s/pattern1/pattern2/g" file.txt > newfile.txt

This will replace every instance of pattern1 with pattern2 in the file file.txt and will write the result to newfile.txt. The original file file.txt is unchanged.

I used to think there was no reason to use sed when other languages like Python will do everything sed does and much more. Suppose you agree with that. Now suppose you find you often have to make global search-and-replace operations and so you write a script to do this, say a Python script. You’ve got to call your script something, remember what you called it, and put it in your path. How about calling it sed? Or better, don’t write your script, but pretend that you did. If you’re on Linux, it’s already in your path. One advantage of the real sed over your script named sed is that the former can do a lot more, should you ever need it to.

Now for a few details regarding the sed command above. The “s” on the front stands for “substitute” and the “g” on the end stands for “global.” Without the “g” on the end, sed would only replace the first instance of the pattern on each line. If that’s what you want, then remove the “g.”
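To make the effect of the trailing “g” concrete, here is a minimal Python sketch of the same idea. It is only an analogue, not how sed is implemented: the helper name substitute and the sample lines are made up for illustration, and Python's regular expression syntax differs from sed's default syntax.

```python
import re

def substitute(line, pattern, replacement, global_flag=True):
    """Rough analogue of sed's s/pattern/replacement/ applied to one line."""
    # In re.sub, count=0 replaces every match; count=1 stops after the first.
    count = 0 if global_flag else 1
    return re.sub(pattern, replacement, line, count=count)

# Hypothetical sample input, one string per line of a file.
lines = ["foo foo foo", "no match here", "a foo and another foo"]

# Like s/foo/bar/ : only the first match on each line changes.
first_only = [substitute(line, "foo", "bar", global_flag=False) for line in lines]

# Like s/foo/bar/g : every match on each line changes.
every_match = [substitute(line, "foo", "bar") for line in lines]

print(first_only)
print(every_match)
```

Note that “first instance” is per line, not per file: the third sample line still has its first foo replaced even when the “g” is left off.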

The patterns inside a sed command are regular expressions, so it’s best to get in the habit of always quoting sed commands. This isn’t necessary for simple string substitutions, but regular expressions often contain characters that you’ll need to prevent the shell from interpreting.

You may find the default regular expression support in sed odd or restrictive. If you’re used to regular expressions in Perl, Python, JavaScript, etc. and you’re using a GNU implementation of sed, you can add the -r option, which turns on extended regular expressions, for more familiar syntax.

I got the idea for this post from Greg Grothaus’ post Why you should learn just a little Awk. He makes a good case that you can benefit from learning just a few commands of a language like Awk with no intention to learn more of the language.

Third-system effect

Posted on 18 April 2011 by John

The third-system effect describes a simple system rising like a phoenix out of the ashes of a system that collapsed under its own complexity.

A notorious ‘second-system effect’ often afflicts the successors of small experimental prototypes. The urge to add everything that was left out the first time around all too frequently leads to huge and overcomplicated design. Less well known, because less common, is the ‘third-system effect’: sometimes, after the second system has collapsed of its own weight, there is a chance to go back to simplicity and get it right.

From The Art of Unix Programming by Eric S. Raymond. Available online here.

Raymond says that Unix was such a third system. What are other examples of the third-system effect?

Personal organization software

Posted on 15 April 2011 by John

I’ve tried various strategies and pieces of software for personal organization and haven’t been happy with most of them. I’ll briefly describe my criteria and what I’ve found.

My needs are fairly simple. I don’t need or want something that could scale to running a multinational corporation.

I’d like something with a portable, transparent data format. I don’t want the data stored in a hidden file or in a proprietary format. I’d like to be able to read the data without the software that was used to write it.

I’d like to be as structured or unstructured as I choose and not have to conform to a rigid database schema. I’d like to be able to do ad hoc queries as well as strongly typed queries.

I’d like something that exports to paper easily.

Here’s what I found: org-mode. It’s an Emacs mode for editing text files. It provides sophisticated functionality, but all the sophistication is in the software, not the data format. It’s more convenient to work with org-mode files in Emacs, but the raw file format is just a lightweight markup, easy for a person or a computer to parse.

When I went back to using Emacs a year ago after a 15-year hiatus, I heard good things about org-mode but didn’t understand what people liked about it. I heard it described as a to-do list manager and was not impressed. I’m not interested in the features I was first introduced to: tracking the status of to-do items and making agendas. I still don’t use those features. It took me a while to realize that org-mode was what I had been looking for. It was similar in spirit to something I’d thought about writing.

Emacs is an acquired taste. But someone who doesn’t use Emacs could get some good ideas from looking at org-mode. I imagine some people have borrowed its ideas and implemented them for other editors. If not, someone should.

The org-mode site has links to numerous introductions and tutorials. I like the FLOSS Weekly interview with org-mode’s creator Carsten Dominik. In it he explains his motivation for writing org-mode and gives a high-level overview of its features.

Significance testing and Congress

Posted on 14 April 2011 by John

The US Supreme Court’s criticism of significance testing has been in the news lately. Here’s a criticism of significance testing involving the US Congress. Consider the following syllogism.

If a person is an American, he is not a member of Congress.
This person is a member of Congress.
Therefore he is not American.

The initial premise is false, but the reasoning is correct if we assume the initial premise is true.

The premise that Americans are never members of Congress is clearly false. But it’s almost true! The probability of an American being a member of Congress is quite small, about 535/309,000,000. So what happens if we try to salvage the syllogism above by inserting “probably” in the initial premise and conclusion?

If a person is an American, he is probably not a member of Congress.
This person is a member of Congress.
Therefore he is probably not American.

What went wrong? The probability is backward. We want to know the probability that someone is American given he is a member of Congress, not the probability he is a member of Congress given he is American.

Science continually uses flawed reasoning analogous to the example above. We start with a “null hypothesis,” a hypothesis we seek to disprove. If our data are highly unlikely assuming this hypothesis, we reject that hypothesis.

If the null hypothesis is correct, then these data are highly unlikely.
These data have occurred.
Therefore, the null hypothesis is highly unlikely.

Again the probability is backward. We want to know the probability of the hypothesis given the data, not the probability of the data given the hypothesis.

We can’t reject a null hypothesis just because we’ve seen data that are rare under this hypothesis. Maybe our data are even more rare under the alternative. It is rare for an American to be in Congress, but it is even more rare for someone who is not American to be in the US Congress!
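The reversal can be checked with Bayes’ theorem. The sketch below uses the 535 and 309,000,000 figures from above; the world population of 6.9 billion and the assumption that nobody outside the US citizenry sits in Congress are mine, added only to make the arithmetic runnable.

```python
# Figures from the text: about 535 members of Congress among about 309 million Americans.
americans = 309_000_000
members_of_congress = 535

# Assumption (not from the text): a rough world population of 6.9 billion.
world = 6_900_000_000

p_american = americans / world
p_congress_given_american = members_of_congress / americans

# Assumption: members of Congress must be US citizens, so this probability is zero.
p_congress_given_not_american = 0.0

# Bayes' theorem: P(American | Congress) =
#   P(Congress | American) P(American) / P(Congress)
numerator = p_congress_given_american * p_american
denominator = numerator + p_congress_given_not_american * (1 - p_american)
p_american_given_congress = numerator / denominator

print(f"P(Congress | American) = {p_congress_given_american:.2e}")
print(f"P(American | Congress) = {p_american_given_congress:.2f}")
```

The first probability is tiny and the second is 1. The data are rare under the hypothesis “American,” but impossible under the alternative, which is why rarity under one hypothesis alone settles nothing.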

I found this illustration in The Earth is Round (p < 0.05) by Jacob Cohen (1994). Cohen in turn credits Pollard and Richardson (1987) in his references.

A magic king’s tour

Posted on 13 April 2011 by John

After posting about a magic square made from a knight’s tour, I wondered whether there are magic squares made from a king’s tour. (A king can move one square in any direction. A tour is a sequence of moves that lands on each square of a chess board exactly once.) I found George Jelliss’ site via the comments to that post and found out that there are indeed magic king’s tours. Here’s one published in 1917.

Here’s the path a king would take in the square above:

The knight’s tour magic square had rows and columns that sum to 260, though the diagonals did not. In fact, someone has proved that a knight’s tour on an 8×8 board cannot be diagonally magic. (Thanks John V.)

In the king’s tour above, however, the rows, columns, and diagonals all sum to 260. George Jelliss has posted notes that classify all such magic squares that have biaxial symmetry. See his site for much more information.

How insignificant is statistical significance?

Posted on 13 April 2011 by John

Luis Pericchi sent me a brief note commenting on the recent US Supreme Court decision involving statistical significance and medical reporting. Here is his paper, about a page and a half.

How insignificant is statistical significance? (PDF)

Evaluating weather forecast accuracy: an interview with Eric Floehr

Posted on 12 April 2011 by John

Eric Floehr is the owner of ForecastWatch, a company that evaluates the accuracy of weather forecasts. In this interview Eric explains what his business does, how he got started, and some of the technology he uses.

JC: Let’s talk about your business and how you got started.

EF: I’m a programmer by trade. I got a computer science degree from Ohio State University and took a number of programming jobs, eventually ending up in management.

I’ve also always been interested in weather. A couple years ago my Mom showed me my baby book. At five years old it said “He’s interested in space, dinosaurs, and the weather.” I’m not as interested in dinosaurs now, but still interested in space and the weather.

When I was working as a programmer, and especially when I was a manager, I liked to do little programming projects to learn things. So when I ran across Python I thought about what I could write. I’d wondered whether there was any difference in the accuracy of various weather services: AccuWeather, Weather.gov, etc. Did they use different models, or did they all get their data from the National Weather Service and just package it up differently? So I wrote a little Python web scraper to pull forecasts from various places and compare them with observations. I kept doing that and realized there really were differences between the forecasters.

I didn’t start out for this to be a business. It just started out to satisfy personal curiosity. It just kept growing every year. In my last position before going out on my own I was CTO for a company that made a backup appliance. We got to the point where the product was mature and doing well. ForecastWatch was taking more and more of my time because I was getting more business from it, and so I decided to make the switch. That was March 2010. Revenue doubled over the next year and it looks like this year it will double again. Things are going well and I really enjoy it.

JC: So you hadn’t been doing this that long when we met last year at SciPy in Austin.

EF: No, I’d only been doing this full time for a few months. But I’d been doing this part-time since 2004.

I didn’t have full-time revenue when I was doing this part-time. But it’s amazing. Once you have the time to focus on something, the opportunities that you hadn’t had time to notice before suddenly open up. Just the act of making something your focus almost makes your goal come to fruition. For years you think “too risky, too risky” and then once you make that jump, things fall in place.

JC: So what exactly is the product you sell?

EF: There are two main components. There’s an online component that is subscription-based. It provides monthly aggregated statistics on forecasts versus actual observations. It has absolute errors, min and max errors, Brier score, all kinds of statistics. It evaluates forecasts for precipitation, high and low temperature, opacity, wind speed and direction, etc. Meteorologists use those statistics to evaluate their forecasts to see how they’re doing relative to their peers.
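[Editorial aside: for readers unfamiliar with the statistics Eric mentions, here is a minimal sketch of two of them using their standard definitions. It is not ForecastWatch’s code; the function names and the sample numbers are made up for illustration.]

```python
def mean_absolute_error(forecasts, observations):
    """Average size of the miss, e.g. for high-temperature forecasts."""
    if len(forecasts) != len(observations):
        raise ValueError("need one observation per forecast")
    return sum(abs(f - o) for f, o in zip(forecasts, observations)) / len(forecasts)

def brier_score(forecast_probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    0 is a perfect score; lower is better.
    """
    if len(forecast_probs) != len(outcomes):
        raise ValueError("need one outcome per forecast")
    return sum((p - o) ** 2 for p, o in zip(forecast_probs, outcomes)) / len(outcomes)

# Made-up illustrative numbers, not ForecastWatch data.
forecast_highs = [70, 65, 80]
observed_highs = [72, 65, 77]
rain_probs = [0.9, 0.2, 0.5, 0.0]   # forecast probability of precipitation
rained = [1, 0, 1, 0]               # 1 if it rained, 0 if not

print(mean_absolute_error(forecast_highs, observed_highs))
print(brier_score(rain_probs, rained))
```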

The second component is research reports. Sometimes meteorologists will commission a report to show how well they’re doing. These reports are based on standard, widely-accepted metrics and time-frames, so they can’t just cherry-pick criteria they happened to do well on. But if they see there are statistics in ForecastWatch where they are doing really well, they might want to tell their customers. I’ve also created reports for media companies, large Internet service providers, energy trading companies and other companies who were evaluating weather forecast providers or want some other data analysis related to weather forecasts.

Something else, and I don’t know whether this will become a major component, but another area some people are interested in is historical forecasts. I have agreements with some of the weather forecasting companies to sell their forecasts that are no longer forecasts. Some people find this information valuable. For example, a marketer with a major sports league wanted to know how weather forecasts affected attendance. Another example was an investment manager who was looking to invest in a business whose performance he believed had some correlation with weather forecasts. For example, a ski lodge might want to know how far out people base their decisions on forecasts.

I have this data back to 2004. It’s funny, but most weather forecasting companies historically have not kept their forecasts. Their bread-and-butter is the forecast in the future. Once that future becomes the past, they saw no value in that data until recently.

Incidentally, because I’m monitoring weather forecasters’ websites, I sometimes let them know about errors they were unaware of.

JC: What volume of data are you dealing with?

EF: I have about 200,000,000 forecast data points back to 2004. I’m adding about 130,000 data points a day. My database is something on the order of 70 GB. That’s observation data, hourly forecasts, metadata, etc. Right now I’m looking at data from about 850 locations in the US and about 50 in Canada. I’m looking to expand that both domestically and internationally.

JC: So what kind of technology are you using?

EF: I’m running a LAMP stack: Linux, Apache, MySQL, Python. Originally I was on Red Hat Linux but I’ve switched to Ubuntu server. I’m using Django for the website. Everything is in Python: the scrapers are in Python, the website is in Python, all the administrative back-end is in Python.

There are two websites right now: ForecastWatch.com, which is the subscription, professional site, and a free consumer site ForecastAdvisor.com. The consumer site will give you a local forecast and a measure of the accuracy for various forecasters for your weather.

JC: And who are your customers?

EF: All the major weather forecast companies. Also some financial companies, logistics and transportation companies, etc. I’m just starting to expand more into serving companies that depend on meteorological forecasts whereas in the past I’ve focused directly on meteorologists.

JC: Let’s talk a little more about the entrepreneurial aspect of your business.

EF: Well, for one thing, I don’t think I’d ever have done this if I’d thought about doing it to make money. There’s not an enormous market for this service, but in a way that’s good. I came from a completely technical background. There’s not a marketing or sales gene in my body and I’ve had to learn a lot. ForecastWatch has given me a great opportunity to learn about those non-technical areas of a business that were so foreign to me before.

I got into this entirely for my own use. And I thought that maybe there was already something that did what I wanted, and in the process of trying to find what’s out there I discovered an unmet need. Even though all the major forecasters said that accuracy was the number one thing they were interested in, they weren’t effectively measuring their accuracy. I thought that if I’m interested in this, maybe other people are too.

At first pricing was a mystery to me. Maybe I needed a new laptop, so I’d charge someone the price of a laptop for some analysis. I had to learn the value of my time and my product.


Slide rules

Posted on 11 April 2011 by John

Mike Croucher raises an important point for teachers: Are graphical calculators pointless? I think they are. I resented having to buy my daughter an expensive calculator when I could have bought her a netbook for not much more money.

Calculators are obsolete. I can’t remember the last time I used one. On the other hand, it could be valuable to have students use something really obsolete: a slide rule. Not for long, maybe just for a week or two.

Slide rules are basically strips of log-scale paper. If you play with a slide rule long enough, you might get a tangible feel for logarithms.
Slide rules make you concentrate on orders of magnitude. A slide rule will give you the significant digits, but you have to know what power of ten to use.
Slide rules give you a tangible sense of significant figures. You can’t report more than three significant figures because you can’t see more than three significant figures. Maybe some experience with a slide rule would break students of the habit of reporting every decimal that comes out of their calculators.
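The three points above can be imitated in a few lines of Python. This is only a sketch of the idea, with a made-up function name: it multiplies by adding logarithms, returns the digits and the power of ten separately, and keeps about three significant figures.

```python
import math

def slide_rule_multiply(a, b, sig_figs=3):
    """Multiply two positive numbers roughly the way a slide rule does."""
    # Sliding one log scale along another adds lengths, i.e. adds logarithms.
    total = math.log10(a) + math.log10(b)

    # The scales only show the digits (the fractional part of the logarithm).
    exponent = math.floor(total)
    mantissa = 10 ** (total - exponent)

    # You can only read about three significant figures off the scale.
    mantissa = round(mantissa, sig_figs - 1)
    if mantissa >= 10:  # rounding can push 9.999... up to 10.0
        mantissa /= 10
        exponent += 1

    # On a real slide rule the user tracks the power of ten mentally.
    return mantissa, exponent

digits, power = slide_rule_multiply(3.14, 271)
print(f"{digits} x 10^{power}")
```

For 3.14 × 271 the sketch reports 8.51 × 10^2, while the full product is 850.94. The extra digits are exactly the ones a slide rule user could not see.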

I’m not saying that being able to use a slide rule is a valuable skill. It’s not anymore. But the process of using a slide rule for a little while might teach some skills that are valuable. It would be fine if they forgot how to use a slide rule but retained an intuition for logarithms, orders of magnitude, and significant digits.

I’d recommend using a slide rule in high school for the same reason as using an abacus in elementary school: because it’s tangible, not because it’s practical.

Atomic skills versus molecular skills

Posted on 9 April 2011 by John

Scott Adams has an essay in the Wall Street Journal today entitled How to Get a Real Education. He starts by saying the brightest students should get an academic education and the rest should learn entrepreneurship. I disagree. I don’t see why the choice between a traditional academic education and an education emphasizing entrepreneurship should depend on IQ. I also don’t see why there should be a sharp division between the two. Future professors would do well to learn entrepreneurship and future business owners would do well to learn math and history.

But I want to talk here about what I do agree with Scott Adams on. Here’s my favorite part of his essay.

Combine Skills. The first thing you should learn in a course on entrepreneurship is how to make yourself valuable. It’s unlikely that any average student can develop a world-class skill in one particular area. But it’s easy to learn how to do several different things fairly well. I succeeded as a cartoonist with negligible art talent, some basic writing skills, an ordinary sense of humor and a bit of experience in the business world. The “Dilbert” comic is a combination of all four skills. The world has plenty of better artists, smarter writers, funnier humorists and more experienced business people. The rare part is that each of those modest skills is collected in one person. That’s how value is created.

Academia trains people to think in terms of departments. Achievement is measured in ways that fit into a course catalog: chemistry, French, art, math, history, etc. Those who do the best at the academic game have the hardest time shaking these categories. Someone like Scott Adams could berate himself for not excelling as an artist or a writer. But rather than focusing on these atomic skills, he prides himself on how he combines these skills to do something few could do.

When Adams talks about combining skills, I don’t believe he’s talking about the myth of the Renaissance man. The Renaissance ideal is to be great at several atomic skills, each practiced in isolation. Adams is talking about combining skills that may not be remarkable individually and doing something remarkable.

Words that are primes base 36

Posted on 9 April 2011 by John

This morning on Twitter, Alexander Bogomolny posted a link to his article that gives examples of words that are prime numbers when interpreted as numbers in base 36. Some examples are “Brooklyn”, “paleontologist”, and “deodorant.” (Numbers in base 36 are written using 0, 1, 2, …, 9, A, B, C, …, Z as “digits.”)
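Here is a minimal Python sketch of what “prime when interpreted in base 36” means. The function names are made up, and the trial-division test is only suitable for short words; a long word like “paleontologist” would need a faster primality test.

```python
def word_to_int(word):
    """Interpret a word as a base 36 number (digits 0-9, then A-Z as 10-35)."""
    # Python's int() accepts bases up to 36 and ignores letter case.
    return int(word, 36)

def is_prime(n):
    """Trial division; adequate for short words, too slow for long ones."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

for word in ["it", "be"]:
    value = word_to_int(word)
    print(word, value, is_prime(value))
```

For example, “it” is 18×36 + 29 = 677, which is prime, while “be” is 11×36 + 14 = 410, which is not.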

Tim Hopper replied with a snippet of Mathematica code that lists all words with up to four letters that correspond to base 36 primes.

Rest[ Flatten[ Union[
    DictionaryLookup /@ IntegerString[
        Table[Prime[n], {n, 1, 300000}], 36]]]]

That made me wonder whether you could estimate how many such words there are without doing an exhaustive search.

The Prime Number Theorem says that the probability of a number less than N being prime is approximately 1/log(N), where log is the natural logarithm. If we knew how many English words there were of a certain length, then we could guess that 1/log(N) of those words would be prime when interpreted as base 36 numbers. This assumes that forming an English word and being prime have independent probabilities, which may be approximately true.

How well would our guess have worked on Tim’s example? He prints out all the words corresponding to the first 300,000 primes. The last of these primes is 4,256,233. The exact probability that a number less than that upper limit is prime is then

300,000 / 4,256,233 ≈ 0.07.

There are about 4200 English words with four or fewer letters. (I found this out by running

grep -ciE '^[a-z]{1,4}$'

on the words file on a Linux box. See similar tricks here.) If we estimate that 7% of these are prime, we’d expect 294 words from Tim’s program. His program produces 275 words, so our prediction is pretty good.

If we didn’t know the exact probability of a number in our range being prime, we could have estimated the probability at

1/log(4,256,233) ≈ 0.0655

using the Prime Number Theorem. Using this approximation we’d estimate 4200*0.0655 = 275.1 words; our estimate would be exactly correct! There’s good reason to believe our estimate would be reasonably close, but we got lucky to get this close.
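The arithmetic behind both estimates is easy to reproduce. The sketch below uses only the figures quoted above (300,000 primes, largest prime 4,256,233, about 4200 words); the variable names are mine.

```python
import math

n_primes = 300_000         # number of primes Tim's program examines
largest_prime = 4_256_233  # the last of those primes, as stated above
n_words = 4200             # approximate count of words with four or fewer letters

# Exact fraction of numbers below the limit that are prime.
exact_density = n_primes / largest_prime

# Prime Number Theorem approximation, using the natural logarithm.
pnt_density = 1 / math.log(largest_prime)

print(f"exact density {exact_density:.4f} -> about {n_words * exact_density:.0f} words")
print(f"PNT density   {pnt_density:.4f} -> about {n_words * pnt_density:.0f} words")
```

Using the unrounded density rather than 7% nudges the first estimate from 294 up to about 296; the Prime Number Theorem estimate still comes out at about 275.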
