A little Awk

Greg Grothaus posted an article today entitled Why you should know just a little Awk. The article recommends taking a few minutes to learn only the most basic parts of the Awk language. I find this interesting for several reasons.

First, it is impressive what you can accomplish with just a few keystrokes in Awk. The language was designed for file munging and it does this very well. Many people, myself included, think of Perl as a language for file munging. And so it is, but I remember reading something from Larry Wall, creator of Perl, saying that he uses Awk for some tasks.

Second, Grothaus isn’t encouraging people to master the language. He’s saying to just learn a handful of features, at least to start. That goes against my grain. When I learn a language, I want to learn it thoroughly. On the other hand, I don’t have the time or energy lately to learn a new language on top of everything else I have going on. But I think Grothaus has a good point: if you just take a few minutes to learn only how to do several very specific tasks, it could be worth it.

Finally, I found it interesting to read a blog post about a language I haven’t touched in well over a decade. I used Awk in grad school for a little while, and was quite impressed with it. But someone suggested that Perl was similar but even better and I dropped Awk for Perl. Looking back I’d say Perl is more general than Awk, but not necessarily better.  Awk is quite good at the kinds of tasks it was designed for.

I’ve been trying to consolidate the list of programming languages I use after reaching programming language fatigue. Adding yet another language to the list of languages I haven’t mastered but use occasionally would not be progress. But Grothaus’ article tempts me to look at Awk again, not with the intention of mastering it but rather to learn how to do just a small number of things it does remarkably well.

For daily tips on using Unix, follow @UnixToolTip on Twitter.

Correction re MinGW

Last night I posted an article about a problem caused by the MinGW installer. A little later I retracted the article because I believe I was wrong. Those who visit my blog directly will not see the post, but those who subscribe via RSS still do. I apologize to subscribers and to MinGW for the error.

Apparently deleting a post in WordPress does not delete the corresponding Feedburner post. I am surprised since updating a post in WordPress will update the Feedburner post a few minutes later. I suppose I should have updated the post with a retraction notice rather than deleting it.

Software structural engineers

Billy Hollis made an interesting point in his interview on .NET Rocks. He argues that “structural engineer” is a better analogy than “architect” for the role of “software architects.”

Structural engineers make sure a building can withstand the stresses it will be subjected to. They do not design buildings, though they work closely with the architects who do the design. Hollis says that most software projects do not have an “architect” who is responsible for the external design of the project. Instead they have structural engineers who focus on infrastructure. This is a very important role, but calling these folks “architects” may obscure the lack of someone playing a role analogous to the architect of a construction project.

For a daily dose of computer science and related topics, follow @CompSciFact on Twitter.

Diagram of Bessel function relationships

Bessel functions are interesting and useful, but they’re surrounded by arcane terminology. It may seem that there are endless varieties of Bessel functions, but there are not that many variations and they are simply related to each other.

Each letter in the diagram is taken from the traditional notation for a class of Bessel functions. Functions in the same row satisfy the same differential equation. Functions in the same column have something in common in their naming.

The lines represent simple relations between functions. There’s one line conspicuously missing from the diagram, and that’s no accident.

These notes give the details behind the diagram.

Related post: Visualizing Bessel functions

For daily posts on analysis, follow @AnalysisFact on Twitter.

Daily tip Twitter account FAQ

This post answers some frequently asked questions regarding my daily tip accounts on Twitter.

How many followers do you have?

About 2800 people are following at least one of these accounts at the time of writing, each following between 2 and 3 accounts on average for a total of about 5900 follows combining all accounts.

How do you schedule your posts?

I schedule the tips a week to a month in advance using HootSuite. Each account posts at the same time each weekday, starting with SansMouse at 9:30 AM and ending with AlgebraFact at 1:30 PM (US Central Time). Once in a while a tip will go out late due to a Twitter API failure. I occasionally sprinkle in a few unscheduled tweets but I keep the volume low.

What if I don’t use Twitter?

You can subscribe to a Twitter feed via RSS just as you would a blog. For example, you could follow Twitter accounts via Google Reader.

How advanced are these tips?

The SansMouse and RegexTip accounts are in a cycle that starts with the most familiar tips and then moves on to less familiar ones. I mix elementary and advanced material in the mathematical accounts, though there’s a greater range in some accounts. If a tip is more elementary or more advanced than you’d like, you may find something more to your liking in a day or two.

Do you take suggestions? Questions?

I welcome corrections as well as suggestions for new tips. General suggestions are helpful, but I especially appreciate specific tweets.

I like to answer questions when I can, though I can’t respond to every question.

Any plans for new accounts?

I have a couple ideas I’m considering. If you have a suggestion for another daily tip account, let me know.

What about questions about specific accounts?

The following isn’t Q&A format, but it answers some common questions.

SansMouse is my oldest daily tip account. It gives one Windows keyboard shortcut each day. Most tips apply across all applications, but some tips are specific to popular software packages. Ben Jaffe has an analogous Twitter account commandtab for Mac OS.

RegexTip is currently the second most popular daily tip account with 1025 followers at the time of writing. This account gives tips for writing regular expressions as well as tips for how regular expressions are used in different environments. I mostly stick to the regular expression syntax supported by Perl 5, Microsoft’s .NET languages, JavaScript, etc.

TeXtip gives tips for typesetting in TeX and LaTeX. Topics include TeX commands, software for working with LaTeX, tips on typography, etc. I basically stick to LaTeX, though much applies to plain TeX. Also, I don’t say much about add-on packages but stick to the heart of LaTeX.

ProbFact is currently the most popular daily tip account with 1100 followers. This account gives one fact per day from probability. Probability and statistics are intimately related, but this account primarily sticks to probability proper, though some posts are more statistical. People have suggested I start a statistical counterpart to ProbFact, but I have no plans to do so.

AlgebraFact gives facts from linear algebra, number theory, group theory, etc. A few of the facts are advanced but most are not. I’ve gotten a few complaints for including number theory, but that’s been part of the charter from the beginning. The name AlgebraAndNumberTheoryFact would be more accurate, but too long.

TopologyFact gives theorems from topology (point-set topology, algebraic topology, etc.) and geometry. Point-set topology has a lot of theorems that can be condensed into 140 characters but I find other areas harder to tweet about. I could use some help here. I’d like to include more geometry: Euclidean geometry, differential geometry, etc. But I don’t want to include material so esoteric that not many will understand it.

AnalysisFact is the most advanced account on average. Some of the posts are elementary, but some are fairly advanced. Topics include real and complex analysis, functional analysis, special functions, differential equations, etc.

I’m doing a drawing to give away either a coffee mug or T-shirt to someone who mentions these tips on Twitter. Tomorrow is the last day to enter.

I also have a personal Twitter account, JohnDCook. I use this account to post links, interact with friends, etc. I try to keep the signal to noise ratio fairly high, though not as high as the tip accounts.

Micro distractions

Why are long articles easier to read on paper than on a screen? The explanations I’ve heard most often involve resolution or other properties of screens. But the culprit may not be the screen per se. It may be links, notifications, and other distractions.

Obviously if you follow a link you won’t finish reading your original article as quickly (or possibly ever). But even when you don’t follow any links, you have to decide not to follow each link. These decisions are not as obvious a distraction as, say, construction noise or flickering lights, but they are still distractions and they take a toll. That is the explanation Nicholas Carr gives in his new book The Shallows. (Sorry for the distraction.)

Paper books don’t offer readers many options, and that may be their strength. If you’re aware of things you could do to interact with an e-reader, you have to decide whether to take these actions. E-readers are expected to get better screen technology as well as ads in the near future. The ads may harm reading efficiency more than increased screen resolution will help.

Variations on factorial!

If you’ve heard of factorial, have you heard of double factorial or subfactorial?

Double factorial is written n!!. The factorial of a positive integer n is the product of all positive integers less than or equal to n. The double factorial of n is the product of all positive integers less than or equal to n that have the same parity as n. That is, for an odd number n, the product defining n!! includes only odd integers and for an even integer n, the product defining n!! includes only even integers. For example, 7!! = 7 × 5 × 3 × 1 and 8!! = 8 × 6 × 4 × 2. By definition, 0!! and (-1)!! equal 1.
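As a rough illustration, the definition above can be sketched in a few lines of Python. The function name `double_factorial` is my own choice for this sketch, not a standard library function.

```python
def double_factorial(n: int) -> int:
    """Product of the positive integers <= n with the same parity as n.

    The empty product covers the conventions 0!! = 1 and (-1)!! = 1.
    """
    if n < -1:
        raise ValueError("this sketch only handles n >= -1")
    result = 1
    while n > 0:
        result *= n
        n -= 2  # step down to the next integer of the same parity
    return result


print(double_factorial(7))  # 7 * 5 * 3 * 1
print(double_factorial(8))  # 8 * 6 * 4 * 2
```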

Double factorials often arise in integrals and power series and make it possible to state equations succinctly that would be verbose otherwise. For example,

\int_0^{\pi/2} \sin^{2n+1} \theta \, d\theta = \frac{(2n)!!}{(2n+1)!!}
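As a quick sanity check of the identity, here is the case n = 1 worked by hand, using 2!! = 2 and 3!! = 3 × 1 = 3.

```latex
\int_0^{\pi/2} \sin^{3} \theta \, d\theta
  = \int_0^{\pi/2} (1 - \cos^2 \theta) \sin \theta \, d\theta
  = \left[ -\cos \theta + \frac{\cos^3 \theta}{3} \right]_0^{\pi/2}
  = 0 - \left( -1 + \frac{1}{3} \right)
  = \frac{2}{3}
  = \frac{2!!}{3!!}
```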

It’s possible to define higher factorials or multifactorials. For instance n!!!, the triple factorial of n, is the product of positive integers less than or equal to n and congruent to n mod 3. So, for example, 8!!! = 8 × 5 × 2.
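The same idea extends to any step size. Here is a minimal sketch in Python; the name `multifactorial` and the parameter `k` (the number of exclamation points) are assumptions made for illustration.

```python
def multifactorial(n: int, k: int) -> int:
    """Product of the positive integers <= n that are congruent to n mod k.

    k = 1 gives the ordinary factorial, k = 2 the double factorial,
    k = 3 the triple factorial, and so on.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if n < 0:
        raise ValueError("this sketch only handles n >= 0")
    result = 1
    while n > 0:
        result *= n
        n -= k  # stepping by k keeps n in the same residue class mod k
    return result


print(multifactorial(8, 3))  # 8 * 5 * 2
```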

Factorials count the number of ways a set can be arranged. A set with n distinct elements can be arranged in n! ways. The number of arrangements that move every element from its original position is the subfactorial of n. Sometimes subfactorial is written with the exclamation point in front of its argument and sometimes it is written with an inverted exclamation point following its argument, i.e.

!n = n\mbox{!`}

(By the way, the inverted exclamation mark, used to mark the beginning of an exclamatory sentence in Spanish, is Unicode character U+00A1. You can produce it in HTML with &iexcl;. In TeX, you can produce it with !` outside of math mode and \mbox{!`} in math mode.)

Subfactorial can be computed from the factorial by

n\mbox{!`} = \left\lfloor \frac{n!}{e} + \frac{1}{2} \right\rfloor

for positive n where ⌊x⌋ is the greatest integer less than or equal to x. The subfactorial of 0 is defined to be 1.
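To see the rounding formula in action, the sketch below compares it with a direct count of the arrangements that move every element. The function names are made up for this example. The formula is evaluated in floating point, so it should only be trusted for modest n.

```python
import math
from itertools import permutations


def subfactorial_formula(n: int) -> int:
    """Subfactorial via floor(n!/e + 1/2); 0 is handled as a special case."""
    if n == 0:
        return 1
    return math.floor(math.factorial(n) / math.e + 0.5)


def subfactorial_brute_force(n: int) -> int:
    """Count permutations of range(n) that leave no element in place."""
    return sum(
        1
        for p in permutations(range(n))
        if all(p[i] != i for i in range(n))
    )


for n in range(8):
    print(n, subfactorial_formula(n), subfactorial_brute_force(n))
```

The brute-force count grows like n!, so it is only practical for checking small cases.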

For daily posts on probability, follow @ProbFact on Twitter.

Statistical dead end

I get suspicious when I hear people ask about third and fourth moments (skewness and kurtosis). I’ve heard these terms far more often from people who don’t understand statistics than from people who do.

There are two common errors people often have in mind when they bring up skewness and kurtosis.

First, they implicitly believe that distributions can be boiled down to three or four numbers. Maybe they had an elementary statistics course in which everything boiled down to two moments — mean and variance — and they suspect that’s not enough, that advanced statistics extends elementary statistics by looking at third or fourth moments. “There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.” The path forward is not considering higher and higher moments.

This leads to a second and closely related problem. Interest in third and fourth moments sounds like hearkening back to the moment-matching approach to statistics. Moment matching was a simple idea for estimating distribution parameters:

  1. Set population means equal to sample means.
  2. Set population variances equal to sample variances.
  3. Solve the resulting equations for distribution parameters.
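The three steps above can be sketched for a concrete case. The example below assumes a gamma distribution with shape k and scale θ, for which the mean is kθ and the variance is kθ². The choice of distribution, the function name, and the sample data are all assumptions made for illustration.

```python
import statistics


def gamma_moment_match(data):
    """Estimate gamma (shape, scale) by matching mean and variance.

    Setting k*theta = sample mean and k*theta**2 = sample variance
    and solving gives theta = variance / mean and k = mean**2 / variance.
    """
    mean = statistics.mean(data)
    var = statistics.variance(data)  # sample variance (n - 1 denominator)
    scale = var / mean
    shape = mean**2 / var
    return shape, scale


# Hypothetical data, made up for this sketch
sample = [1.2, 0.7, 3.1, 2.4, 1.9, 0.5, 2.8]
shape, scale = gamma_moment_match(sample)
print(f"shape = {shape:.3f}, scale = {scale:.3f}")
```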

There’s more to moment matching than that, but that’s enough for this discussion. It’s a very natural approach, which is probably why it still persists. But it’s also a statistical dead end.

Moment matching is the most convenient approach to finding estimators in some cases. However, there is another approach to statistics that has largely replaced moment matching, and that’s maximum likelihood estimation: find the parameters that make the data most likely.

Both moment matching and maximum likelihood are intuitively appealing ideas. Sometimes they lead to the same conclusions but often they do not. They competed decades ago and maximum likelihood won. One reason is that maximum likelihood estimators have better theoretical properties. Another reason is that maximum likelihood estimation provides a unified approach that isn’t thwarted by difficulties in solving algebraic equations.
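A standard textbook case where the two approaches disagree is a uniform distribution on (0, θ) with θ unknown. The mean is θ/2, so moment matching estimates θ as twice the sample mean, while the maximum likelihood estimate is the largest observation. The sketch below uses made-up data and hypothetical function names.

```python
def uniform_moment_estimate(data):
    """Moment matching: the mean of Uniform(0, theta) is theta / 2."""
    return 2 * sum(data) / len(data)


def uniform_mle(data):
    """Maximum likelihood: the smallest theta consistent with every point."""
    return max(data)


# Hypothetical data, made up for this sketch
sample = [1, 2, 3, 10]
print(uniform_moment_estimate(sample))  # twice the sample mean
print(uniform_mle(sample))              # the largest observation
```

With this sample the moment estimate is smaller than the largest data point, so it describes a distribution that could not have produced the data. The maximum likelihood estimate cannot fail in that way.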

There are good reasons to be concerned about higher moments (including fractional moments) though these are primarily theoretical. For example, higher moments are useful in quantifying the error in the central limit theorem. But there are not a lot of elementary applications of higher moments in contemporary statistics.

Diagram of gamma function identities

The gamma function has a large number of identities relating its values at one point to values at other points. By focusing on just the function arguments and not the details of the relationships, a simple pattern emerges. Most of the identities can be derived from just four fundamental identities:

  • Conjugation
  • Addition
  • Reflection
  • Multiplication

The same holds for functions derived from the gamma function: log gamma, digamma, trigamma, etc.

For the details of the relationships, see Identities for gamma and related functions.

For daily posts on analysis, follow @AnalysisFact on Twitter.

Business literature

Lately I’ve read several writers critical of popular business books. One oft-repeated criticism is that some of the companies featured in Good to Great aren’t doing so great and therefore the book was wrong.

I’ve never looked at business books this way. I see them as literature. They have stories that may provoke your thinking, but they’re not providing scientific laws. I wouldn’t say Good to Great was “wrong” any more than I’d say To Kill a Mockingbird was “wrong.”

The difference, of course, is that novels don’t aspire to an aura of scientific certainty while business books do. Business books often presume to offer universal laws when they only offer anecdotes. That doesn’t mean these books are not valuable. Anecdotes can be quite valuable. However, the value of an anecdote lies not in what it literally conveys but in the thoughts it stirs in your mind.

Business authors sometimes analyze reams of data. I wish they wouldn’t bother. I prefer business writers who don’t pretend to be scientific. “Here’s what I think. Here’s a story that illustrates my point. Your mileage may vary.”

If someone does produce a high-quality study of some class of companies at some point in time, the study is still an anecdote to a reader in different circumstances. A statistically rigorous study of Fortune 500 companies is not directly applicable to someone running a taco stand. It may not even be directly applicable to someone running a Fortune 500 company a few years later.

The taco stand owner may get as much insight from someone’s memoir of running a single large company as from a rigorous study of hundreds of large companies. (He may also get valuable insight from To Kill a Mockingbird.)

P.S. Although I’m saying business books are like literature, I must add that I hate business parables. The ones I’ve read are just terrible. No one would ever read one of these books for its literary merit, and when you strip away the campy prose there isn’t much content left.