How to avoid shell scripting

Suppose you know a scripting language (Perl, Python, Ruby, etc.) and you’d rather not learn shell scripting (bash, PowerShell, batch, etc.). Or maybe you know shell scripting on one platform and don’t want to take the time right now to learn shell scripting on another platform. For example, maybe you know bash on Linux but don’t want to learn PowerShell on Windows, or vice versa.

One strategy would be to use your preferred language to generate shell scripts. Shell scripts are trivial when they’re just a list of commands: do this, do this, do this, etc. Where shell scripting gets more complicated is when you have variables, branching logic, library calls, all the stuff you already know how to do in another language. Maybe you could do all the complicated logic in your “native language” and just generate a shell script that’s simply a list of instructions with no other logic.
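
For example, here is a minimal sketch of the idea in Python. The file names, paths, and commands are made up for illustration; the point is that all the branching happens in Python, and the generated bash script is nothing but a flat list of commands.

    import glob

    # Decide what to do in Python, then emit a script that is
    # just a list of commands with no variables or branching.
    with open("backup.sh", "w") as script:
        script.write("#!/bin/bash\n")
        for f in glob.glob("*.csv"):      # hypothetical input files
            if "temp" not in f:           # conditional logic stays in Python
                script.write(f"cp {f} /backup/\n")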

Another strategy is to make system calls from your preferred language. Most scripting languages have a system() function that takes a string and executes it as a system command. The advantage of this approach is that your code can include conditional logic that the code generation approach could not handle. The disadvantage is that you have to sort out what process the system() call runs under, etc.
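
In Python, for example, you could use os.system, or the more flexible subprocess module. Here is a small sketch; the command itself is arbitrary.

    import subprocess

    # Run a command, capture its output as text, and raise an
    # exception if the command fails.
    result = subprocess.run(["ls", "-l"], capture_output=True, text=True, check=True)
    print(result.stdout)

Passing the command as a list of arguments, as above, avoids having to worry about how a shell would parse a single command string.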

Maybe you want to learn shell scripting, but you need to get work done now that you don’t yet know how to do. One of these strategies could buy you some time. You might transition, for example, from Python to PowerShell by generating more sophisticated shell scripts over time and writing simpler generator code until you just write scripts directly.

Calculating entropy

For a set of positive probabilities p_i summing to 1, their entropy is defined as

- \sum p_i \log p_i

(For this post, log will mean log base 2, not natural log.)

This post looks at a couple questions about computing entropy. First, are there any numerical problems computing entropy directly from the equation above?

Second, imagine you don’t have the p_i values directly but rather counts n_i that sum to N. Then p_i = n_i/N. To apply the equation directly, you’d first need to compute N, then go back through the data again to compute entropy. If you have a large amount of data, could you compute the entropy in one pass?

To address the second question, note that

- \sum \frac{n_i}{N} \log \frac{n_i}{N} = \log N -\frac{1}{N} \sum n_i \log n_i

so you can sum n_i and n_i log n_i in the same pass.
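
Here is a sketch of the one-pass calculation in Python, using made-up counts for illustration and checking the result against the direct two-pass formula.

    from math import log2

    counts = [5, 12, 7, 1, 25]          # hypothetical count data

    N = 0                               # running total of the counts
    S = 0.0                             # running sum of n_i log n_i
    for n in counts:
        N += n
        S += n * log2(n)

    one_pass = log2(N) - S / N

    # Direct formula for comparison, requiring a second pass
    direct = -sum((n / N) * log2(n / N) for n in counts)
    print(one_pass, direct)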

One of the things you learn in numerical analysis is to look carefully at subtractions. Subtracting two nearly equal numbers can result in a loss of precision. Could the numbers above be nearly equal? Maybe if the n_i are ridiculously large. Not just astronomically large — astronomically large numbers like the number of particles in the universe are fine — but ridiculously large, numbers whose logarithms approach the limits of machine-representable numbers. (If we’re only talking about numbers as big as the number of particles in the universe, their logs will be at most three-digit numbers.)

Now to the problem of computing the sum of n_i log n_i. Could the order of the terms matter? This also applies to the first question of the post if we look at summing the p_i log p_i. In general, you’ll get better accuracy summing a lot of positive numbers by sorting them and adding from smallest to largest, and worse accuracy by summing largest to smallest. If summing a sorted list gives essentially the same result in either direction, then summing the list in any other order should give essentially the same result too.
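
If you want to experiment with summation order, here is a quick sketch in Python. The terms are made up, but they span several orders of magnitude; math.fsum gives a correctly rounded sum to compare against.

    import math

    terms = [1.0 / k**2 for k in range(1, 1_000_001)]    # hypothetical data

    ascending  = sum(sorted(terms))                       # smallest to largest
    descending = sum(sorted(terms, reverse=True))         # largest to smallest
    reference  = math.fsum(terms)                         # correctly rounded sum

    print(ascending - reference, descending - reference)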

To test the methods discussed here, I used two sets of count data, one on the order of a million counts and the other on the order of a billion counts. Both data sets had approximately a power law distribution, with counts varying over seven or eight orders of magnitude. For each data set I computed the entropy four ways: two equations times two orders. That is, I either converted the counts to probabilities or used the counts directly, and I summed either smallest to largest or largest to smallest.

For the smaller data set, all four methods produced the same answer to nine significant figures. For the larger data set, all four methods produced the same answer to seven significant figures. So at least for the kind of data I’m looking at, it doesn’t matter how you calculate entropy, and you might as well use the one-pass algorithm to get the result faster.

Related: Applied information theory


New Twitter account for networks

As I mentioned a few days ago, I’m winding down three of my Twitter accounts.

But I have started a new one: NetworkFact. This account has tweets about graph theory, analysis of large networks, etc.

Related links:

List of daily tip accounts

You can create an RSS feed for a Twitter account at RSS4Twitter, though they’re so overloaded that it might not work. You could also use BazQux RSS reader or try some of the alternatives mentioned in the comments here.

Just because we can

From J. Robert Oppenheimer, leader of the Manhattan Project:

When you see something that is technically sweet, you go ahead and do it and argue about what to do about it only after you’ve had your technical success. That is the way it was with the atomic bomb.


Hilbert space methods for PDE

When I was in grad school, my advisor asked me to study his out-of-print book, Hilbert Space Methods in Partial Differential Equations. I believe I had a photocopy of a photocopy; I don’t recall ever seeing the original book. I pored over that stack of copies line by line while preparing for my qualifying exams.

Then this evening I was browsing a used book store and was shocked to find a copy of the book, a Dover reprint (ISBN 0486474437).

It was an odd feeling to find what was once a precious and mysterious book available for $5.99 as part of a rag-tag assortment of mostly elementary/popular used math books.

More on PDEs

Giant components and the Lambert W-function

The Lambert W-function is the function w(z) implicitly defined by w exp(w) = z. When I first saw this definition, I thought I’d seen something like it come up many times and that it would be really handy to know that this function has a name and many software implementations [1]. Since then, however, I’ve hardly ever run into the W-function in applications. But here’s an application I ran into recently.

A “giant component” in a random network is a component whose size grows in proportion to the size of the network. For a Poisson random network on n vertices with edge probability p, let c = p(n – 1) be the expected number of vertices connected to any given vertex. If c > 1, then as n goes to infinity, there will be a giant component with probability 1. S, the proportion of nodes inside the giant component, satisfies the equation

S = 1 – exp(-cS).

S plays a role similar to that of w in the definition of the W-function, which suggests the W-function might come in handy. And in fact, this book gives the following solution for S:

S = 1 + w( –c exp(-c) )/c.

There are a couple of issues. First, it’s not obvious that this solution is correct. Second, we need to be more careful about how we define w(z) when z is negative.

Given some value of c, let S = 1 + w( –c exp(-c) )/c. Then

w( –c exp(-c) ) = –c(1 – S).

Apply the function f(x) = x exp( x ) to both sides of the equation above. By the definition of the W-function, we have

–c exp(-c) = –c(1 – S) exp( –c(1 – S) ).

Dividing both sides by –c exp(-c) gives 1 = (1 – S) exp(cS), and from there it easily follows that

S = 1 – exp(-cS).

Now let’s look more carefully at the definition of the W-function. For positive real z, w exp(w) = z has a unique real solution defining w. But in the calculations above, we have evaluated w at –c exp(-c), which is negative since c > 1. For negative arguments, we have to pick a branch of w. Which branch guarantees that S = 1 – exp(-cS)? Any branch [2]. No matter which solution to w exp(w) = z we pick, the resulting S satisfies the giant component equation. However, the wrong branch might result in a meaningless solution. Since S is a proportion, S should be between 0 and 1.

Let’s go back and say that for our purposes, we will take w(z) to mean the real number no less than -1 satisfying w exp(w) = z. Then for c > 1, the solution S = 1 + w( –c exp(-c) )/c is between 0 and 1. You can show that S = 1 – exp(-cS) has a unique solution for 0 < S < 1, so any other branch would not give a solution in this interval.
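
Here is a quick numerical check of the solution in Python using scipy.special.lambertw. The value of c is arbitrary, as long as c > 1.

    import numpy as np
    from scipy.special import lambertw

    c = 2.5
    # Principal branch (k = 0), which is real and no less than -1 on [-1/e, 0)
    S = 1 + lambertw(-c * np.exp(-c), 0).real / c

    print(S)                    # proportion of nodes in the giant component
    print(1 - np.exp(-c * S))   # should agree with S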

Related post: Measuring graph connectivity with eigenvalues

[1] For example, in Python the Lambert W-function is scipy.special.lambertw. The function takes two arguments: z and an integer denoting the branch. By default, the branch argument is set to 0, giving the branch discussed in this post.

[2] For -1/e < z < 0, there are two real branches of w(z), but there are also infinitely many complex branches.

The Drug Book

There’s a new book out in the series that began with The Math Book. The latest in the series is The Drug Book: From Arsenic to Xanax, 250 Milestones in the History of Drugs (ISBN 1402782640).

Like all the books in the series, The Drug Book is a collection of alternating one-page articles and full-page color photographs, arranged chronologically. These books make great coffee table books because they’re colorful and easy to dip in and out of. The other books in the series are The Space Book, The Physics Book, and The Medical Book.

The book’s definition of “drug” is a little broad. In addition to medicines, it also includes related chemicals such as recreational drugs and poisons. It also includes articles on drug-related reference works and legislation.

Twitter account turnover

I’m planning to wind down three of my Twitter accounts: ShortcutKeyTip, MusicTheoryTip, and DSP_fact. When the scheduled tweets for these three accounts run out, I won’t post new ones.

On the other hand, I may start a new account. I have a topic in mind, but I don’t know how hard it’ll be to say interesting things about it in 140 characters. If I start a new account I’ll announce it here.

Right now I have 15 accounts. If I close three, I’ll still have a dozen, a baker’s dozen if I add a new account.

Update (August 11, 2013): I decided to start the new account I alluded to in this post: NetworkFact, devoted to networks, graphs, and related topics.