A surprise with Emacs and Office 2007

I had a little surprise when I tried to open an Excel file from Emacs. I was using dired, a sort of file explorer inside Emacs. I expected one of two things to happen. Maybe Emacs would know to launch the file using Excel. Or maybe it would open the binary file as a bunch of gibberish. Instead I got something like this:

[screenshot: dired displaying the contents of the .xlsx file as a zip archive]

This was disorienting at first. Then I thought about how Office 2007 documents are zipped XML files. But how does dired know that my .xlsx file is a zip file? I suppose that since Emacs is a Unix application at heart, it acts like a Unix application even when running on Windows: it determines the file type by inspecting the file itself and ignores the file extension.
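Here's a minimal sketch in C++ of that kind of file sniffing. A zip archive, and hence an .xlsx file, begins with the magic bytes PK\x03\x04; the file name below is just a hypothetical placeholder.

    // Sketch: identify a zip archive by its magic bytes, ignoring the extension.
    // Zip files (including .xlsx) start with the signature "PK\x03\x04".
    #include <array>
    #include <fstream>
    #include <iostream>

    int main() {
        std::ifstream file("example.xlsx", std::ios::binary);  // hypothetical file
        std::array<char, 4> magic{};
        file.read(magic.data(), magic.size());

        bool isZip = file && magic[0] == 'P' && magic[1] == 'K'
                          && magic[2] == '\x03' && magic[3] == '\x04';
        std::cout << (isZip ? "zip archive (perhaps an Office 2007 document)\n"
                            : "not a zip archive\n");
    }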

(Office 2007 documents are not entirely XML. The data and styling directives are XML, but embedded files are just files. The spreadsheet in this example contains a large photo. That photo is a JPEG file that gets zipped up with all the XML files that make up the spreadsheet.)

So I learned that Emacs knows how to navigate inside zip files, and that a convenient way to poke around inside an Office 2007 file is to browse into it from Emacs.

Here’s another post that discusses Emacs and Office 2007, contrasting their interface designs: Clutter-discoverability trade-off

Impure math

When Samuel Hansen said in his interview “You’re not a pure mathematician,” I agreed without thinking, but later the statement bothered me a little. I know what he meant: given the two categories, pure math and applied math, I belong in the latter. Which is true.

But the term “pure” math can be misleading, as if everyone else does impure math. Applied math is not an alternative to theoretical math: applied mathematicians prove theorems too. We work on applications in addition to doing what is expected of pure mathematicians. The difference between pure and applied math is motivation, not content. Applied math is motivated by direct application to non-mathematical problems; pure math seeks to advance math for its own sake. Both are important.

Statistics uses the terms “theoretical” and “applied” rather than “pure” and “applied.” Math doesn’t use “theoretical” as the antithesis of “applied” because applied math is theoretical. But unlike math, being “applied” in statistics does mean you’re often (too often?) excused from proving theorems. The first time I was a coauthor on a statistics paper, I was surprised to find out you could publish with just simulation results and no theorems. This happens in applied math as well, but not nearly as often as it does in applied statistics.

On the other hand, when I hear the term “applied statistics” I want to ask “Is there any other kind?” Statistics is applied (and theoretical!) though some statisticians work more directly on applications than others. As Andrew Gelman quips, the difference between theoretical and applied statisticians is that

The theoretical statistician uses x, the applied statistician uses y (because we reserve x for predictors).

I assume that statement wasn’t meant to be taken literally, but I agree with the sentiment that the distinction between theoretical and applied statistics can be exaggerated. I’d say the same applies to pure and applied math.

Manga guides to physics and the universe

I recently received review copies of the Manga Guides to physics and the universe. These made a better impression than the relativity guide that I reviewed earlier. The guide to physics has been out for a while. The guide to the universe comes out June 24.

The Manga Guide to Physics basically covers force, momentum, and energy. The pace is leisurely, and there’s not much back story; it cuts to the chase fairly quickly. This guide will not prepare you to solve physics problems, but it does give you a good overview of the basics.

(These books are not entirely manga; all three books I’ve seen in the series have several pages of more traditional textbook content.)

The Manga Guide to the Universe gives a tour of cosmology from the geocentric view to current theories. It contains some very recent material, such as references to the WMAP project.

This book is more rushed than the physics guide, which is to be expected given its ambitious scope. It also devotes a fair amount of space to back story, which adds to the rushed feeling.

I mentioned in my review of The Manga Guide to Relativity that although Americans associate cartoons with children, that book was not written for children. The physics guide, however, would be appropriate for a wide range of readers. Young readers may not fully appreciate the content, but they would not find anything offensive.

The Manga Guide to the Universe is inoffensive with one exception: there are a couple of provocative frames in the prologue that will keep the book off some school library shelves.

Why do C++ folks make things so complicated?

This morning Miroslav Bajtoš asked “Why do C++ folks make things so complicated?” in response to my article on regular expressions in C++. Other people asked similar questions yesterday.

My response has two parts:

  1. Why I believe C++ libraries are often complicated.
  2. Why I don’t think it has to be that way.

Why would someone be using C++ in the first place? Most likely because they need performance or fine-grained control that they cannot get elsewhere. A Ruby programmer, for example, can make a design decision that makes code 10% slower but much easier to use, and can answer objections with “Hey, if you want the best performance possible, why are you using Ruby? Didn’t you come here because you wanted convenience?” The C++ programmer can’t say that. It’s not turtles all the way down: often C++ is the last human-generated language in a technology stack before you hit metal.

This weekend The Register quoted Herb Sutter saying “The world is built on C++.” The article goes on to mention some of the foundational software written in C++.

Apple’s Mac OS X, Adobe Illustrator, Facebook, Google’s Chrome browser, the Apache MapReduce clustered data-processing architecture, Microsoft Windows 7 and Internet Explorer, Firefox, and MySQL—to name just a handful—are written in part or in their entirety with C++.

Certainly there is a lot of software implemented in higher-level languages, but those high-level languages are almost always implemented in C or C++. When there’s no lower-level language to appeal to, you have to offer a lot of options, even if 90% of users won’t need those options.

On the other hand, that doesn’t mean all C++ libraries have to be complicated. The argument above says that the lowest layers have to be complicated and they’re written in C++. But why couldn’t the next layer up also be written in C++?

Sometime in the ’90s I ran across an article called “Top C++.” I’ve tried unsuccessfully since then to find a copy of it. As I recall, the article proposed dividing C++ conceptually into two languages: Top C++ and Bottom C++. Explicit memory management, for example, would be in Bottom C++. Top C++ would be a higher-level language. You could, for example, tell a compiler that you intend to write Top C++ and it could warn you if you use features designated as Bottom C++.

Of course you could slip from Top C++ into Bottom C++ if you really needed to, and that’s the beauty of using one language for high-level and low-level code. Application code in C++ would use Top C++ features by default, but could deliberately drop down a level when necessary. A section of Bottom C++ could be marked with comments or compiler pragmas and justified in a code review. Instead of having to cross a language barrier, you simply cross a convention barrier.
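Since I haven’t been able to check the details against the original article, here is only a hypothetical sketch of what the convention might look like; the function names and marker comments are my own invention.

    #include <vector>

    // Top C++: RAII containers, no explicit memory management.
    std::vector<int> squares(int n) {
        std::vector<int> v;
        v.reserve(n);
        for (int i = 0; i < n; ++i)
            v.push_back(i * i);
        return v;
    }

    // BEGIN BOTTOM C++ -- manual memory management, justified in code review.
    int* squares_raw(int n) {
        int* buffer = new int[n];  // caller is responsible for delete[]
        for (int i = 0; i < n; ++i)
            buffer[i] = i * i;
        return buffer;
    }
    // END BOTTOM C++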

I thought this was a great idea. I’ve followed this approach throughout my career, writing high-level C++ on top of low-level C++ libraries. To some extent I put on my Bottom C++ hat when I’m writing library code and my Top C++ hat when I’m writing applications.

But most of the world has decided that’s not the way to go. If you’re writing C++, it might as well be Bottom C++. Instead of Top C++, write VB or some other higher-level language. There’s very little interest in high-level C++.

I imagine this decision has more to do with management than technology. It’s easier to have people either write C++ or not. If you’re writing C++, then use any language feature any time. I suppose this is a stable equilibrium and that the Top/Bottom C++ partition is not.


Have you saved a milliwatt today?

Research In Motion (RIM) is best known for making the BlackBerry. In the early days of the company, RIM focused on reducing the BlackBerry’s power consumption. The engineers put up a sign:

Have you saved a milliwatt today?

This was a specific, reasonable challenge. Instead of some nebulous exhortation to corporate greatness, the kind of thing worthy of a Dilbert cartoon, they asked engineers to reduce power consumption by a milliwatt.

What’s your equivalent of saving a milliwatt?

Related post: Don’t try to be God, try to be Shakespeare

Bundled versus unbundled version history

The other day I said to a colleague that an advantage of LaTeX over Microsoft Word is that it’s easy to version LaTeX files because they’re just plain text. My colleague had the opposite view. He said that LaTeX was impossible to version because its files are just plain text. How could we interpret the same facts so differently?

I was thinking about checking files in and out of a version control system. With a text file, the version control system can tell you exactly how two versions differ. But with something like a Word document, the system will give an unhelpful message like “binary files differ.”

My colleague was thinking about using the change tracking features of Microsoft Word. He’s accustomed to seeing documents in isolation, such as a file attachment in an email. In that setting, a plain text file has no version history, but a Word document may.

I assumed version information would be external to the document. He assumed the version information would be bundled with the document. My view is typical of software developers. His is typical of everyone else.

These two approaches are analogous to functional programming versus object oriented programming. Version control systems have a functional view of files. The versioning functionality is unbundled from the file content, in part because the content (typically source code files) could be used by many different applications. Word provides a sort of object oriented versioning system, bundling versioning functionality with the data.
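Here’s a toy sketch of the two views, purely my own illustration of the analogy rather than how any real system is implemented.

    #include <string>
    #include <vector>

    // Unbundled ("functional") view: the file is just content, and an external
    // tool such as a version control system compares versions.
    bool differ(const std::string& oldVersion, const std::string& newVersion) {
        return oldVersion != newVersion;  // a real tool reports line-level diffs
    }

    // Bundled ("object oriented") view: the document carries its own history,
    // the way Word's change tracking does.
    class TrackedDocument {
    public:
        void edit(const std::string& newContent) {
            history_.push_back(content_);
            content_ = newContent;
        }
        const std::vector<std::string>& history() const { return history_; }
    private:
        std::string content_;
        std::vector<std::string> history_;
    };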

As with functional versus object oriented programming, there’s no “right” way to solve this problem, only approaches that work better in different contexts. I much prefer using a version control system to track changes to my files, but that approach won’t fly with people who don’t share a common version control system or don’t use version control at all.


Simpler version of Stirling’s approximation

Here’s an approximation for n! similar to Stirling’s famous approximation.

n! \approx \left( \frac{n}{e} \right)^n

I ran into this reading A View from the Top by Alex Iosevich. It is less accurate than Stirling’s formula, but has three advantages.

  1. It contains the highest order terms of Stirling’s formula.
  2. It’s easy to derive.
  3. It’s easy to remember.

One way to teach Stirling’s formula would be to teach this one first as a warm-up.

I’ll show that the approximation above gives a lower bound on n!. It gives a good approximation to the extent that the inequalities don’t give up too much.

\begin{align*}
\log n! &= \sum_{k=1}^n \log k \\
&\geq \int_1^n \log x \, dx \\
&= n \log n - n + 1 \\
&> n \log n - n
\end{align*}

Exponentiating both sides shows

n! > \left( \frac{n}{e} \right)^n

The derivation only uses techniques commonly taught in second-semester calculus classes: using integrals to estimate sums and integration by parts.

Deriving Stirling’s approximation

n! \approx \sqrt{2\pi n} \left( \frac{n}{e} \right)^n

requires more work, but remembering the result is just a matter of adding an extra factor to the simpler approximation.

As I noted up front, the advantage of the approximation discussed here is simplicity, not accuracy. For a more sophisticated approximation, see Two useful asymptotic series, which shows how to compute the logarithm of the gamma function.
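As a quick sanity check (my addition, not part of the derivation above), here’s a small program comparing log n! with the simple bound and with Stirling’s formula, working on a log scale via lgamma.

    #include <cmath>
    #include <cstdio>

    int main() {
        const double pi = 3.14159265358979323846;
        for (int n : {5, 10, 20, 50}) {
            double exact    = std::lgamma(n + 1.0);  // log(n!)
            double simple   = n * std::log(n) - n;   // log((n/e)^n)
            double stirling = simple + 0.5 * std::log(2 * pi * n);
            std::printf("n = %2d  log n! = %9.4f  simple = %9.4f  Stirling = %9.4f\n",
                        n, exact, simple, stirling);
        }
    }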

Update:

The first comment asked whether you can tweak the derivation to get a simple upper bound on the factorial. Yes you can.

\begin{align*}
\log n! &= \sum_{k=1}^n \log k \\
&\leq \log n + \int_1^n \log x \, dx \\
&= (n+1) \log n - n + 1
\end{align*}

This leads to the upper bound

n! \leq ne \left( \frac{n}{e} \right)^n

Wasteful compared to Stirling, but still simple to derive and remember.


Mental context switches are evil

This week I’ve run across two examples of technical strategies to reduce mental context switches.

The first example is Pete Kruckenberg’s story of why his team chose to develop a web application using node.js even though he had extensive experience in Perl/Apache. One of his arguments is that since the client-side code has to be written in JavaScript, it saves mental energy to use JavaScript for the server-side code as well. Even for someone who knows JavaScript and Perl well, it takes mental energy to oscillate between the two languages.

There’s much more to node.js than its use of JavaScript. It requires a very different approach to developing web applications. Kruckenberg and his colleagues had to weigh multiple factors in deciding their development framework. But it may not be too much of a simplification to say they chose a big, one-time mental context switch—the node.js style of development—in order to avoid countless daily context switches between JavaScript and Perl.

The second example is Ryan Barrett’s blog post Why I run shells inside Emacs.

Mental context switches are evil. Some people handle them better than others, but not me. I handle them worse. I feel it, almost physically, when context switches whittle down my precious flow piece by piece.

Barrett explains that he first tried configuring Emacs and his shell to be similar, but the very similarity made the context switch more irritating.

Even something as small as switching between Emacs and the shell can hurt, especially when I do it thousands of times a day. … What’s worse, only some parts are different, so my muscle memory is constantly seduced into a false sense of familiarity, only to have the rug yanked out from under it seconds later.

Both examples highlight the cost of context switches. Neither Kruckenberg nor Barrett mentioned the cost of learning two contexts. Instead both focused on the cost of switching between two contexts. Novices might understandably want to avoid having to learn two similar tools, but these stories come from men who had learned two tools and wanted to avoid oscillating between them.

My favorite line from Barrett is “context switches whittle down my precious flow piece by piece.” I’ve felt little distractions erode my ability to concentrate but hadn’t expressed that feeling in words.

It’s no exaggeration to call flow “precious.” Productivity can easily vary by an order of magnitude depending on whether you’re in the zone. It may sound fussy to try to eliminate minor distractions, but if these distractions make it harder to get into the zone, they’re worth eliminating.
