Wrong and unnecessary

Posted on 7 July 2012 by John

David Hogg on linear regression:

… in almost all cases in which scientists fit a straight line to their data, they are doing something that is simultaneously wrong and unnecessary. It is wrong because … linear relationship is exceedingly rare.

Even if the investigator doesn’t care that the fit is wrong, it is likely to be unnecessary. Why? Because it is rare that … the important result … is the slope and intercept of a best-fit line! Usually the full distribution of data is much more rich, informative, and important than any simple metrics made by fitting an overly simple model.

That said, it must be admitted that one of the most effective ways to communicate scientific results is with catchy punchlines and compact, approximate representations, even when those are unjustified and unnecessary.

Related post: Responsible data analysis

Responsible data analysis

Posted on 5 July 2012 by John

David Hogg on responsible data analysis:

The key idea is that the result of responsible data analysis is not an answer but a distribution over answers. Data are inherently noisy and incomplete; they never answer your question precisely. So no single number … will adequately represent the result of a data analysis. Results are always pdfs (or full likelihood functions); we must embrace that.
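As a rough illustration of reporting a distribution rather than a single number, here is a minimal sketch that resamples a made-up data set and collects the fitted slope from each resample. The bootstrap is only a convenient stand-in here, not the pdf or likelihood analysis the quote calls for, and the data and the function names `fit_slope` and `bootstrap_slopes` are invented for the example.

```python
import random
import statistics

def fit_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def bootstrap_slopes(xs, ys, n_resamples=1000, seed=0):
    """Refit the slope on data resampled with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) > 1:  # skip degenerate resamples with a single x value
            slopes.append(fit_slope(bx, by))
    return slopes

# Made-up noisy data, purely for illustration.
data_rng = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 3.0) for x in xs]

slopes = sorted(bootstrap_slopes(xs, ys))
lo, hi = slopes[25], slopes[974]  # central 95% of the resampled slopes
print(f"point estimate {fit_slope(xs, ys):.2f}, resampled range [{lo:.2f}, {hi:.2f}]")
```

The single fitted slope is one draw from a spread of plausible values; the sorted list of slopes is a crude picture of that spread.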

Emacs key bindings in Visual Studio

Posted on 4 July 2012 by John

I just found out there is a Visual Studio 2010 extension that provides basic Emacs key bindings. This looks fantastic. I’m not looking for much, just the most basic editing and navigation functions, and they seem to be there.

When I first started programming for Windows, I was used to Emacs, and for a while I tried to keep using Emacs. But the impedance mismatch between Visual Studio and Emacs was too much. The benefits of Visual Studio outweighed the familiarity of Emacs and I quit using Emacs. Years later I started using Emacs again, deciding to just live with the impedance mismatch. It would be nice to reduce this mismatch.

So far I like the extension. It adds some Emacs bindings without interfering with familiar Visual Studio operation, except when the two are in direct conflict and it has to choose one or the other.

There is one major bug:

Cut/copy/paste from other applications into Visual Studio does not work with the Emacs extension installed. We’re working on a fix for this issue and will post an updated version of the extension when a fix is available.

Cut and paste works fine within Visual Studio. So if you want to copy code from a file, you need to open that file in Visual Studio first. Maybe that will turn out to be a show-stopper. On the other hand, I do basic editing far more often than I copy from external sources, so maybe the bug is worth tolerating.

Why are remote controls so awful?

Posted on 3 July 2012 by John

A recent Slate article asks “How did the remote control get so awful and confusing?” It gives some history of remote controls, but it doesn’t give a satisfying explanation of how remote controls became so complex or why they remain complex.

I argued a few days ago that simplicity is hard to sell, unless the status quo is overwhelmingly complex. But here is a situation that many people find irritatingly complex and yet nobody has produced a simple alternative. (There are so-called universal remotes, but they’re not universal and they’re not simple.) Why do you think this is?

Probability function names

Posted on 2 July 2012 by John

For a random variable X and a particular value x, one often needs to compute the probabilities Pr(X ≤ x) and Pr(X > x). It’s surprising how many different approaches software packages take to naming these two functions. I’ll give a few examples here.

It may seem unnecessary to provide software for computing both probabilities since they must sum to 1. However, sometimes you have to compute Pr(X > x) directly because computing Pr(X ≤ x) first and subtracting the result from 1 will not be accurate. See the discussion of erf(x) and erfc(x) here as an example.
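To make the accuracy problem concrete, here is a small sketch for the standard normal distribution using only Python's standard library. The function names `normal_cdf` and `normal_ccdf` are mine, chosen for the example.

```python
import math

def normal_cdf(x):
    """Pr(X <= x) for a standard normal X."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def normal_ccdf(x):
    """Pr(X > x) for a standard normal X, computed directly."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

x = 10.0
by_subtraction = 1.0 - normal_cdf(x)  # the CDF rounds to 1.0, so this is 0.0
direct = normal_ccdf(x)               # tiny but positive, about 7.6e-24

print(by_subtraction, direct)
```

The upper tail probability is far smaller than the spacing between floating point numbers near 1, so subtracting from 1 loses it entirely, while the direct computation keeps full relative precision.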

I’m accustomed to calling Pr(X ≤ x) the CDF (cumulative distribution function) and Pr(X > x) the CCDF (complementary cumulative distribution function). In numerical libraries I’ve written, I use the function names CDF and CCDF. This seems natural to me, but hardly any software does this.

In Python (SciPy), distribution classes have a method cdf to compute the CDF, and a method sf for the CCDF. (The rationale is that “sf” stands for “survival function.”) Mathematica takes a similar approach with CDF and SurvivalFunction.
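For example, with SciPy installed, the two methods on the standard normal distribution look like this. The value of x is arbitrary.

```python
from scipy.stats import norm

x = 1.96
lower = norm.cdf(x)  # Pr(X <= x), the CDF
upper = norm.sf(x)   # Pr(X > x), the CCDF, under the name "survival function"

print(lower, upper)
```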

R takes a different approach. Instead of distribution objects with standard methods, each function has a name formed by concatenating a prefix for the function type and an abbreviation for the distribution family. For example, pnorm is the CDF of a normal distribution, dnorm is the PDF of a normal distribution, etc. (I find R’s prefixes hard to remember.) Also, R uses the same function for both the CDF and CCDF. By default, pfoo computes the CDF of a distribution abbreviated foo, but if it is called with the optional argument lower.tail = FALSE it computes the CCDF.
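To show the shape of that interface, here is a Python imitation of it for the normal distribution. This is not R code, and the argument names `q`, `mean`, `sd` and `lower_tail` are only my approximation of R's conventions.

```python
import math

def pnorm(q, mean=0.0, sd=1.0, lower_tail=True):
    """Normal tail probability; one function serves as both CDF and CCDF."""
    z = (q - mean) / (sd * math.sqrt(2.0))
    if lower_tail:
        return 0.5 * math.erfc(-z)  # Pr(X <= q)
    return 0.5 * math.erfc(z)       # Pr(X > q), computed directly

cdf_value = pnorm(1.0)                      # default: lower tail, the CDF
ccdf_value = pnorm(1.0, lower_tail=False)   # upper tail, the CCDF
print(cdf_value, ccdf_value)
```

The flag keeps the number of function names down, at the cost of making the CCDF easy to overlook.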

The Emacs calc module takes an interesting approach, similar to R but more memorable in my opinion. CDF function names begin with ltp (“lower tail probability”) and CCDF function names begin with utp (“upper tail probability”). The final letter of the function name specifies the distribution family: b for binomial, c for chi-square, n for normal, etc. So, for example, the CDF and CCDF of a normal distribution are computed by ltpn and utpn respectively.

I generally prefer APIs with long, self-evident names, but I like the Emacs calculator scheme. Brevity is more important in a calculator than in production code, and the prefixes ltp and utp are easy to remember if you know what they stand for. They’re more symmetric than, for example, Python’s cdf and sf.

Rethinking WYSIWYG

Posted on 1 July 2012 by John

Word processors such as Microsoft Word are said to be WYSIWYG: what you see is what you get. In a sense that’s true, but in another sense markup languages such as HTML or LaTeX are really WYSIWYG.

With WYSIWYG programs, what you see is what you will get visually, if all goes well. If you think of the computer file as simply an intermediary between your keystrokes and paper coming out of a printer, the paper is “what you get.”

But more fundamentally, what you get when you edit a file is a file. And the relationship between your keystrokes and the changes in the file could be quite obscure. With text files, such as files containing source code, what you see is what you get in the sense that the characters you see in your editor correspond directly to the contents of the file.

Sometimes I’m quite happy to be ignorant of how my keystrokes correspond to file contents. When I’m cropping a photo, for example, I’m grateful that I have a visual interface and can be safely ignorant of the layout of bytes in a file. But for other tasks, text files are simpler because there are no mysterious forces at work: what you see really is what you get.

Month: July 2012