David Hogg calls conventional statistical notation a “nomenclatural abomination”:

The terminology used throughout this document enormously overloads the symbol p(). That is, we are using, in each line of this discussion, the function p() to mean something different; its meaning is set by the letters used in its arguments. That is a nomenclatural abomination. I apologize, and encourage my readers to do things that aren’t so ambiguous (like maybe add informative subscripts), but it is so standard in our business that I won’t change (for now).

I found this terribly confusing when I started doing statistics. The meaning is not explicit in the notation but implicit in the conventions surrounding its use, conventions that were foreign to me since I was trained in mathematics and came to statistics later. When I would use letters like *f* and *g* for functions collaborators would say “I don’t know what you’re talking about.” Neither did I understand what they were talking about since they used one letter for everything.

Well, if R and Julia can have multiple dispatch, why not mathematical notation? :)

I have the feeling that this is especially true of books about Bayesian statistics.

@Rasmus: It’s more like lambda calculus than R or Julia. Every function is named lambda. s/lambda/p/

I’ve apologized to students for horribly inconsistent notation and terminology in stats, and taken to spending 5-10 minutes noting that stats frequently uses the terms population, proportion, parameter, and probability, and that first-time students need to be careful not to confuse them.

The title of this post is almost self-referential. Does that count for anything?

I’m also reminded of Acme::Bleach, and polymorphism.

Who are you collaborating with that doesn’t use f and g for function symbols? (That’s not a rhetorical question – I’m really curious what community that notation is foreign to…)

I also find myself apologizing to students (and cursing the statistics community for not cleaning this up) for having

-N(mu,sigma^2) in books

-dnorm(mean,sigma) in R

-dnorm(mean,precision) in JAGS and WinBUGS

-back to normal(mean,sigma) in Stan (but the BDA3 book also uses N(mu,sigma^2)), so that now JAGS, WinBUGS and Stan are inconsistent.

Especially for beginning students, the fact that the lecture notes use N(mu,sigma^2) but in R one writes dnorm(mean,sigma) causes endless confusion. And then, when one starts Bayesian modeling, things rapidly descend into a mess.
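To make the mismatch concrete, here is a minimal Python sketch of the three parameterizations listed above. The function names are hypothetical and only mimic the conventions (variance in textbooks, standard deviation in R and Stan, precision in JAGS and WinBUGS); they are not the APIs of those tools.

```python
import math
from statistics import NormalDist

def normal_pdf_from_variance(x, mu, sigma2):
    """Textbook convention N(mu, sigma^2): second parameter is the variance."""
    return NormalDist(mu, math.sqrt(sigma2)).pdf(x)

def normal_pdf_from_sd(x, mean, sd):
    """R/Stan-style convention: second parameter is the standard deviation."""
    return NormalDist(mean, sd).pdf(x)

def normal_pdf_from_precision(x, mean, tau):
    """JAGS/WinBUGS-style convention: second parameter is the precision 1/sigma^2."""
    return NormalDist(mean, 1.0 / math.sqrt(tau)).pdf(x)

x, mean, sd = 1.0, 0.0, 2.0

# The same distribution, written three ways.
by_variance = normal_pdf_from_variance(x, mean, sd ** 2)
by_sd = normal_pdf_from_sd(x, mean, sd)
by_precision = normal_pdf_from_precision(x, mean, 1.0 / sd ** 2)

# The classic slip: copying sigma^2 from the lecture notes into an sd slot.
wrong = normal_pdf_from_sd(x, mean, sd ** 2)

print(by_variance, by_sd, by_precision, wrong)
```

The first three values agree; the fourth silently describes a different distribution, which is exactly the kind of error the inconsistent conventions invite.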

I’m so glad someone raises his voice on this…

It’s also the case in thermodynamics where you can see something like F(P,V) and later maybe F(U,T), where F is the same physical quantity (free energy) but a different mathematical function.

The worst I’ve ever seen, in a published paper on Bayesian inference, is this:

“Let’s define the likelihood as : p(theta|x) = p(x|theta)”.

I stared at this for 10 minutes until I fell off my chair: the LHS p here is not a probability distribution over its first argument, theta. (I wonder how the author would write the “posterior” probability of theta given x.) I agree it’s not exactly the same problem; it’s worse.
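A small sketch of why the likelihood is not a distribution over theta. The binomial model here is an assumption chosen for illustration (it is not taken from the paper in question), and the function names are hypothetical. For fixed theta the function sums to 1 over x, but for fixed x it does not integrate to 1 over theta.

```python
from fractions import Fraction
from math import comb, factorial

def p_X_given_Theta(x, theta, n):
    """Binomial mass function of x successes in n trials, given theta."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

def integral_over_theta(x, n):
    """Exact integral of the binomial likelihood over theta in [0, 1].

    Uses the beta integral: int theta^x (1-theta)^(n-x) dtheta = x!(n-x)!/(n+1)!
    """
    return Fraction(comb(n, x) * factorial(x) * factorial(n - x), factorial(n + 1))

n = 10
theta = Fraction(3, 10)

# As a function of x (theta fixed): a proper probability mass function.
total_over_x = sum(p_X_given_Theta(x, theta, n) for x in range(n + 1))

# As a function of theta (x fixed): the area is 1/(n+1), not 1.
area_over_theta = integral_over_theta(7, n)

print(total_over_x, area_over_theta)
```

Under this assumed model the likelihood has to be multiplied by a prior and renormalized before it can be called a distribution over theta, which is why equating it with p(theta|x) is so jarring.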

I wrote a blog post several years ago on this same topic:

http://lingpipe-blog.com/2009/10/13/whats-wrong-with-probability-notation/

As Michael Collins pointed out in the comments and as I’ve subsequently seen in practice, the probability theorists follow a convention that satisfies notational purists.

A joint probability function (density, mass, or mixed) over random variables X_1,…,X_n is written p_{X_1,…,X_n}. A conditional probability function of random variables X_1,…,X_n given random variables Y_1,…,Y_m is written as p_{X_1,…,X_n|Y_1,…,Y_m}. Now you can supply any arguments you want without confusion.

For example, if X and Y are random variables, the first step of deriving Bayes’s rule for X and Y is unambiguously written as

p_{X|Y}(a|b) = p_{Y|X}(b|a) * p_{X}(a) / p_{Y}(b)

even if you use x for a and y for b.
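The subscripted convention maps naturally onto code, where each probability function gets its own name. The following is a minimal sketch with a made-up joint mass function over two binary random variables; the table values and the function names are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical joint mass function p_{X,Y} with X, Y in {0, 1}.
JOINT = {
    (0, 0): Fraction(1, 8),
    (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
}

def p_XY(a, b):
    """Joint: p_{X,Y}(a, b)."""
    return JOINT[(a, b)]

def p_X(a):
    """Marginal: p_X(a), summing the joint over Y."""
    return sum(v for (x, _), v in JOINT.items() if x == a)

def p_Y(b):
    """Marginal: p_Y(b), summing the joint over X."""
    return sum(v for (_, y), v in JOINT.items() if y == b)

def p_X_given_Y(a, b):
    """Conditional: p_{X|Y}(a | b)."""
    return p_XY(a, b) / p_Y(b)

def p_Y_given_X(b, a):
    """Conditional: p_{Y|X}(b | a)."""
    return p_XY(a, b) / p_X(a)

# Bayes's rule holds for every pair of arguments, whatever they are called.
for a in (0, 1):
    for b in (0, 1):
        assert p_X_given_Y(a, b) == p_Y_given_X(b, a) * p_X(a) / p_Y(b)

print(p_X_given_Y(0, 1))
```

Because the function name carries the random variables, the argument names are free: calling p_X_given_Y(y, x) with swapped variable names would still mean p_{X|Y}, which is precisely what the bare p(x|y) notation cannot express.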

This also clears up event notation for probabilities. So we can define the cumulative distribution function for random variable X as F_X(x) =def= Pr[X ≤ x], and nothing gets confused (unless you're on a board or piece of paper or have poor eyesight and are working with a sans-serif font). It also explains why random variables are written in capitals: to distinguish them from plain old variables.

In applied statistics, it's rather tedious to type all those subscripts, so people tend to use x and y for variables ranging over random variables X and Y, so that p(x|y) is implicitly taken to mean p_{X|Y}(x|y).

Things get even more confusing for learners when you use the convention of Gelman et al.'s Bayesian Data Analysis (as we do with Stan) and simultaneously drop the random variable subscripts on p and use the same notation x for a random variable and plain-old variable (Gelman argues that it’s problematic for Greek letters like Sigma and trying to capitalize matrices like M). I’ve gotten used to this convention in practice, but we sometimes have to clarify which random variables we’re talking about (as when defining cumulative distribution functions).

I wrote a little about the broader issue here with a specific example of functions vs independent variables in coordinate geometry.

As I say there: “It’s interesting how many notational shortcuts mathematicians take would never be tolerated by a programmer.”

Gerry Sussman writes in his book Functional Differential Geometry:

James: It’s a little odd that Functional Differential Geometry uses weakly typed code. I suppose you can’t do everything at once, but it would have been interesting to write such a book in Haskell rather than Lisp so that the code had a type system that mirrored the geometry.

How about programming languages, where you can use any symbol you want, as long as it appeared on the IBM Selectric Typewriter. Instead of using ←, →, ≤, ≠, ¬, ∅, ⇒, ∨, ∧, or ⊗, we use , , ||, &&, and ^. This continuing re-use of typewriter characters leads to context-sensitive interpretation, rules for multi-character lexing, and so on.

Seems like every field needs a “time to turn the crank” day.

In teaching, I found it helpful to use Pr( ) for the probability measure, pd( ) for a probability density, pm( ) for a probability mass function and pdm( ) for jointly distributed variables of mixed type. Subscript as needed.