Probability function names

For a random variable X and a particular value x, one often needs to compute the probabilities Pr(X ≤ x) and Pr(X > x). It’s surprising how many different approaches software packages take to naming these two functions. I’ll give a few examples here.

It may seem unnecessary to provide software for computing both probabilities since they must sum to 1. However, sometimes you have to compute Pr(X > x) directly because computing Pr(X ≤ x) first and subtracting the result from 1 will not be accurate, particularly when Pr(X > x) is very small. See the discussion of erf(x) and erfc(x) here as an example.
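
A minimal sketch of the accuracy problem, using only Python's standard library. The helper names normal_cdf and normal_ccdf are my own for illustration and do not come from any package discussed here; both assume a standard normal distribution.

```python
import math

def normal_cdf(x):
    """Pr(X <= x) for a standard normal X, via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def normal_ccdf(x):
    """Pr(X > x) for a standard normal X, computed directly."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

x = 10.0
naive = 1.0 - normal_cdf(x)   # the CDF rounds to exactly 1.0 in double precision
direct = normal_ccdf(x)       # tiny but nonzero tail probability

print(naive, direct)
```

The subtraction returns exactly zero because the CDF at 10 is indistinguishable from 1 in double precision, while the direct computation keeps the tail probability, which is on the order of 10^-23.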

I’m accustomed to calling Pr(X ≤ x) the CDF (cumulative distribution function) and Pr(X > x) the CCDF (complementary cumulative distribution function). In numerical libraries I’ve written, I use the function names CDF and CCDF. This seems natural to me, but hardly any software does this.

In Python (SciPy), distribution classes have a method cdf to compute the CDF, and a method sf for the CCDF. (The rationale is that “sf” stands for “survival function.”) Mathematica takes a similar approach with CDF and SurvivalFunction.

R takes a different approach. Instead of distribution objects with standard methods, each function has a name formed by concatenating a prefix for the function type and an abbreviation for the distribution family. For example, pnorm is the CDF of a normal distribution, dnorm is the PDF of a normal distribution, etc. (I find R’s prefixes hard to remember.) Also, R uses the same function for both the CDF and CCDF. By default, pfoo computes the CDF of a distribution abbreviated foo, but if you pass the optional argument lower.tail = FALSE it computes the CCDF.
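
To make the one-function-with-a-flag design concrete, here is a rough Python analogue of R's pnorm. This is only a sketch of the calling convention, not R's implementation; the Python spelling lower_tail is my stand-in for R's lower.tail, since a dot is not allowed in a Python argument name.

```python
import math

def pnorm(q, mean=0.0, sd=1.0, lower_tail=True):
    """Hypothetical Python imitation of R's pnorm calling convention.

    Returns Pr(X <= q) by default and Pr(X > q) when lower_tail is False,
    for a normal X with the given mean and standard deviation.
    """
    z = (q - mean) / (sd * math.sqrt(2.0))
    if lower_tail:
        return 0.5 * math.erfc(-z)
    # Compute the upper tail directly rather than as 1 - lower tail.
    return 0.5 * math.erfc(z)

print(pnorm(1.96))                    # CDF
print(pnorm(1.96, lower_tail=False))  # CCDF
```

The flag keeps the number of function names down, at the cost of making the CCDF easy to overlook if you don't know the argument exists.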

The Emacs calc module takes an interesting approach, similar to R but more memorable in my opinion. CDF function names begin with ltp (“lower tail probability”) and CCDF function names begin with utp (“upper tail probability”). The final letter of the function name specifies the distribution family: b for binomial, c for chi-square, n for normal, etc. So, for example, the CDF and CCDF of a normal distribution are computed by ltpn and utpn respectively.

I generally prefer APIs with long, self-evident names, but I like the Emacs calculator scheme. Brevity is more important in a calculator than in production code, and the prefixes ltp and utp are easy to remember if you know what they stand for. They’re more symmetric than, for example, Python’s cdf and sf.

Related links:

Distributions in SciPy, Mathematica, R, and Excel.
Emacs calculator

4 comments on “Probability function names”
  1. Matlab uses a scheme similar to R, where probability functions go {family}{function type}. So, one has normcdf for the CDF of the normal distribution, and tcdf for that of the Student t distribution, and so on. Surprisingly, a quick look through the documentation of the probability toolbox could not unearth any obvious equivalent for the CCDF, which seems like an unfortunate omission. It does have, however, inverse-CDF functions, such as norminv and tinv: useful for statistical estimation and hypothesis testing, but confusing when looking for CCDFs. :-)

  2. Rick Wicklin says:

    In SAS the names are CDF and SDF. There are also separate LOGCDF and LOGSDF functions, which are sometimes more efficient or accurate than computing the distribution and then taking the logarithm. The other essential functions for dealing with distributions are PDF, QUANTILE, and RAND. For details and a discussion that compares these naming schemes with other languages, see “Four essential functions for statistical programmers”

  3. Tom Nish says:

    Dr. Cook,
    How would you name probability functions if you were writing such a library from scratch? I ask b/c I’m doing something similar for Perl, built upon GSL’s randist functions. GSL names their sampler, pdf, cdf, ccdf and quantile functions gsl_ran_gaussian, gsl_ran_gaussian_pdf, gsl_cdf_gaussian_P, gsl_cdf_gaussian_Q, gsl_cdf_gaussian_Pinv, and gsl_cdf_gaussian_Qinv, which is quite verbose though consistent.

    Right now, I’m thinking of including a straight GSL binding, but also providing alternative interfaces mimicking other languages (like R and octave/matlab, by wrapping gsl_ran_gaussian into rnorm for R and normrnd for octave).

  4. John says:

    Tom: I’ve written a class library in C++ and one in C#. Both use distribution objects with member functions for CDF etc. For example, the NormalDistribution class has methods PDF, LogPDF, CDF, CDFInverse, CCDF, CCDFInverse, Mean, Variance, RandomValue, etc.

    One advantage to this approach is that you can write generic code that takes a distribution class. For example, a unit test might integrate the PDF function and compare the result to the CDF function. If all the distribution classes have a common base class or implement a common abstract interface, you could write this test once and pass it objects representing several different distribution families.
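
The generic test described above might look something like the following Python sketch. The class names, method names, and the Simpson's rule helper are all hypothetical choices for illustration; they are not the C++ or C# library mentioned in the comment.

```python
import math
from abc import ABC, abstractmethod

class Distribution(ABC):
    """Hypothetical common interface shared by all distribution classes."""
    @abstractmethod
    def pdf(self, x): ...
    @abstractmethod
    def cdf(self, x): ...

class Normal(Distribution):
    def __init__(self, mean=0.0, sd=1.0):
        self.mean, self.sd = mean, sd
    def pdf(self, x):
        z = (x - self.mean) / self.sd
        return math.exp(-0.5 * z * z) / (self.sd * math.sqrt(2.0 * math.pi))
    def cdf(self, x):
        z = (x - self.mean) / (self.sd * math.sqrt(2.0))
        return 0.5 * math.erfc(-z)

class Exponential(Distribution):
    def __init__(self, rate=1.0):
        self.rate = rate
    def pdf(self, x):
        return self.rate * math.exp(-self.rate * x) if x >= 0 else 0.0
    def cdf(self, x):
        return -math.expm1(-self.rate * x) if x >= 0 else 0.0

def pdf_matches_cdf(dist, a, b, n=1000, tol=1e-8):
    """Integrate dist.pdf over [a, b] with composite Simpson's rule (n even)
    and compare the result to dist.cdf(b) - dist.cdf(a)."""
    h = (b - a) / n
    total = dist.pdf(a) + dist.pdf(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * dist.pdf(a + i * h)
    integral = total * h / 3.0
    return abs(integral - (dist.cdf(b) - dist.cdf(a))) < tol

# One test, written once, applied to several distribution families.
cases = [(Normal(1.0, 2.0), -1.0, 2.0), (Exponential(2.0), 0.5, 3.0)]
results = [pdf_matches_cdf(dist, a, b) for dist, a, b in cases]
print(results)
```

The test function knows nothing about which family it is given; it only relies on the shared pdf and cdf methods. The intervals here are chosen to avoid the kink in the exponential density at zero, where Simpson's rule would converge more slowly.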