For a random variable *X* and a particular value *x*, one often needs to compute the probabilities Pr(*X* ≤ *x*) and Pr(*X* > *x*). It’s surprising how many different approaches software packages take to naming these two functions. I’ll give a few examples here.

It may seem unnecessary to provide software for computing both probabilities since they must sum to 1. However, sometimes you have to compute Pr(*X* > *x*) directly because computing Pr(*X* ≤ *x*) first and subtracting the result from 1 will not be accurate. See the discussion of erf(x) and erfc(x) here as an example.
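To make the accuracy problem concrete, here is a minimal sketch using only Python's standard library. The helper names `normal_cdf` and `normal_ccdf` are made up for illustration; they express the standard normal tail probabilities through `math.erfc`. Far enough into the upper tail, the CDF rounds to exactly 1 in double precision, so subtracting it from 1 loses the answer entirely.

```python
import math

def normal_cdf(x):
    """Pr(X <= x) for a standard normal X, via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def normal_ccdf(x):
    """Pr(X > x) for a standard normal X, computed directly rather than as 1 - CDF."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

x = 10.0
naive = 1.0 - normal_cdf(x)   # the CDF rounds to 1.0, so all information is lost
direct = normal_ccdf(x)       # tiny but nonzero upper tail probability

print("1 - CDF:", naive)
print("CCDF:   ", direct)
```

The subtraction returns 0, while the direct computation returns a probability on the order of 10^-23.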

I’m accustomed to calling Pr(*X* ≤ *x*) the CDF (cumulative distribution function) and Pr(*X* > *x*) the CCDF (complementary cumulative distribution function). In numerical libraries I’ve written, I use the function names `CDF` and `CCDF`. This seems natural to me, but hardly any software does this.

In Python (SciPy), distribution classes have a method `cdf` to compute the CDF, and a method `sf` for the CCDF. (The rationale is that “sf” stands for “survival function.”) Mathematica takes a similar approach with `CDF` and `SurvivalFunction`.

R takes a different approach. Instead of distribution objects with standard methods, each function has a name formed by concatenating a prefix for the function type and an abbreviation for the distribution family. For example, `pnorm` is the CDF of a normal distribution, `dnorm` is the PDF of a normal distribution, etc. (I find R’s prefixes hard to remember.) Also, R uses the same function for both the CDF and CCDF. By default, `pfoo` computes the CDF of a distribution abbreviated `foo`, but if the function is called with the optional argument `lower.tail = FALSE`, it computes the CCDF.
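For comparison with the object-based APIs, here is a rough Python imitation of the R convention. This `pnorm` is a hypothetical stand-in written for illustration, not R's implementation; the keyword `lower_tail` mirrors R's `lower.tail`, since a dot is not allowed in a Python argument name. Each tail is computed directly from `math.erfc`, so asking for the upper tail does not involve subtracting from 1.

```python
import math

def pnorm(q, mean=0.0, sd=1.0, lower_tail=True):
    """Normal tail probability in the style of R's pnorm (illustrative analogue only).

    lower_tail=True  returns Pr(X <= q), the CDF.
    lower_tail=False returns Pr(X > q), the CCDF.
    """
    z = (q - mean) / (sd * math.sqrt(2.0))
    # erfc(-z)/2 is the lower tail and erfc(z)/2 is the upper tail.
    return 0.5 * math.erfc(-z if lower_tail else z)

lower = pnorm(1.96)                    # CDF, roughly 0.975
upper = pnorm(1.96, lower_tail=False)  # CCDF, roughly 0.025

print(lower, upper)
```

One function name covers both probabilities, at the cost of having to remember the flag.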

The Emacs `calc` module takes an interesting approach, similar to R but more memorable in my opinion. CDF function names begin with `ltp` (“lower tail probability”) and CCDF function names begin with `utp` (“upper tail probability”). The final letter of the function name specifies the distribution family: `b` for binomial, `c` for chi-square, `n` for normal, etc. So, for example, the CDF and CCDF of a normal distribution are computed by `ltpn` and `utpn` respectively.

I generally prefer APIs with long, self-evident names, but I like the Emacs calculator scheme. Brevity is more important in a calculator than in production code, and the prefixes `ltp` and `utp` are easy to remember if you know what they stand for. They’re more symmetric than, for example, Python’s `cdf` and `sf`.

**Related links**:

Distributions in SciPy, Mathematica, R, and Excel.

Emacs calculator

Matlab uses a scheme similar to R, where probability function names follow the pattern `{family}{function type}`. So, one has `normcdf` for the CDF of the normal distribution, and `tcdf` for that of the Student t distribution, and so on. Surprisingly, a quick look through the documentation of the probability toolbox could not unearth any obvious equivalent for the CCDF, which seems like an unfortunate omission. It does have, however, inverse-CDF functions, such as `norminv` and `tinv`: useful for statistical estimation and hypothesis testing, but confusing when looking for CCDFs.

In SAS the names are CDF and SDF. There are also separate LOGCDF and LOGSDF functions, which are sometimes more efficient or accurate than computing the distribution and then taking the logarithm. The other essential functions for dealing with distributions are PDF, QUANTILE, and RAND. For details and a discussion that compares these naming schemes with other languages, see “Four essential functions for statistical programmers”.

Dr. Cook,

How would you name probability functions if you were writing such a library from scratch? I ask b/c I’m doing something similar for Perl, built upon GSL’s randist functions. GSL names their sampler, pdf, cdf, ccdf and quantile functions gsl_ran_gaussian, gsl_ran_gaussian_pdf and gsl_ran_gaussian_cdf_P, gsl_ran_gaussian_cdf_Q, gsl_ran_gaussian_cdf_Pinv, and gsl_ran_gaussian_cdf_Qinv, which is quite verbose though consistent.

Right now, I’m thinking of including a straight GSL binding, but also providing alternative interfaces mimicking other languages (like R and octave/matlab, by wrapping gsl_ran_gaussian into rnorm for R and normrnd for octave).

-Tom

Tom: I’ve written a class library in C++ and one in C#. Both use distribution objects with member functions for CDF etc. For example, the `NormalDistribution` class has methods `PDF`, `LogPDF`, `CDF`, `CDFInverse`, `CCDF`, `CCDFInverse`, `Mean`, `Variance`, `RandomValue`, etc.

One advantage to this approach is that you can write generic code that takes a distribution class. For example, a unit test might integrate the PDF function and compare the result to the CDF function. If all the distribution classes have a common base class or implement a common abstract interface, you could write this test once and pass it objects representing several different distribution families.
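The generic-test idea can be sketched in Python as follows. This is not the C++ or C# library described above; the names `Distribution`, `Normal`, `Exponential`, and `pdf_integral_matches_cdf` are invented for the illustration, and the integration is plain Simpson's rule rather than anything a production test would necessarily use.

```python
import math
from abc import ABC, abstractmethod

class Distribution(ABC):
    """Hypothetical common interface shared by all distribution families."""

    @abstractmethod
    def pdf(self, x): ...

    @abstractmethod
    def cdf(self, x): ...

    @abstractmethod
    def ccdf(self, x): ...

class Normal(Distribution):
    def __init__(self, mean=0.0, sd=1.0):
        self.mean, self.sd = mean, sd

    def pdf(self, x):
        z = (x - self.mean) / self.sd
        return math.exp(-0.5 * z * z) / (self.sd * math.sqrt(2.0 * math.pi))

    def cdf(self, x):
        return 0.5 * math.erfc(-(x - self.mean) / (self.sd * math.sqrt(2.0)))

    def ccdf(self, x):
        # Computed directly, not as 1 - cdf(x), to stay accurate in the upper tail.
        return 0.5 * math.erfc((x - self.mean) / (self.sd * math.sqrt(2.0)))

class Exponential(Distribution):
    def __init__(self, rate=1.0):
        self.rate = rate

    def pdf(self, x):
        return self.rate * math.exp(-self.rate * x) if x >= 0 else 0.0

    def cdf(self, x):
        return -math.expm1(-self.rate * x) if x >= 0 else 0.0

    def ccdf(self, x):
        return math.exp(-self.rate * x) if x >= 0 else 1.0

def pdf_integral_matches_cdf(dist, a, b, n=2000, tol=1e-8):
    """Integrate dist.pdf over [a, b] with Simpson's rule (n must be even)
    and compare the result with dist.cdf(b) - dist.cdf(a)."""
    h = (b - a) / n
    total = dist.pdf(a) + dist.pdf(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * dist.pdf(a + i * h)
    integral = total * h / 3.0
    return abs(integral - (dist.cdf(b) - dist.cdf(a))) < tol

# One test, written once, applied to objects from different families.
cases = [(Normal(), -1.0, 2.0), (Normal(3.0, 0.5), 2.0, 4.0), (Exponential(2.0), 0.5, 3.0)]
for dist, a, b in cases:
    print(type(dist).__name__, pdf_integral_matches_cdf(dist, a, b))
```

The test function knows nothing about any particular family; it relies only on the `pdf` and `cdf` methods that every class in the hierarchy provides.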