For a random variable X and a particular value x, one often needs to compute the probabilities Pr(X ≤ x) and Pr(X > x). It’s surprising how many different approaches software packages take to naming these two functions. I’ll give a few examples here.
It may seem unnecessary to provide software for computing both probabilities since they must sum to 1. However, sometimes you have to compute Pr(X > x) directly because computing Pr(X ≤ x) first and subtracting the result from 1 will not be accurate. See the discussion of erf(x) and erfc(x) here as an example.
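The loss of accuracy is easy to see with a standard normal distribution. The following is a minimal sketch using only Python's standard library; the helper names normal_cdf and normal_ccdf are my own, chosen for illustration, and the formulas are the usual ones expressing the normal tail probabilities in terms of erfc.

```python
import math

def normal_cdf(x):
    """Pr(X <= x) for a standard normal X, written in terms of erfc."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def normal_ccdf(x):
    """Pr(X > x) for a standard normal X, computed directly from erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

x = 10.0
naive = 1.0 - normal_cdf(x)   # the CDF rounds to 1.0 in double precision
direct = normal_ccdf(x)       # tiny, but not zero

print(naive)
print(direct)
```

Subtracting from 1 returns exactly zero here, because the CDF at 10 is indistinguishable from 1 in double precision. The direct computation returns a positive probability of roughly 7.6e-24.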
I’m accustomed to calling Pr(X ≤ x) the CDF (cumulative distribution function) and Pr(X > x) the CCDF (complementary cumulative distribution function). In numerical libraries I’ve written, I use the function names CDF and CCDF. This seems natural to me, but hardly any software does this.
In Python (SciPy), distribution classes have a method cdf to compute the CDF, and a method sf for the CCDF. (The rationale is that “sf” stands for “survival function.”) Mathematica takes a similar approach with CDF and SurvivalFunction.
R takes a different approach. Instead of distribution objects with standard methods, each function has a name formed by concatenating a prefix for the function type and an abbreviation for the distribution family. For example, pnorm is the CDF of a normal distribution, dnorm is the PDF of a normal distribution, etc. (I find R’s prefixes hard to remember.) Also, R uses the same function for both the CDF and CCDF. By default, pfoo computes the CDF of a distribution abbreviated foo, but if it is called with the optional argument lower.tail = FALSE, it computes the CCDF.
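To make the single-function design concrete, here is a hypothetical Python analogue of R's pnorm. It is only a sketch: the name and the mean, sd, and lower_tail parameters mirror R's pnorm(q, mean, sd, lower.tail), but the implementation is my own and says nothing about how R computes these values.

```python
import math

def pnorm(q, mean=0.0, sd=1.0, lower_tail=True):
    """Hypothetical analogue of R's pnorm: CDF by default, CCDF on request."""
    z = (q - mean) / (sd * math.sqrt(2.0))
    # erfc(-z)/2 is Pr(X <= q) and erfc(z)/2 is Pr(X > q). Each tail is
    # computed directly rather than as one minus the other.
    return 0.5 * math.erfc(-z if lower_tail else z)

print(pnorm(1.96))                    # lower tail, Pr(X <= 1.96)
print(pnorm(1.96, lower_tail=False))  # upper tail, Pr(X > 1.96)
```

The flag keeps the number of function names down, at the cost of making the CCDF easy to overlook when skimming an API.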
The Emacs calc module takes an interesting approach, similar to R but more memorable in my opinion. CDF function names begin with ltp (“lower tail probability”) and CCDF function names begin with utp (“upper tail probability”). The final letter of the function name specifies the distribution family: b for binomial, c for chi-square, n for normal, etc. So, for example, the CDF and CCDF of a normal distribution are computed by ltpn and utpn respectively.
I generally prefer APIs with long, self-evident names, but I like the Emacs calculator scheme. Brevity is more important in a calculator than in production code, and the prefixes ltp and utp are easy to remember if you know what they stand for. They’re more symmetric than, for example, Python’s cdf and sf.
Related links
- Distributions in SciPy, Mathematica, R, and Excel.
- Emacs calculator
Matlab uses a scheme similar to R, where probability functions are named {family}{function type}. So one has normcdf for the CDF of the normal distribution, tcdf for that of the Student t distribution, and so on. Surprisingly, a quick look through the documentation of the probability toolbox did not unearth any obvious equivalent for the CCDF, which seems like an unfortunate omission. It does, however, have inverse-CDF functions such as norminv and tinv: useful for statistical estimation and hypothesis testing, but confusing when looking for CCDFs. :-)
In SAS the names are CDF and SDF. There are also separate LOGCDF and LOGSDF functions, which are sometimes more efficient or accurate than computing the distribution and then taking the logarithm. The other essential functions for dealing with distributions are PDF, QUANTILE, and RAND. For details and a discussion that compares these naming schemes with other languages, see “Four essential functions for statistical programmers.”
Dr. Cook,
How would you name probability functions if you were writing such a library from scratch? I ask b/c I’m doing something similar for Perl, built upon GSL’s randist functions. GSL names its sampler, PDF, CDF, CCDF, and quantile functions gsl_ran_gaussian, gsl_ran_gaussian_pdf, gsl_cdf_gaussian_P, gsl_cdf_gaussian_Q, gsl_cdf_gaussian_Pinv, and gsl_cdf_gaussian_Qinv, which is quite verbose though consistent.
Right now, I’m thinking of including a straight GSL binding, but also providing alternative interfaces mimicking other languages (like R and Octave/Matlab, by wrapping gsl_ran_gaussian as rnorm for R and normrnd for Octave).
-Tom
Tom: I’ve written a class library in C++ and one in C#. Both use distribution objects with member functions for CDF etc. For example, the NormalDistribution class has methods PDF, LogPDF, CDF, CDFInverse, CCDF, CCDFInverse, Mean, Variance, RandomValue, etc.
One advantage to this approach is that you can write generic code that takes a distribution class. For example, a unit test might integrate the PDF function and compare the result to the CDF function. If all the distribution classes have a common base class or implement a common abstract interface, you could write this test once and pass it objects representing several different distribution families.
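A generic test of this kind might look like the following Python sketch. Everything here is hypothetical and for illustration only: the Distribution interface, the two small classes, and the Simpson's rule integrator are not taken from the C++ or C# libraries mentioned above.

```python
import math
from abc import ABC, abstractmethod

class Distribution(ABC):
    """Hypothetical common interface; the method names are illustrative."""
    @abstractmethod
    def pdf(self, x): ...
    @abstractmethod
    def cdf(self, x): ...

class Normal(Distribution):
    def __init__(self, mean=0.0, sd=1.0):
        self.mean, self.sd = mean, sd
    def pdf(self, x):
        z = (x - self.mean) / self.sd
        return math.exp(-0.5 * z * z) / (self.sd * math.sqrt(2.0 * math.pi))
    def cdf(self, x):
        z = (x - self.mean) / (self.sd * math.sqrt(2.0))
        return 0.5 * math.erfc(-z)

class Exponential(Distribution):
    def __init__(self, rate=1.0):
        self.rate = rate
    def pdf(self, x):
        return self.rate * math.exp(-self.rate * x) if x >= 0 else 0.0
    def cdf(self, x):
        return 1.0 - math.exp(-self.rate * x) if x >= 0 else 0.0

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def pdf_integral_matches_cdf(dist, a, b, tol=1e-8):
    """Check that the integral of the PDF over [a, b] equals CDF(b) - CDF(a)."""
    return abs(simpson(dist.pdf, a, b) - (dist.cdf(b) - dist.cdf(a))) < tol

# The same test runs unchanged against several distribution families.
cases = [(Normal(), -1.0, 2.0), (Normal(3.0, 0.5), 2.0, 4.5), (Exponential(2.0), 0.0, 3.0)]
results = [pdf_integral_matches_cdf(dist, a, b) for dist, a, b in cases]
for (dist, _, _), ok in zip(cases, results):
    print(type(dist).__name__, ok)
```

The test function knows nothing about any particular family. Adding a new distribution only requires implementing pdf and cdf and appending a case to the list.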