Probability distributions in SciPy

by John on July 20, 2009

Here are some notes on how to work with probability distributions using the SciPy numerical library for Python.

Functions related to probability distributions are located in scipy.stats. The general pattern is

scipy.stats.<distribution family>.<function>

There are 81 supported continuous distribution families and 12 discrete distribution families. Some distributions have obvious names: gamma, cauchy, t, f, etc. The only possible surprise is that all distributions begin with a lower-case letter, even those corresponding to a proper name (e.g. Cauchy). Other distribution names are less obvious: expon for the exponential, chi2 for chi-squared distribution, etc.

Each distribution supports several functions. The density and cumulative distribution functions are pdf and cdf respectively. (Discrete distributions use pmf rather than pdf.) One surprise here is that the inverse CDF function is called ppf for “percentage point function.” I’d never heard that terminology and would have expected something like “quantile.”

Example: scipy.stats.beta.cdf(0.1, 2, 3) evaluates the CDF of a beta(2, 3) random variable at 0.1.

Random values are generated using rvs which takes an optional size argument. The size is set to 1 by default.

Example: scipy.stats.norm.rvs(2, 3) generates a random sample from a normal (Gaussian) random variable with mean 2 and standard deviation 3. The function call scipy.stats.norm.rvs(2, 3, size = 10) returns an array of 10 samples from the same distribution.

The command line help() facility does not document the distribution parameterizations, but the external documentation does. Most distributions are parameterized in terms of location and scale. This means, for example, that the exponential distribution is parameterized in terms of its mean, not its rate. Somewhat surprisingly, the exponential distribution has a location parameter. This means, for example, that scipy.stats.expon.pdf(x, 7) evaluates at x the PDF of an exponential distribution with location 7. This is not what I expected. I assumed there would be no location parameter and that the second argument, 7, would be the mean (scale). Instead, the location was set to 7 and the scale was left at its default value 1. Writing scipy.stats.expon.pdf(x, scale=7) would have given the expected result because the default location value is 0.

SciPy also provides constructors for objects representing random variables.

Example: x = scipy.stats.norm(3, 1); x.cdf(2.7) returns the same value as scipy.stats.norm.cdf(2.7, 3, 1).

Constructing objects representing random variables encapsulates the differences between distributions in the constructors. For example, some distributions take more parameters than others and so their object constructors require more arguments. But once a distribution object is created, its PDF, for example, can be called with a single argument. This makes it easier to write code that takes a general distribution object as an argument.

Related posts:

Numerical computing in IronPython with IronClad
Stand-alone error function erf(x)

{ 7 comments… read them below or add one }

1

Ralf Gommers 07.20.09 at 12:25

The python help() facility does not do the right thing (not sure why) but ipython does:

In [4]: scipy.stats.norm?

gives the same as the html docs.

2

Omer Khalid 07.20.09 at 12:53

A good step in the right direction…

It would be good John if you could expand more on certain use cases. i.e. if one have a sample space of X as list [], then what steps one should take to calculate the mean and standard deviation.

From that moment onwards, your article provides a clear way forward to get PDF/CDF probabilities..

Thanks,
Omer

3

John 07.20.09 at 13:05

Omer: If you just want to compute sample mean and sample standard deviation, you could use the functions scipy.stats.mean and scipy.stats.std. But if you want to fit parameters for non-normal distributions, look at the fit method on the distribution you’re using to describe your data.

4

Omer Khalid 07.20.09 at 13:48

If I understood correct; scipy.stats.mean and scipy.stats.std will be sample’s mean and standard deviation but this is not normal distribution yet. right? Because in your examples you are referring to scipy.stats.norm.* ?

I suppose normal-distribution mean/std.deviation could be calculated through these functions: scipy.stats.norm.mean and scipy.stats.norm.std.

Probably it would be best I guess to first fit the sample set to a distribution and get a mean/std.deviation for it. If this could be the right approach then, would fit function will return a distribution type object which one can use for mean, std and probability?

5

Ralf Gommers 07.20.09 at 13:48

Actually, I just realized what is going on with python help(). The distributions are class instances, and their docstrings are updated at import time to be instance-specific. IPython is smart enough to then do the right thing, the Python interpreter is not – it just displays the class docstring.

6

John 07.20.09 at 14:27

Omer: You can calculate the sample mean and sample variance no matter what distribution your data comes from. However, if your data comes from a normal distribution, the sample mean and sample variance are the best estimators for the population parameters in a technical sense.

Does it look like your data follow a normal distribution? If not, what kind of distribution do they follow? That seems to be the first thing to settle. If you have questions as to the appropriate distribution for your data, perhaps we should continue this discussion via email. See the “About” tab at the top for my contact info.

7

nick 07.21.09 at 11:03

I’ve been meaning to see how stat packages organize their distribution families. I thought it would be exactly this, but didn’t want to be hasty. Thanks so much, now I know the right way to proceed in my application!

Leave a Comment

You can use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>