Probability distributions in SciPy

Here are some notes on how to work with probability distributions using the SciPy numerical library for Python.

Functions related to probability distributions are located in scipy.stats. The general pattern is

scipy.stats.<distribution family>.<function>

There are 81 supported continuous distribution families and 12 discrete distribution families. Some distributions have obvious names: gamma, cauchy, t, f, etc. The only possible surprise is that all distributions begin with a lower-case letter, even those corresponding to a proper name (e.g. Cauchy). Other distribution names are less obvious: expon for the exponential, chi2 for chi-squared distribution, etc.

Each distribution supports several functions. The density and cumulative distribution functions are pdf and cdf respectively. (Discrete distributions use pmf rather than pdf.) One surprise here is that the inverse CDF function is called ppf for “percentage point function.” I’d never heard that terminology and would have expected something like “quantile.”

Example: scipy.stats.beta.cdf(0.1, 2, 3) evaluates the CDF of a beta(2, 3) random variable at 0.1.

Random values are generated using rvs which takes an optional size argument. The size is set to 1 by default.

Example: scipy.stats.norm.rvs(2, 3) generates a random sample from a normal (Gaussian) random variable with mean 2 and standard deviation 3. The function call scipy.stats.norm.rvs(2, 3, size = 10) returns an array of 10 samples from the same distribution.

The command line help() facility does not document the distribution parameterizations, but the external documentation does. Most distributions are parameterized in terms of location and scale. This means, for example, that the exponential distribution is parameterized in terms of its mean, not its rate. Somewhat surprisingly, the exponential distribution has a location parameter. This means, for example, that scipy.stats.expon.pdf(x, 7) evaluates at x the PDF of an exponential distribution with location 7. This is not what I expected. I assumed there would be no location parameter and that the second argument, 7, would be the mean (scale). Instead, the location was set to 7 and the scale was left at its default value 1. Writing scipy.stats.expon.pdf(x, scale=7) would have given the expected result because the default location value is 0.

SciPy also provides constructors for objects representing random variables.

Example: x = scipy.stats.norm(3, 1); x.cdf(2.7) returns the same value as scipy.stats.norm.cdf(2.7, 3, 1).

Constructing objects representing random variables encapsulates the differences between distributions in the constructors. For example, some distributions take more parameters than others and so their object constructors require more arguments. But once a distribution object is created, its PDF, for example, can be called with a single argument. This makes it easier to write code that takes a general distribution object as an argument.

11 thoughts on “Probability distributions in SciPy”

Omer Khalid

20 July 2009 at 12:53

A good step in the right direction…

It would be good John if you could expand more on certain use cases. i.e. if one have a sample space of X as list [], then what steps one should take to calculate the mean and standard deviation.

From that moment onwards, your article provides a clear way forward to get PDF/CDF probabilities..

Thanks,
Omer

Ralf Gommers

20 July 2009 at 12:25

The python help() facility does not do the right thing (not sure why) but ipython does:

In [4]: scipy.stats.norm?

gives the same as the html docs.

Omer Khalid

20 July 2009 at 13:48

If I understood correct; scipy.stats.mean and scipy.stats.std will be sample’s mean and standard deviation but this is not normal distribution yet. right? Because in your examples you are referring to scipy.stats.norm.* ?

I suppose normal-distribution mean/std.deviation could be calculated through these functions: scipy.stats.norm.mean and scipy.stats.norm.std.

Probably it would be best I guess to first fit the sample set to a distribution and get a mean/std.deviation for it. If this could be the right approach then, would fit function will return a distribution type object which one can use for mean, std and probability?

John

20 July 2009 at 13:05

Omer: If you just want to compute sample mean and sample standard deviation, you could use the functions scipy.stats.mean and scipy.stats.std. But if you want to fit parameters for non-normal distributions, look at the fit method on the distribution you’re using to describe your data.

Ralf Gommers

20 July 2009 at 13:48

Actually, I just realized what is going on with python help(). The distributions are class instances, and their docstrings are updated at import time to be instance-specific. IPython is smart enough to then do the right thing, the Python interpreter is not – it just displays the class docstring.

John

20 July 2009 at 14:27

Omer: You can calculate the sample mean and sample variance no matter what distribution your data comes from. However, if your data comes from a normal distribution, the sample mean and sample variance are the best estimators for the population parameters in a technical sense.

Does it look like your data follow a normal distribution? If not, what kind of distribution do they follow? That seems to be the first thing to settle. If you have questions as to the appropriate distribution for your data, perhaps we should continue this discussion via email. See the “About” tab at the top for my contact info.

nick

21 July 2009 at 11:03

I’ve been meaning to see how stat packages organize their distribution families. I thought it would be exactly this, but didn’t want to be hasty. Thanks so much, now I know the right way to proceed in my application!

Nam

17 February 2012 at 09:37

Thanks for explanation!

Djeed

1 March 2012 at 04:27

Thanks for the quick tutorial!

What if I would like to generate a distribution from data?

I have a plot, number of particles vs energy. I would like to get the pdf, cdf and ppf out of it. How would you do that?

John

1 March 2012 at 07:45

Fitting a distribution to data is a more subtle question. First of all, you have to decide exactly what you mean. Do you want to what well-known distribution fits your data? (See this post.) That’s somewhat of a historical question because it’s limiting the search to distributions that other people have given names to. Why should your data have such a distribution?

If you want to be more empirical about fitting your data, you have to decide how much smoothing you want. If you do no smoothing, then the histogram of your data is the PDF, and it’s cumulative sum is the CDF. But that’s not usually what people have in mind. “Kernel smoothing” is the technical term for fitting a smooth density to data.

In any case, you have choices to make, so it’s not as simple as finding “the” PDF.

Djeed

1 March 2012 at 08:53

Dear John, thank you so much for your comments. I would like to clarify my question

My data consists of electron counts vs energy, that is n(E). It is a smooth distribution already, but not feasible for any well-known distribution. In fact, the pdf is actually:
pdf = n(E) / N (N the integral of n(E) in the whole region)

Now, I should run some simulations, so I need to generate m particles in such a distribution.
The problem is that I can’t use ‘scipy.distribution.rvs()’, since that distribution is not included.

Would you have any thoughts on how to proceed?

Comments are closed.