Here are some notes on how to work with probability distributions using the SciPy numerical library for Python.
Functions related to probability distributions are located in scipy.stats. The general pattern is scipy.stats.<distribution family>.<function>.
There are 81 supported continuous distribution families and 12 discrete distribution families. Some distributions have obvious names: gamma, cauchy, t, f, etc. The only possible surprise is that all distribution names begin with a lower-case letter, even those corresponding to a proper name (e.g. cauchy for the Cauchy distribution). Other distribution names are less obvious: expon for the exponential distribution, chi2 for the chi-squared distribution, etc.
Each distribution supports several functions. The density and cumulative distribution functions are pdf and cdf respectively. (Discrete distributions use pmf rather than pdf.) One surprise here is that the inverse CDF is called ppf, for “percentage point function.” I’d never heard that terminology and would have expected something like “quantile.”

Example: scipy.stats.beta.cdf(0.1, 2, 3) evaluates the CDF of a beta(2, 3) random variable at 0.1.
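To see how ppf relates to cdf, here is a minimal runnable sketch checking that ppf inverts cdf for the same beta(2, 3) distribution:

    import scipy.stats

    # CDF of a beta(2, 3) random variable evaluated at 0.1
    p = scipy.stats.beta.cdf(0.1, 2, 3)

    # ppf is the inverse CDF, so it recovers the original point
    x = scipy.stats.beta.ppf(p, 2, 3)
    print(p, x)  # x is approximately 0.1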
Random values are generated using rvs, which takes an optional size argument. The size is set to 1 by default.

Example: scipy.stats.norm.rvs(2, 3) generates a random sample from a normal (Gaussian) random variable with mean 2 and standard deviation 3. The function call scipy.stats.norm.rvs(2, 3, size=10) returns an array of 10 samples from the same distribution.
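A minimal sketch of both calls (the random_state argument is an assumption here — it exists in recent SciPy versions and is used only to make the output reproducible):

    import scipy.stats

    # one draw from a normal distribution with mean 2, standard deviation 3
    x = scipy.stats.norm.rvs(2, 3, random_state=42)

    # an array of 10 draws from the same distribution
    xs = scipy.stats.norm.rvs(2, 3, size=10, random_state=42)
    print(x)
    print(xs)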
The command line help() facility does not document the distribution parameterizations, but the external documentation does. Most distributions are parameterized in terms of location and scale. This means, for example, that the exponential distribution is parameterized in terms of its mean, not its rate. Somewhat surprisingly, the exponential distribution also has a location parameter. This means that scipy.stats.expon.pdf(x, 7) evaluates at x the PDF of an exponential distribution with location 7. This is not what I expected. I assumed there would be no location parameter and that the second argument, 7, would be the mean (scale). Instead, the location was set to 7 and the scale was left at its default value of 1. Writing scipy.stats.expon.pdf(x, scale=7) would have given the expected result because the default location is 0.
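A small sketch contrasting the two calls (the evaluation point x = 10 is arbitrary, chosen only for illustration):

    import scipy.stats

    x = 10.0

    # The positional second argument is the location: this is an exponential
    # density shifted to start at 7, with the scale (mean) left at 1.
    p_loc = scipy.stats.expon.pdf(x, 7)          # exp(-(10 - 7))

    # Naming the argument sets the scale (the mean), with location 0.
    p_scale = scipy.stats.expon.pdf(x, scale=7)  # (1/7) * exp(-10/7)

    print(p_loc, p_scale)  # about 0.0498 vs. 0.0342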
SciPy also provides constructors for objects representing random variables.

Example: x = scipy.stats.norm(3, 1); x.cdf(2.7) returns the same value as scipy.stats.norm.cdf(2.7, 3, 1).
Constructing objects representing random variables encapsulates the differences between distributions in the constructors. For example, some distributions take more parameters than others and so their object constructors require more arguments. But once a distribution object is created, its PDF, for example, can be called with a single argument. This makes it easier to write code that takes a general distribution object as an argument.
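For example, here is a minimal sketch of such a generic function; the name tail_probability is made up for illustration:

    import scipy.stats

    def tail_probability(dist, x):
        """Probability that a random variable described by the frozen
        distribution object dist exceeds the value x."""
        return 1 - dist.cdf(x)

    # The constructors differ, but the calls are uniform.
    print(tail_probability(scipy.stats.norm(3, 1), 2.7))
    print(tail_probability(scipy.stats.beta(2, 3), 0.1))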
A good step in the right direction…
It would be good, John, if you could expand on certain use cases, e.g. if one has a sample of X as a list, what steps should one take to calculate the mean and standard deviation?
From that point onward, your article provides a clear way forward to get PDF/CDF probabilities.
Thanks,
Omer
The Python help() facility does not do the right thing (not sure why), but IPython does:
In [4]: scipy.stats.norm?
gives the same information as the HTML docs.
If I understood correctly, scipy.stats.mean and scipy.stats.std will give a sample’s mean and standard deviation, but this is not a normal distribution yet, right? Because in your examples you are referring to scipy.stats.norm.*?
I suppose the normal distribution’s mean/standard deviation could be calculated through these functions: scipy.stats.norm.mean and scipy.stats.norm.std.
Probably it would be best, I guess, to first fit the sample set to a distribution and get a mean/standard deviation from it. If this is the right approach, would the fit function return a distribution-type object which one can use for mean, std and probability?
Omer: If you just want to compute sample mean and sample standard deviation, you could use the functions scipy.stats.mean and scipy.stats.std. But if you want to fit parameters for non-normal distributions, look at the fit method on the distribution you’re using to describe your data.

Actually, I just realized what is going on with Python help(). The distributions are class instances, and their docstrings are updated at import time to be instance-specific. IPython is smart enough to then do the right thing; the Python interpreter is not – it just displays the class docstring.
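Regarding the fit method: note that it returns estimated parameters rather than a distribution object; you can then freeze a distribution at those parameters. A minimal sketch, assuming purely for illustration that the data are described by a gamma distribution (the sample values below are made up):

    import scipy.stats

    # hypothetical sample data
    data = [1.2, 0.7, 2.3, 1.9, 0.4, 3.1, 1.5, 0.9]

    # fit returns estimated parameters: shape(s) first, then loc, then scale
    a, loc, scale = scipy.stats.gamma.fit(data)

    # freeze a distribution at the fitted parameters to get mean, std, pdf, ...
    dist = scipy.stats.gamma(a, loc, scale)
    print(dist.mean(), dist.std())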
Omer: You can calculate the sample mean and sample variance no matter what distribution your data comes from. However, if your data comes from a normal distribution, the sample mean and sample variance are the best estimators for the population parameters in a technical sense.
Does it look like your data follow a normal distribution? If not, what kind of distribution do they follow? That seems to be the first thing to settle. If you have questions as to the appropriate distribution for your data, perhaps we should continue this discussion via email. See the “About” tab at the top for my contact info.
I’ve been meaning to see how stat packages organize their distribution families. I thought it would be exactly this, but didn’t want to be hasty. Thanks so much, now I know the right way to proceed in my application!
Thanks for explanation!
Thanks for the quick tutorial!
What if I would like to generate a distribution from data?
I have a plot, number of particles vs energy. I would like to get the pdf, cdf and ppf out of it. How would you do that?
Fitting a distribution to data is a more subtle question. First of all, you have to decide exactly what you mean. Do you want to know what well-known distribution fits your data? (See this post.) That’s somewhat of a historical question because it limits the search to distributions that other people have given names to. Why should your data have such a distribution?
If you want to be more empirical about fitting your data, you have to decide how much smoothing you want. If you do no smoothing, then the histogram of your data is the PDF, and its cumulative sum is the CDF. But that’s not usually what people have in mind. “Kernel smoothing” is the technical term for fitting a smooth density to data.
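As a sketch of the kernel smoothing route, using SciPy’s gaussian_kde (the data here are simulated, purely for illustration):

    import numpy as np
    import scipy.stats

    # made-up data standing in for real measurements
    data = np.random.default_rng(0).exponential(scale=2.0, size=500)

    # fit a smooth density to the data by Gaussian kernel smoothing
    kde = scipy.stats.gaussian_kde(data)

    # evaluate the smoothed PDF on a grid; a CDF can be approximated by
    # cumulatively summing the PDF times the grid spacing
    grid = np.linspace(0, 10, 200)
    pdf = kde(grid)
    cdf = np.cumsum(pdf) * (grid[1] - grid[0])
    print(cdf[-1])  # should be close to 1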
In any case, you have choices to make, so it’s not as simple as finding “the” PDF.
Dear John, thank you so much for your comments. I would like to clarify my question.

My data consist of electron counts vs. energy, that is, n(E). It is already a smooth distribution, but it does not match any well-known distribution. In fact, the PDF is:

pdf = n(E) / N, where N is the integral of n(E) over the whole region.
Now I need to run some simulations, so I need to generate m particles from such a distribution.

The problem is that I can’t use scipy.stats.<distribution>.rvs(), since that distribution is not included.
Would you have any thoughts on how to proceed?