# Using Python as a statistical calculator

This post is for someone unfamiliar with SciPy who wants to use Python to do basic statistical calculations. More specifically, this post will look at working with common families of random variables—normal, exponential, gamma, etc.—and how to calculate probabilities with these.

All the statistical functions you need will be in the `stats` subpackage of SciPy. You could import everything with

`    >>> from scipy.stats import *`

This will make software developers cringe because it’s good software development practice to only import what you need [1]. But we’re not writing software here; we’re using Python as a calculator.

## Distributions

Every distribution in SciPy has a location parameter `loc` which defaults to 0, and scale parameter `scale` that defaults to 1.

The table below lists additional parameters that some common distributions have.

Distribution SciPy name    Parameters
normal `norm`
exponential `expon`
beta `beta` `shape1`, `shape2`
binomial `binom` `size`, `prob`
chi-squared `chi2` `df`
F `f` `df1`, `df2`
gamma `gamma` `shape`
hypergeometric `hypergeom` `M`, `n`, `N`
Poisson `poisson` `lambda`
Student t `t` `df`

When using any statistical software its good to verify that the software is using the parameterization that you expect. For example, the exponential distribution can be parameterized by rate or by scale. SciPy does everything by scale. One way to test the parameterization is to calculate the mean. For example, if the mean of an exp(100) random variable is 100, you’re software is using the scale paraemterization. If the mean is 1/100, it;s using the rate.

## Functions

For a random variable X from one of the families above, we’ll show how to compute Prob(Xx), Prob(Xx), and how to solve for x given one of these probabilities. And for discrete distributions, we’ll show how to find Prob(X = p).

## CDF: Cumulative distribution function

The probability Prob(Xx) is the CDF of X at x. For example, the probability that a standard normal random variable is less than 2 is

`    norm.cdf(2)`

We didn’t have to specify the location or scale because the standard normal uses default parameters. If X had mean 3 and standard deviation 10 we’d call

`    norm.cdf(2, loc=3, scale=10)`

For another example, suppose we’re working with a gamma distribution with shape 3 and scale 5 and want to know the probability of such a random variable taking on a value less than 2. Then we’d call

`    gamma.cdf(2, 3, scale=5)`

The second argument, 2, is the shape parameter.

## CCDF: Complementary cumulative distribution function

The probability Prob(Xx) is the complementary CDF of X at x, which is 1 minus the CDF at x. SciPy uses the name `sf` for the CCDF. Here “sf” stands for survival function.

Why have both a CDF and CCDF function? One reason is convenience, but a more important reason is numerical accuracy. If a probability p is very near 1, then 1 – p will be partially or completely inaccurate.

For example, let X be a standard normal random variable. The probability that X is greater than 9 is

`    norm.sf(9)`

which is on the order of 10-19. But if we compute

`    1 - norm.cdf(9)`

we get exactly 0. Floating point numbers have 16 figures of precision, and we need 19 to represent the difference between our probability and 1. More on why mathematical libraries have functions that seem unnecessary at first here.

## Inverse CDF

SciPy calls the inverse CDF function `ppf` for percentile point function.

For example, suppose X is a gamma random variable with shape 3 and scale 5, and we want to solve for the value of x such that Prob(Xx) is 0.7. We could do this with

`    gamma.ppf(0.7, 3, scale=5)`

## Inverse CCDF

SciPy calls the inverse CCDF function `isf` for inverse survival function. So, for example,

`    gamma.isf(0.3, 3, scale=5)`

should return the same value as the example above because the point where the right tail has probability 0.3 is the same as the point where the left tail has probability 0.7.

## Probability mass function

For a discrete random variable X we can compute the probability mass function, the probability that X takes on a value x. Unsurprisingly SciPy calls this function `pmf`.

For example, the probability of seeing 6 heads out of 10 flips of a fair coin is computed by

`    binom.pmf(6, 10, 0.5)`

## Getting help

To find out more about a probability distribution, call `help` with that distribution as an argument. For example, you could call

`    help(hypergeom)`

***

[1] The reason to not import more than you need is to reduce namespace confusion. If you see a function `foo`, is that the `foo` function from package A or from package B. The most common way this comes up in statistics is confusion over the gamma function and the gamma distribution. In SciPy, the gamma function is in `scipy.special` while the gamma distribution is in `scipy.stats`. If you’ve imported everything from `scipy.stats` then `gamma` means `scipy.stats.gamma`. You could bring the gamma function into your environment by importing it with another name. For example

`    from scipy.special import gamma as gammaf`

would import the gamma function giving it the name `gammaf`.

## One thought on “Using Python as a statistical calculator”

1. Charles McCreary

Nice summary. Thanks!