Sometimes you only need a rough fit to some data and a triangular distribution will do. As the name implies, this is a distribution whose density function graph is a triangle. The triangle is determined by its base, running between points *a* and *b*, and a point *c* somewhere in between where the altitude intersects the base. (*c* is called the *foot* of the altitude.) The height of the triangle is whatever it needs to be for the area to equal 1 since we want the triangle to be a probability density.
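To make the picture concrete, here is a minimal Python sketch of the density. The function name `triangular_pdf` is just an illustrative choice. Since the area of a triangle is half the base times the height, the height has to be 2/(*b* – *a*).

```python
def triangular_pdf(x, a, b, c):
    """Density of the triangular distribution on [a, b] with foot (mode) c."""
    if not (a < b and a <= c <= b):
        raise ValueError("need a < b and a <= c <= b")
    height = 2.0 / (b - a)  # area = (1/2) * base * height = 1
    if x < a or x > b:
        return 0.0
    if x < c:
        return height * (x - a) / (c - a)  # rising side, from a up to c
    if x > c:
        return height * (b - x) / (b - c)  # falling side, from c down to b
    return height  # x == c, the peak
```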

One way to fit a triangular distribution to data would be to set *a* to the minimum value and *b* to the maximum value. You could pick *a* and *b* as the smallest and largest *possible* values, if these values are known. Otherwise you could use the smallest and largest values in the data, or make the interval a little larger if you want the density to be positive at the extreme data values.

How do you pick *c*? One approach would be to pick it so the resulting distribution has the same mean as the data. The triangular distribution has mean

(*a* + *b* + *c*)/3

so you could simply solve for *c* to match the sample mean.
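Here is a minimal sketch of that calculation in Python. The function name `foot_from_mean` and the sample data are made up for illustration. One caveat: the *c* that matches the mean is only usable if it lands in [*a*, *b*], which happens when the sample mean is between (2*a* + *b*)/3 and (*a* + 2*b*)/3, so the sketch returns `None` otherwise.

```python
from statistics import mean

def foot_from_mean(a, b, sample_mean):
    """Solve (a + b + c)/3 = sample_mean for c; None if c is outside [a, b]."""
    c = 3.0 * sample_mean - a - b
    return c if a <= c <= b else None

# Hypothetical data, using the sample min and max as the base of the triangle
data = [1.0, 2.0, 2.5, 3.0, 6.0]
c = foot_from_mean(min(data), max(data), mean(data))
```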

Another approach would be to pick *c* so that the resulting distribution has the same *median* as the data. This approach is more interesting because the range of medians a triangular distribution can have is less obvious, and so it cannot always be done.

Suppose your sample median is *m*. You can always find a point *c* so that half the area of the triangle lies to the left of a vertical line drawn through *m*. However, this might require the foot *c* to be to the left or the right of the base [*a*, *b*]. In that case the resulting triangle is obtuse and so the sides of the triangle do not form the graph of a function.

For the triangle to give us the graph of a density function, *c* must be in the interval [*a*, *b*]. Such a density has a median in the range

[*b* – (*b* – *a*)/√2, *a* + (*b* – *a*)/√2].
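The endpoints of this range come from the two extreme cases, *c* = *a* and *c* = *b*, where the density is a right triangle. A small Python sketch of the feasibility check follows; the function names are illustrative, not from any library.

```python
import math

def median_range(a, b):
    """Interval of medians attainable with a foot c somewhere in [a, b]."""
    w = (b - a) / math.sqrt(2)
    return (b - w, a + w)

def median_is_attainable(a, b, m):
    """True if some triangular density on [a, b] has median m."""
    lo, hi = median_range(a, b)
    return lo <= m <= hi
```

For the unit interval the range works out to roughly [0.293, 0.707], so a sample median close to either end of the data cannot be matched.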

If the sample median *m* is in this range, then we can solve for *c* so that the distribution has median *m*. The solution is

*c* = *b* – 2(*b* – *m*)² / (*b* – *a*)

if *m* < (*a* + *b*)/2 and

*c* = *a* + 2(*m* – *a*)² / (*b* – *a*)

otherwise.
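Putting the two cases together, a sketch in Python might look like the following. The names `foot_from_median` and `triangular_median` are illustrative. The second function uses the standard closed form for the median of a triangular distribution and is there only as a round-trip check on the first.

```python
import math

def foot_from_median(a, b, m):
    """Foot c in [a, b] that gives median m, or None if m is out of range."""
    w = (b - a) / math.sqrt(2)
    if not (b - w <= m <= a + w):
        return None  # no triangular density on [a, b] has this median
    if m < (a + b) / 2:
        c = b - 2 * (b - m) ** 2 / (b - a)
    else:
        c = a + 2 * (m - a) ** 2 / (b - a)
    return min(max(c, a), b)  # guard against rounding just past an endpoint

def triangular_median(a, b, c):
    """Median of the triangular distribution on [a, b] with foot c."""
    if c >= (a + b) / 2:
        return a + math.sqrt((b - a) * (c - a) / 2)
    return b - math.sqrt((b - a) * (b - c) / 2)
```

For example, on [0, 1] a sample median of 0.4 gives *c* = 0.28, and plugging that foot back into `triangular_median` returns 0.4 up to rounding.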

* * *


For the obtuse triangle case, would it make sense to switch the base to the long side?

No, because the base must be the range of the data.

How would you *use* a triangular distribution (as above) in your data analysis?

When you need a quick-and-dirty fit, or when a distribution really is triangular. The absolute value of the difference of two independent uniform r.v.s on the same interval has a triangular distribution.

Ohh, the fact that the absolute value of the difference of two uniform r.v.s. has a triangular distribution makes this super relevant!

I’m trying to think of a case where the simplicity of triangular distributions makes up for their inaccuracy, and I’m failing. Nothing in reality has a triangular distribution, and the statistical properties of triangular distributions — especially when you start adding them — can lead you to make really bad decisions.

The classic example is in PERT* and related techniques for project management, where the durations of individual tasks in a project are traditionally modeled with triangular distributions. The distribution of the duration of the project is assumed to be the distribution of the sum of the triangular distributions on the critical path. In practice, this is usually wildly wrong; the tail of the distribution of the sum isn’t nearly as long or fat as it needs to be.

*“Programme Evaluation and Review Technique”, iirc