In the previous post, I looked at truncated probability distributions. A truncated normal distribution, for example, lives on some interval and has a density proportional to a normal distribution; the proportionality constant is whatever it has to be to make the density integrate to 1.
Suppose you wanted to generate random samples from a normal distribution truncated to [a, b]. How would you go about it? You could generate a sample x from the (full) normal distribution, and if a ≤ x ≤ b, return x. Otherwise, try again. Keep trying until you get a sample inside [a, b], and when you get one, return it.
This is called the accept-reject method. It may or may not be the most efficient way to generate samples from a truncated distribution. It works well if the probability of a normal sample landing in [a, b] is high. But if the interval is small, many samples may be rejected before one is accepted, and more efficient algorithms exist. You might also choose a different algorithm if you care more about consistent run time than minimal average run time.
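As a minimal sketch of the accept-reject approach in Python (the function name and interface are my own, using only the standard library):

```python
import random

def truncated_normal_sample(a, b, mu=0.0, sigma=1.0):
    """Draw from a normal(mu, sigma) truncated to [a, b] by accept-reject:
    sample the full normal repeatedly until a draw lands in the interval."""
    while True:
        x = random.gauss(mu, sigma)
        if a <= x <= b:
            return x

# Every returned sample is guaranteed to lie in [a, b].
samples = [truncated_normal_sample(-1.0, 1.0) for _ in range(10_000)]
```

Note that the loop has no fixed iteration count, which is why run time is only consistent on average.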
Now consider a different kind of sampling. As before, we generate a random sample x from the full normal distribution and if a ≤ x ≤ b we return it. But if x > b we return b. And if x < a we return a. That is, we clip x to be in [a, b]. I’ll call this a clipped normal distribution. (I don’t know whether this name is standard, or even if there is a standard name for this.)
Clipped distributions are not truncated distributions. They’re really a mixture of three distributions. Start with a continuous random variable X on the real line, and let Y be X clipped to [a, b]. Then Y is a mixture of a continuous distribution on [a, b] and two point masses: Y takes on the value a with probability P(X < a) and Y takes on the value b with probability P(X > b).
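The clipping variant is a one-liner, and the point masses at the endpoints show up empirically. A sketch (names are illustrative):

```python
import random

def clipped_normal_sample(a, b, mu=0.0, sigma=1.0):
    """Draw from the full normal(mu, sigma), then clip the result to [a, b]."""
    x = random.gauss(mu, sigma)
    return min(max(x, a), b)

samples = [clipped_normal_sample(-1.0, 1.0) for _ in range(100_000)]

# For a standard normal, P(X < -1) = P(X > 1) ~ 0.159, so roughly 16%
# of the samples land exactly on each endpoint: the two point masses.
frac_at_a = sum(x == -1.0 for x in samples) / len(samples)
frac_at_b = sum(x == 1.0 for x in samples) / len(samples)
```

A truncated sampler never returns the endpoint values with positive probability; the clipped sampler returns them about 16% of the time each in this example.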
Clipped distributions are mostly a nuisance. Maybe the values are clipped due to the resolution of some sensor: nobody wants to clip the values, that’s just a necessary artifact. Or maybe values are clipped because some form only allows values within a given range; maybe the person designing the form didn’t realize the true range of possible values.
It’s common to clip extreme data values while deidentifying data to protect privacy. This means that not only is some data deliberately inaccurate, it also means that the distribution changes slightly since it’s now a clipped distribution. If that’s unacceptable, there are other ways to balance privacy and statistical utility.
Couldn’t this take forever? In theory, yes, but with probability zero. The number of attempts follows a geometric distribution, which you can use to find the expected number of attempts.
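Since the number of attempts is geometric with success probability p = P(a ≤ X ≤ b), the expected number of attempts is 1/p. A quick sketch using only the standard library (function names are mine):

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution, written in terms of the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def expected_attempts(a, b, mu=0.0, sigma=1.0):
    """Expected number of accept-reject draws: 1 / P(a <= X <= b)."""
    p = normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
    return 1.0 / p

wide = expected_attempts(-1.0, 1.0)   # p ~ 0.683, about 1.5 attempts
narrow = expected_attempts(3.0, 4.0)  # p ~ 0.0013, about 760 attempts
```

This makes the earlier point concrete: a narrow interval out in the tail drives the expected number of rejections up sharply.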
4 thoughts on “Truncated distributions vs clipped distributions”
How does your “clipped” distribution differ from a “censored” distribution?
When I was engineering sensors and instruments, we called the values at which readings clipped “end buckets”. They contain lots of useful information, especially if their time behavior is tracked in addition to their value and count.
“End buckets” are generally an intentional design feature, and may exist for one or more of a variety of reasons. Perhaps surprisingly, few of those reasons are to reduce cost. More often it is simply to extract the linear portion from a very non-linear underlying sensor, where we’d prefer to restrict the range rather than overly process the readings or endanger calibration stability.
Most of the instruments I helped develop were so-called “novel physics” sensors, where we’d optimize the “undesirable side-effects” of a device used in one domain to apply as a sensor in a very different domain. Such as using a single gargantuan cryogenic FET as a hyper-sensitive radiation detector. The original project for which these FETs were created was considered a failure, and we bought up the entire lot at nearly scrap prices, then sold them in systems costing $1M.
That particular sensor had “end buckets” simply because we were unable to use its entire dynamic range while maintaining sensitivity, which is a very delightful (and very rare) problem to have.
Perhaps the most common sensors with “end buckets” are digital cameras. These sensors have no true zero, and their saturation point can vary for multiple reasons. Manufacturers tend to “correct” the pixel values using several methods, unfortunately sometimes even in so-called “raw” images.
In an ultra-high-speed (100,000 fps) video camera I helped create, we used the “end buckets” to dynamically adjust image calibration (gain and offset), and (optionally) exposure and/or frame rate, to maximize the useful range (tiny end buckets near the ADC limits). The user could also specify the limits, when maximum pixel resolution was needed over a limited range.
But why calibrate a video sensor on a frame-by-frame basis at up to 100,000 frames per second? This was driven by another feature of our camera: Its physical construction was designed to survive a lifetime of 100g impacts. Where do you experience such impacts? Two examples are inside cars during crash testing, and during weapons testing. In each case the lighting can change very quickly: the lights often go out during a crash test, and things suddenly get very bright during a weapons test.
Our favorite demo video for this feature was of a tungsten filament while it was being massively over-driven: We had excellent exposures before, during and after it melted.
A customer saw this, then sent us a video of their use: Our camera was placed 10 meters from the impact site of a Tomahawk missile test. The camera shelter was destroyed and the camera was tossed 50 meters with the lens torn off (the camera was fine: the lens mount was designed to be a consumable).
But the video was awesome, with each frame of the explosion being perfectly exposed. Even after the blast wave reached the camera. Even after the lens was torn off. Right up until the power cable snapped. (Data captured but not yet sent down the data cable was battery-backed for 24 hours, to allow time to find the camera and replace the connector).
No, the camera did not have image stabilization.
A clipped distribution might also make sense if some physical or logical constraint prevents a value from being too large or small.
For example, how much precipitation falls in a given location on a given day? As a first approximation, we might imagine a large number of random events that add or subtract some amount from the total, so it might be something like a normal distribution – but we know it can’t be less than 0, so we clip the distribution at 0. The resulting distribution is continuous for positive numbers and has positive mass at 0, which is what we expect to see.
“Clipped” is good, although you might consider calling them “hurdle models” (or two-sided hurdle models?), since the distribution is determined by whether the random variate crosses a threshold or hurdle.
The discussion reminds me of the difference between trimmed means and Winsorized means. The earliest robust statistics for location (other than the median!) were:
1. The trimmed mean, in which you exclude the k largest and k smallest values and compute the mean of the remaining values. This is like a truncation, except the exclusion is based (symmetrically) on the order statistics rather than an interval [a,b].
2. The Winsorized mean, in which the k smallest values are replaced by the (k+1)st value and the k largest values are replaced by the (N-k)th value. Again, the Winsorized mean is a symmetric process on the order statistics, whereas your discussion of clipped distributions is determined by an interval.
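[A small illustration of the two estimators described above, with made-up data and my own function names:

```python
def trimmed_mean(xs, k):
    """Drop the k smallest and k largest values; average the rest."""
    s = sorted(xs)
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

def winsorized_mean(xs, k):
    """Replace the k smallest values with the (k+1)st and the k largest
    with the (N-k)th, then average all N values."""
    s = sorted(xs)
    n = len(s)
    w = [s[k]] * k + s[k:n - k] + [s[n - k - 1]] * k
    return sum(w) / n

data = [2, 3, 4, 5, 8, 10, 10, 100]   # one large outlier
tm = trimmed_mean(data, 1)    # mean of [3, 4, 5, 8, 10, 10] = 6.666...
wm = winsorized_mean(data, 1) # mean of [3, 3, 4, 5, 8, 10, 10, 10] = 6.625
```

Trimming discards the extremes, like truncation; Winsorizing pins them to interior order statistics, like clipping. — Ed.]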