I’ve added a page for endorsements to my site. Thanks to everyone who let me use their photo and quote.
If you’d like to contribute an endorsement, please contact me.
Mathematical tools such as Bayesian analysis and differential equations allow you to combine your intuition and data to make better decisions.
For over twenty years, I have created and implemented mathematical models to solve problems in business, science, and engineering. Some areas of application include risk assessment, adaptive clinical trial design, computer hardware reliability, and software optimization.
Projects can fail because no one knows each of the pieces well enough to bring them all together.
My role on projects has often been to be the interpreter and integrator. I bring different areas of math together to solve problems. I bring math and software together to implement solutions, and I bring people together by interpreting between scientists, software developers, and business leaders.
I speak the native language of engineers and can communicate technical information to a wider, non-technical audience.
For example, I have helped lawyers understand and convey probability. I have helped salesmen understand what scientific articles are saying about their product and how they can convey that information to customers. I have helped business understand and mitigate risks.
The appeal of magic is that it promises to render objects plastic to the will without one’s getting too entangled with them. Treated at arm’s length, the object can issue no challenge to the self. … The clearest contrast … that I can think of is the repairman, who must submit himself to the broken washing machine, listen to it with patience, notice its symptoms, and then act accordingly. He cannot treat it abstractly; the kind of agency he exhibits is not at all magical.
Related post: Programming languages and magic
From Freeman Dyson:
Economic forecasting is useful for predicting the future up to about ten years ahead. Beyond ten years the quantitative changes which the forecast accesses are usually sidetracked or made irrelevant by qualitative changes in the rules of the game. Qualitative changes are produced by human cleverness … or by human stupidity … Neither cleverness nor stupidity are predictable.
Source: Infinite in All Directions, Chapter 10, Engineers’ Dreams.
From JPL scientist Rich Terrile:
In everyone’s pocket right now is a computer far more powerful than the one we flew on Voyager, and I don’t mean your cell phone—I mean the key fob that unlocks your car.
These days technology is equated with computer technology. For example, the other day I heard someone talk about bringing chemical engineering and technology together, as if chemical engineering isn’t technology. If technology only means computer technology, then the Voyager probes are very low-tech.
And yet Voyager 1 has left the solar system! (Depending on how you define the solar system.*) It’s the most distant man-made object, about 20 billion kilometers away. It’s still sending back data 38 years after it launched, and is expected to keep doing so for a few more years before its power supply runs too low. Voyager 2 is doing fine as well, though it’s taking longer to leave the solar system. Surely this is a far greater technological achievement than a key fob.
* * *
* Voyager 1 has left the heliosphere, far beyond Pluto, and is said to be in the “interstellar medium.” But it won’t reach the Oort cloud for another 300 years and won’t leave the Oort cloud for 30,000 years.
Monte Carlo integration has been called “Integration by Darts,” a clever pun on “integration by parts.” I ran across the phrase looking at some slides by Brian Hayes, but apparently it’s been around a while. The explanation that Monte Carlo is “integration by darts” is fine as a 0th order explanation, but it can be misleading.
Introductory courses explain Monte Carlo integration as follows.
In principle this is correct, but this is far from how Monte Carlo integration is usually done in practice.
For one thing, Monte Carlo integration is seldom used to integrate functions of one variable. Instead, it is mostly used on functions of many variables, maybe hundreds or thousands of variables. This is because more efficient methods exist for low-dimensional integrals, but very high dimensional integrals can usually only be computed using Monte Carlo or some variation like quasi-Monte Carlo.
If you draw a box around your integrand, especially in high dimensions, it may be that nearly all your darts fall outside the region you’re interested in. For example, suppose you throw a billion darts and none land inside the volume determined by your integration problem. Then the point estimate for your integral is 0. Assuming the true value of the integral is positive, the relative error in your estimate is 100%. You’ll need a lot more than a billion darts to get an accurate estimate. But is this example realistic? Absolutely. Nearly all the volume of a high-dimensional cube is in the “corners” and so putting a box around your integrand is naive. (I’ll elaborate on this below. )
So how do you implement Monte Carlo integration in practice? The next step up in sophistication is to use “importance sampling.”  Conceptually you’re still throwing darts at a box, but not with a uniform distribution. You find a probability distribution that approximately matches your integrand, and throw darts according to that distribution. The better the fit, the more efficient the importance sampler. You could think of naive importance sampling as using a uniform distribution as the importance sampler. It’s usually not hard to find an importance sampler much better than that. The importance sampler is so named because it concentrates more samples in the important regions.
Importance sampling isn’t the last word in Monte Carlo integration, but it’s a huge improvement over naive Monte Carlo.
* * *
 So what does it mean to say most of the volume of a high-dimensional cube is in the corners? Suppose you have an n-dimensional cube that runs from -1 to 1 in each dimension and you have a ball of radius 1 inside the cube. To make the example a little simpler, assume n is even, n = 2k. Then the volume of the cube is 4k and the volume of the sphere is πk / k!. If k = 1 (n = 2) then the sphere (circle in this case) takes up π/4 of the volume (area), about 79%. But when k = 100 (n = 200), the ball takes up 3.46×10-169 of the volume of the cube. You could never generate enough random samples from the cube to ever hope to land a single point inside the ball.
 In a nutshell, importance sampling replaces the problem of integrating f(x) with that of integrating (f(x) / g(x)) g(x) where g(x) is the importance sampler, a probability density. Then the integral of (f(x) / g(x)) g(x) is the expected value of (f(X) / g(X)) where X is a random variable with density given by the importance sampler. It’s often a good idea to use an importance sampler with slightly heavier tails than the original integrand.
If you’d like some help with numerical integration, let me know.
Bayesian analysis and Frequentist analysis often lead to the same conclusions by different routes. But sometimes the two forms of analysis lead to starkly different conclusions.
The following illustration of this difference comes from a talk by Luis Pericci last week. He attributes the example to “Bernardo (2010)” though I have not been able to find the exact reference.
In an experiment to test the existence of extra sensory perception (ESP), researchers wanted to see whether a person could influence some process that emitted binary data. (I’m going from memory on the details here, and I have not found Bernardo’s original paper. However, you could ignore the experimental setup and treat the following as hypothetical. The point here is not to investigate ESP but to show how Bayesian and Frequentist approaches could lead to opposite conclusions.)
The null hypothesis was that the individual had no influence on the stream of bits and that the true probability of any bit being a 1 is p = 0.5. The alternative hypothesis was that p is not 0.5. There were N = 104,490,000 bits emitted during the experiment, and s = 52,263,471 were 1’s. The p-value, the probability of an imbalance this large or larger under the assumption that p = 0.5, is 0.0003. Such a tiny p-value would be regarded as extremely strong evidence in favor of ESP given the way p-values are commonly interpreted.
The Bayes factor, however, is 18.7, meaning that the null hypothesis appears to be about 19 times more likely than the alternative. The alternative in this example uses Jeffreys’ prior, Beta(0.5, 0.5).
So given the data and assumptions in this example, the Frequentist concludes there is very strong evidence for ESP while the Bayesian concludes there is strong evidence against ESP.
The following Python code shows how one might calculate the p-value and Bayes factor.
from scipy.stats import binom from scipy import log, exp from scipy.special import betaln N = 104490000 s = 52263471 # sf is the survival function, i.e. complementary cdf # ccdf multiplied by 2 because we're doing a two-sided test print("p-value: ", 2*binom.sf(s, N, 0.5)) # Compute the log of the Bayes factor to avoid underflow. logbf = N*log(0.5) - betaln(s+0.5, N-s+0.5) + betaln(0.5, 0.5) print("Bayes factor: ", exp(logbf))
I’ve resisted using the term “data science,” and enjoy poking fun at it now and then, but I’ve decided it’s not such a bad label after all.
Here are some of the pros and cons of the term. (Listing “cons” first seems backward, but I’m currently leaning toward the pro side, so I thought I should conclude with it.)
The term “data scientist” is sometimes used to imply more novelty than is there. There’s not a great deal of difference between data science and statistics, though the new term is more fashionable. (Someone quipped that data science is statistics on a Mac.)
Similarly, the term data scientist is sometimes used as an excuse for ignorance, as in “I don’t understand probability and all that stuff, but I don’t need to because I’m a data scientist, not a statistician.”
The big deal about data science isn’t data but the science of drawing inferences from the data. Inference science would be a better term, in my opinion, but that term hasn’t taken off.
Data science could be a useful umbrella term for statistics, machine learning, decision theory, etc. Also, the title data scientist is rightfully associated with people who have better computational skills than statisticians typically have.
While the term data science isn’t perfect, there’s little to recommend the term statistics other than that it is well established. The root of statistics is state, as in a government. This is because statistics was first applied to the concerns of bureaucracies. The term statistics would be equivalent to governmentistics, a historically accurate but otherwise useless term.
To tell whether a statement about data is over-hyped, see whether it retains its meaning if you replace data with measurements.
So a request like “Please send me the data from your experiment” becomes “Please send me the measurements from your experiment.” Same thing.
But rousing statements about the power of data become banal or even ridiculous. For example, here’s an article from Forbes after substituting measurements for data:
The Hottest Jobs In IT: Training Tomorrow’s Measurements Scientists
If you thought good plumbers and electricians were hard to find, try getting hold of a measurements scientist. The rapid growth of big measurements and analytics for use within businesses has created a huge demand for people capable of extracting knowledge from measurements.
Some of the top positions in demand include business intelligence analysts, measurements architects, measurements warehouse analysts and measurements scientists, Reed says. “We believe the demand for measurements expertise will continue to grow as more companies look for ways to capitalize on this information,” he says.
Arguments over the difference between statistics and machine learning are often pointless. There is a huge overlap between the two approaches to analyzing data, sometimes obscured by differences in vocabulary. However, there is one distinction that is helpful. Statistics aims to build accurate models of phenomena, implicitly leaving the exploitation of these models to others. Machine learning aims to solve problems more directly, and sees its models as intermediate artifacts; if an unrealistic model leads to good solutions, it’s good enough.
This distinction is valid in broad strokes, though things are fuzzier than it admits. Some statisticians are content with constructing models, while others look further down the road to how the models are used. And machine learning experts vary in their interest in creating accurate models.
Clinical trial design usually comes under the heading of statistics, though in spirit it’s more like machine learning. The goal of a clinical trial is to answer some question, such as whether a treatment is safe or effective, while also having safeguards in place to stop the trial early if necessary. There is an underlying model—implicit in some methods, more often explicit in newer methods—that guides the conduct of the trial, but the accuracy of this model per se is not the primary goal. Some designs have been shown to be fairly robust, leading to good decisions even when the underlying probability model does not fit well. For example, I’ve done some work with clinical trial methods that model survival times with an exponential distribution. No one believes that an exponential distribution, i.e. one with constant hazard, accurately models survival times in cancer trials, and yet methods using these models do a good job of stopping trials early that should stop early and letting trials continue that should be allowed to continue.
Experts in machine learning are more accustomed to the idea of inaccurate models sometimes producing good results. The best example may be naive Bayes classifiers. The word “naive” in the name is a frank admission that these classifiers model as independent events known not to be independent. These methods can do well at their ultimate goal, such as distinguishing spam from legitimate email, even though they make a drastic simplifying assumption.
There have been papers that look at why naive Bayes works surprisingly well. Naive Bayes classifiers work well when the errors due to wrongly assuming independence effect positive and negative examples roughly equally. The inaccuracies of the model sort of wash out when the model is reduced to a binary decision, classifying as positive or negative. Something similar happens with the clinical trial methods mentioned above. The ultimate goal is to make correct go/no-go decisions, not to accurately model survival times. The naive exponential assumption effects both trials that should and should not stop, and the model predictions are reduced to a binary decision.
Sometimes you only need a rough fit to some data and a triangular distribution will do. As the name implies, this is a distribution whose density function graph is a triangle. The triangle is determined by its base, running between points a and b, and a point c somewhere in between where the altitude intersects the base. (c is called the foot of the altitude.) The height of the triangle is whatever it needs to be for the area to equal 1 since we want the triangle to be a probability density.
One way to fit a triangular distribution to data would be to set a to the minimum value and b to the maximum value. You could pick a and b are the smallest and largest possible values, if these values are known. Otherwise you could use the smallest and largest values in the data, or make the interval a little larger if you want the density to be positive at the extreme data values.
How do you pick c? One approach would be to pick it so the resulting distribution has the same mean as the data. The triangular distribution has mean
(a + b + c)/3
so you could simply solve for c to match the sample mean.
Another approach would be to pick c so that the resulting distribution has the same median as the data. This approach is more interesting because it cannot always be done.
Suppose your sample median is m. You can always find a point c so that half the area of the triangle lies to the left of a vertical line drawn through m. However, this might require the foot c to be to the left or the right of the base [a, b]. In that case the resulting triangle is obtuse and so sides of the triangle do not form the graph of a function.
For the triangle to give us the graph of a density function, c must be in the interval [a, b]. Such a density has a median in the range
[b – (b – a)/√2, a + (b – a)/√2].
If the sample median m is in this range, then we can solve for c so that the distribution has median m. The solution is
c = b – 2(b – m)2 / (b – a)
if m < (a + b)/2 and
c = a + 2(a – m)2 / (b – a)
If you train a model on a set of data, it should fit that data well. The hope, however, is that it will fit a new set of data well. So in machine learning and statistics, people split their data into two parts. They train the model on one half, and see how well it fits on the other half. This is called cross validation, and it helps prevent over-fitting, fitting a model too closely to the peculiarities of a data set.
For example, suppose you have measured the value of a function at 100 points. Unbeknownst to you, the data come from a cubic polynomial plus some noise. You can fit these 100 points exactly with a 99th degree polynomial, but this gives you the illusion that you’ve learned more than you really have. But if you divide your data into test and training sets of 50 points each, overfitting on the training set will result in a terrible fit on the test set. If you fit a cubic polynomial to the training data, you should do well on the test set. If you fit a 49th degree polynomial to the training data, you’ll fit it perfectly, but do a horrible job with the test data.
Now suppose we have two kinds of models to fit. We train each on the training set, and pick the one that does better on the test set. We’re not over-fitting because we haven’t used the test data to fit our model. Except we really are: we used the test set to select a model, though we didn’t use the test set to fit the parameters in the two models. Think of a larger model as a tree. The top of the tree tells you which model to pick, and under that are the parameters for each model. When we think of this new hierarchical model as “the model,” then we’ve used our test data to fit part of the model, namely to fit the bit at the top.
With only two models under consideration, this isn’t much of a problem. But if you have a machine learning package that tries millions of models, you can be over-fitting in a subtle way, and this can give you more confidence in your final result than is warranted.
The distinction between parameters and models is fuzzy. That’s why “Bayesian model averaging” is ultimately just Bayesian modeling. You could think of the model selection as simply another parameter. Or you could go the other way around and think of of each parameter value as an index for a family of models. So if you say you’re only using the test data to select models, not parameters, you could be fooling yourself.
For example, suppose you want to fit a linear regression to a data set. That is, you want to pick m and b so that y = mx + b is a good fit to the data. But now I tell you that you are only allowed to fit models with one degree of freedom. You’re allowed to do cross validation, but you’re only allowed to use the test data for model selection, not model fitting.
So here’s what you could do. Pick a constant value of b, call it b0. Now fit the one-parameter model y = mx + b0 on your training data, selecting the value of m only to minimize the error in fitting the training set. Now pick another value of b, call it b1, and see how well it does on the test set. Repeat until you’ve found the best value of b. You’ve essentially used the training and test data to fit a two-parameter model, albeit awkwardly.
I suspect there’s a huge opportunity in moving mathematics from the pure column to the applied column. There may be a lot of useful math that never sees application because the experts are unconcerned with or unaware of applications.
In particular I wonder what applications there may be of number theory, especially analytic number theory. I’m not thinking of the results of number theory but rather the elegant machinery developed to attack problems in number theory. I expect more of this machinery could be useful to problems outside of number theory.
I also wonder about category theory. The theory certainly finds uses within pure mathematics, but I’m not sure how useful it is in direct application to problems outside of mathematics. Many of the reported applications don’t seem like applications at all, but window dressing applied after-the-fact. On the other hand, there are also instances where categorical thinking led the way to a solution, but did its work behind the scenes; once a solution was in hand, it could be presented more directly without reference to categories. So it’s hard to say whether applications of category theory are over-reported or under-reported.
The mathematical literature can be misleading. When researchers say their work has various applications, they may be blowing smoke. At the same time, there may be real applications that are never mentioned in journals, either because the work is proprietary or because it is not deemed original in the academic sense of the word.
In Book VIII of Paradise Lost, the angel Raphael tells Adam what difficulties men will have with astronomy:
Hereafter, when they come to model heaven
And calculate the stars: how they will wield the
The mighty frame, how build, unbuild, contrive
To save appearances, how gird the sphere
With centric and eccentric scribbled o’er,
Cycle and epicycle, orb in orb.
Related post Quaternions in Paradise Lost
Every positive integer is either part of the sequence ⌊ nπ ⌋ or the sequence ⌊ nπ/(π – 1) ⌋ where n ranges over positive integers, and no positive integer is in both sequences.
This is a special case of Beatty’s theorem.
One objection to modeling adult heights with a normal distribution is that the former is obviously positive but the latter can be negative. However, by this model negative heights are astronomically unlikely. I’ll explain below how one can take “astronomically” literally in this context.
A common model says that men’s and women’s heights are normally distributed with means of 70 and 64 inches respectively, both with a standard deviation of 3 inches. A woman with negative height would be 21.33 standard deviations below the mean, and a man with negative height would be 23.33 standard deviations below the mean. These events have probability 3 × 10-101 and 10-120 respectively. Or to write them out in full
As I mentioned on Twitter yesterday, if you’re worried about probabilities that require scientific notation to write down, you’ve probably exceeded the resolution of your model. I imagine most probability models are good to two or three decimal places at most. When model probabilities are extremely small, factors outside the model become more important than ones inside.
According to Wolfram Alpha, there are around 1080 atoms in the universe. So picking one particular atom at random from all atoms in the universe would be on the order of a billion trillion times more likely than running into a woman with negative height. Of course negative heights are not just unlikely, they’re impossible. As you travel from the mean out into the tails, the first problem you encounter with the normal approximation is not that the probability of negative heights is over-estimated, but that the probability of extremely short and extremely tall people is under-estimated. There exist people whose heights would be impossibly unlikely according to this normal approximation. See examples here.
Probabilities such as those above have no practical value, but it’s interesting to see how you’d compute them anyway. You could find the probability of a man having negative height by typing
pnorm(-23.33) into R or
scipy.stats.norm.cdf(-23.33) into Python. Without relying on such software, you could use the bounds
with x equal to -21.33 and -23.33. For a proof of these bounds and tighter bounds see these notes.