# Corners stick out more in high dimensions

High-dimensional geometry is full of surprises. For example, nearly all the area of a high-dimensional sphere is near the equator, and by symmetry it doesn’t matter which equator you take.

Here’s another surprise: corners stick out more in high dimensions. Hypercubes, for example, become pointier as dimension increases.

How might we quantify this? Think of a pyramid and a flag pole. If you imagine a ball centered at the top of a pyramid, a fair proportion of the volume of the ball contains part of the pyramid. But if you do the same for a flag pole, only a small proportion of the ball contains the pole; nearly all the volume of the ball is air.

So one way to quantify how pointy a corner is would be to look at a neighborhood of the corner and measure how much of the neighborhood intersects the solid that the corner is part of. The less volume, the pointier the corner.

Consider a unit square. Put a disk of radius r at a corner, with r < 1. One quarter of that disk will be inside the square. So the proportion of the square near a particular corner is πr²/4, and the proportion of the square near any corner is πr².

Now do the analogous exercise for a unit cube. Look at a ball of radius r < 1 centered at a corner. One eighth of the volume of that ball lies inside the cube. The proportion of the cube’s volume located within a distance r of a particular corner is πr³/6, and the proportion located within a distance r of any corner is 4πr³/3.

79% of the area of a square is within a distance 0.5 of a corner, while the proportion is only 52% for a cube. In that sense, the corners of a cube stick out a little more than the corners of a square.

Now let’s look at a hypercube of dimension n. Let V be the volume of an n-dimensional ball of radius r < 1. Since the hypercube has 2ⁿ corners and 1/2ⁿ of the ball lies inside the cube, the proportion of the hypercube’s volume located within a distance r of a particular corner is V/2ⁿ, and the proportion located within a distance r of any corner is simply V.

The volume V of an n-dimensional ball of radius r is

V = π^(n/2) r^n / Γ(n/2 + 1).

If we fix r and let n vary, this function decreases rapidly as n increases.

Saying that corners stick out more in high dimensions is a corollary of the more widely known fact that a ball in a box takes up less and less volume as the dimension of the ball and the box increase.

Let’s set r = 1/2 and plot how the volume of a ball varies with dimension n.

You could think of this as the volume of a ball sitting inside a unit hypercube, or more relevant to the topic of this post, the proportion of the volume of the hypercube located within a distance 1/2 of a corner.
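Here’s a minimal sketch of such a plot, assuming the volume formula above; the cutoff at dimension 20 is arbitrary.

    from math import pi
    from scipy.special import gamma
    import matplotlib.pyplot as plt

    def ball_volume(n, r):
        # Volume of an n-dimensional ball of radius r
        return pi**(n/2) * r**n / gamma(n/2 + 1)

    dims = list(range(1, 21))
    volumes = [ball_volume(n, 0.5) for n in dims]

    plt.plot(dims, volumes, "o-")
    plt.xlabel("Dimension n")
    plt.ylabel("Volume of ball of radius 1/2")
    plt.show()

By dimension 10 the volume is already down to about 0.0025, so only about a quarter of one percent of the hypercube lies within a distance 1/2 of a corner.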

# Nearly all the area in a high-dimensional sphere is near the equator

Nearly all the area of a high-dimensional sphere is near the equator.  And by symmetry, it doesn’t matter which equator you take. Draw any great circle and nearly all of the area will be near that circle.  This is the canonical example of “concentration of measure.”

What exactly do we mean by “nearly all the area” and “near the equator”? You get to decide. Pick your standard of “nearly all the area,” say 99%, and your definition of “near the equator,” say within 5 degrees. Then it’s always possible to take the dimension high enough that your standards are met. The more demanding your standard, the higher the dimension will need to be, but it’s always possible to pick the dimension high enough.

This result is hard to imagine. Maybe a simulation will help make it more believable.

In the simulation below, we take as our “north pole” the point (1, 0, 0, 0, …, 0). We could pick any unit vector, but this choice is convenient. Our equator is the set of points orthogonal to the pole, i.e. that have first coordinate equal to zero. We draw points randomly from the sphere, compute their latitude (i.e. angle from the equator), and make a histogram of the results.

In three dimensions the effect is weak: the area of a sphere like our planet isn’t particularly concentrated near the equator.

But as we increase the dimension, more and more of the simulated points fall near the equator.

Here’s the code that produced the graphs.

    from scipy.stats import norm
    from math import sqrt, pi, acos, degrees
    import matplotlib.pyplot as plt

    def pt_on_sphere(n):
        # Return random point on unit sphere in R^n.
        # Generate n standard normals and normalize length.
        x = norm.rvs(0, 1, n)
        length = sqrt(sum(x**2))
        return x/length

    def latitude(x):
        # Latitude relative to plane with first coordinate zero.
        angle_to_pole = acos(x[0])  # in radians
        latitude_from_equator = 0.5*pi - angle_to_pole
        return degrees(latitude_from_equator)

    N = 1000  # number of samples

    for n in [3, 30, 300, 3000]:  # dimension of R^n
        latitudes = [latitude(pt_on_sphere(n)) for _ in range(N)]
        plt.hist(latitudes, bins=int(sqrt(N)))
        plt.xlabel("Latitude in degrees from equator")
        plt.title("Sphere in dimension {}".format(n))
        plt.xlim((-90, 90))
        plt.show()


Not only is most of the area near the equator, the amount of area outside a band around the equator decreases very rapidly as you move away from the band. You can see that from the histograms above. They look like a normal (Gaussian) distribution, and in fact we can make that more precise.

If A is a band around the equator containing at least half the area, then the proportion of the area a distance r or greater from A is bounded by exp( -(n-1)r² ). And in fact this holds for any set A containing at least half the area; it doesn’t have to be a band around the equator, just any set of large measure.

# Valuing results and information

Chris Wiggins gave an excellent talk at Rice this afternoon on data science at the New York Times. In the Q&A afterward, someone asked how you would set up a machine learning algorithm where you’re trying to optimize for outcomes and for information.

Here’s how I’ve approached this dilemma in the past. Information and outcomes are not directly comparable, so any objective function combining the two is going to add incommensurate things. One way around this is to put a value not on information per se but on what you’re going to do with the information.

In the context of a clinical trial, you want to treat patients in the trial effectively, and you want a high probability of picking the best treatment at the end. You can’t add patients and probabilities meaningfully. But why do you want to know which treatment is best? So you can treat future patients effectively. The value of knowing which treatment is best is the increase in the expected number of future successful treatments. This gives you a meaningful objective: maximize the expected number of effective treatments, counting both patients in the trial and future patients treated as a result of the trial.

The hard part is guessing what will be done with the information learned in the trial. Of course this isn’t exactly known, but it’s the key to estimating the value of the information. If nobody will be treated any differently in the future no matter how the trial turns out—and unfortunately this may be the case—then the trial should be optimized strictly for the benefit of the people in the trial. On the other hand, if a trial will determine the standard of care for thousands of future patients, then there is more value in picking the best treatment.

# The big deal about neural networks

In their book Computer Age Statistical Inference, Brad Efron and Trevor Hastie give a nice description of neural networks and deep learning.

The knee-jerk response [to neural networks] from statisticians was “What’s the big deal? A neural network is just a nonlinear model, not too different from many other generalizations of linear models.”

While this may be true, neural networks brought a new energy to the field. They could be scaled up and generalized in a variety of ways … And most importantly, they were able to solve problems on a scale far exceeding what the statistics community was used to. This was part computing scale expertise, part liberated thinking and creativity on the part of this computer science community.

After enjoying considerable popularity for a number of years, neural networks were somewhat sidelined by new inventions in the mid 1990’s. … Neural networks were passé. But then they re-emerged with a vengeance after 2010 … the reincarnation now being called deep learning.

# Demystifying artificial intelligence

Computers do what we tell them to do. Period. Any talk of computers doing things they weren’t programmed to do is only a way of speaking. It’s a convenient shorthand when used properly, misleading mysticism when used improperly.

When you write a program

        print(24*7)

you could say that the computer isn’t programmed to print the number 168 in the sense that the code did not say

        print(168)

But of course the computer was programmed to print the number 168. It just wasn’t directly programmed to do so. Instead, it was given data and algorithms to apply to that data to produce the result. The program isn’t very interesting because the degree of indirection is tiny. Artificial intelligence is more interesting because it increases the degree of indirection, but it’s still software instructing a computer to take in data and apply algorithms.

When someone says

A computer did this without being programmed!

mentally edit their statement to say

A computer did this without being directly programmed to do so!

The latter may still be impressive, but it’s not magic.

I was at a presentation once where software vendors were claiming that their software “discovered” the equation of motion for a pendulum. The software wasn’t directly programmed to do this, but it was programmed to read in data and find the best fit to the data from a set of basis functions which included sines and cosines. And it correctly found that a linear combination of sines and cosines best described the pendulum’s motion.
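Here’s a small illustration of that kind of basis-function fitting, with made-up pendulum data and an assumed frequency included among the basis functions. It isn’t the vendor’s software, just a sketch of fitting a linear combination of basis functions by least squares.

    import numpy as np

    # Simulated pendulum angle data: theta(t) = 0.2 cos(3t) plus noise
    t = np.linspace(0, 10, 200)
    theta = 0.2*np.cos(3*t) + 0.01*np.random.randn(len(t))

    # Candidate basis functions: constant, polynomial, sine, and cosine terms
    basis = np.column_stack([np.ones_like(t), t, t**2,
                             np.sin(3*t), np.cos(3*t)])

    # Least-squares fit; the cosine coefficient dominates
    coeffs, *_ = np.linalg.lstsq(basis, theta, rcond=None)
    print(coeffs)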

The pendulum example was not that impressive, though some applications of artificial intelligence truly are impressive, delivering results several layers of abstraction away from what the software was directly programmed to do. If there’s anything mysterious involved, it’s the statistical regularity of the world that allowed the software to make correct inference from the data it was given.

# Trends and Opportunities in Data Analysis

Andy Warhol said “In the future, everyone will be world-famous for 15 minutes.” Here’s my 15 seconds of fame, a soundbite from the IBM Insight conference last year.

My comments start at 1:30. In a nutshell, I predict that data analytics will work its way down from large companies to small companies.

# Big p, Little n

Statisticians use n to denote the number of subjects in a data set and p to denote nearly everything else. You’re supposed to know from context what each p means.

In the phrase “big n, little p” the symbol p means the number of measurements per subject. Traditional data sets are “big n, little p” because you have far more subjects than measurements per subject. For example, maybe you measure 10 things about 1000 patients.

Big data sets, such as those coming out of bioinformatics, are often “big p, little n.” For example, maybe you measure 20,000 biomarkers on 50 patients. This turns classical statistics sideways, literally and figuratively, literally in the sense that a “big p, little n” data set looks like the transpose of a “big n, little p” data set.

From the vantage point of a traditional statistician, “big p, little n” data sets give you very little to work with. If n is small, it doesn’t matter how big p is. In the example above, n = 50, not a big data set. But the biologist will say “What do you mean it’s not a big data set? I’ve given you 1,000,000 measurements!”

So how do you take advantage of large p even though n is small? That’s the big question. It summarizes the research program of many people in statistics and machine learning. There’s no general answer, at least not yet, though progress is being made in specific applications.

Related post: Nomenclatural abomination

This is what the book Social Media Mining calls the Big Data Paradox:

Social media data is undoubtedly big. However, when we zoom into individuals for whom, for example, we would like to make relevant recommendations, we often have little data for each specific individual. We have to exploit the characteristics of social media and use its multidimensional, multisource, and multisite data to aggregate information with sufficient statistics for effective mining.

… enormous data sets often consist of enormous numbers of small sets of data, none of which by themselves are enough to solve the thing you are interested in, and they fit together in some complicated way.

Big data doesn’t always tell us directly what we’d like to know. It may give us a gargantuan amount of slightly related data, from which we may be able to tease out what we want.

Related post: New data, not just bigger data

# The Fast Fourier Transform (FFT) and big data

The most direct way to compute a Fourier transform numerically takes O(n²) operations. The Fast Fourier Transform (FFT) can compute the same result in O(n log n) operations. If n is large, this can be a huge improvement.
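Here’s a short sketch contrasting the two approaches, assuming NumPy: a direct O(n²) implementation of the discrete Fourier transform checked against the FFT.

    import numpy as np

    def naive_dft(x):
        # Direct O(n^2) discrete Fourier transform via the DFT matrix
        n = len(x)
        k = np.arange(n)
        W = np.exp(-2j * np.pi * np.outer(k, k) / n)
        return W @ x

    x = np.random.randn(1024)
    print(np.allclose(naive_dft(x), np.fft.fft(x)))  # same result, very different cost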

James Cooley and John Tukey (re)discovered the FFT in 1965. It was thought to be an original discovery at the time. Only later did someone find a sketch of the algorithm in the papers of Gauss.

Daniel Rockmore wrote the article on the Fast Fourier Transform in The Princeton Companion to Applied Mathematics. He says

In this new world of 1960s “big data,” a clever reduction in computational complexity (a term not yet widely in use) could make a tremendous difference.

Rockmore adds a very interesting footnote to the sentence above:

Many years later Cooley told me that he believed that the Fast Fourier transform could be thought of as one of the inspirations for asymptotic algorithmic analysis and the study of computational complexity, as previous to the publication of his paper with Tukey very few people had considered data sets large enough to suggest the utility of an asymptotic analysis.


# Adaptive clinical trials and machine learning

Arguments over the difference between statistics and machine learning are often pointless. There is a huge overlap between the two approaches to analyzing data, sometimes obscured by differences in vocabulary. However, there is one distinction that is helpful. Statistics aims to build accurate models of phenomena, implicitly leaving the exploitation of these models to others. Machine learning aims to solve problems more directly, and sees its models as intermediate artifacts; if an unrealistic model leads to good solutions, it’s good enough.

This distinction is valid in broad strokes, though things are fuzzier than it admits. Some statisticians are content with constructing models, while others look further down the road to how the models are used. And machine learning experts vary in their interest in creating accurate models.

Clinical trial design usually comes under the heading of statistics, though in spirit it’s more like machine learning. The goal of a clinical trial is to answer some question, such as whether a treatment is safe or effective, while also having safeguards in place to stop the trial early if necessary. There is an underlying model—implicit in some methods, more often explicit in newer methods—that guides the conduct of the trial, but the accuracy of this model per se is not the primary goal. Some designs have been shown to be fairly robust, leading to good decisions even when the underlying probability model does not fit well. For example, I’ve done some work with clinical trial methods that model survival times with an exponential distribution. No one believes that an exponential distribution, i.e. one with constant hazard, accurately models survival times in cancer trials, and yet methods using these models do a good job of stopping trials early that should stop early and letting trials continue that should be allowed to continue.

Experts in machine learning are more accustomed to the idea of inaccurate models sometimes producing good results. The best example may be naive Bayes classifiers. The word “naive” in the name is a frank admission that these classifiers treat events known not to be independent as if they were independent. These methods can do well at their ultimate goal, such as distinguishing spam from legitimate email, even though they make a drastic simplifying assumption.
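Here’s a minimal from-scratch sketch of the idea on made-up data, not any particular spam filter’s implementation, showing the independence assumption as a simple product of per-word probabilities.

    import numpy as np

    # Toy data: each row marks which of three words appears in a message.
    # Columns: "free", "meeting", "winner"; label 1 = spam, 0 = legitimate.
    X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 1, 0], [0, 0, 0]])
    y = np.array([1, 1, 0, 0, 1, 0])

    def naive_bayes_predict(x_new, X, y, smoothing=1.0):
        # Score each class by P(class) times a product of per-word probabilities,
        # treating the words as independent given the class (the "naive" part).
        scores = []
        for c in (0, 1):
            Xc = X[y == c]
            prior = len(Xc) / len(X)
            p_word = (Xc.sum(axis=0) + smoothing) / (len(Xc) + 2*smoothing)
            scores.append(prior * np.prod(np.where(x_new == 1, p_word, 1 - p_word)))
        return int(scores[1] > scores[0])

    print(naive_bayes_predict(np.array([1, 0, 1]), X, y))  # classified as spam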

There have been papers that look at why naive Bayes works surprisingly well. Naive Bayes classifiers work well when the errors due to wrongly assuming independence affect positive and negative examples roughly equally. The inaccuracies of the model sort of wash out when the model is reduced to a binary decision, classifying as positive or negative. Something similar happens with the clinical trial methods mentioned above. The ultimate goal is to make correct go/no-go decisions, not to accurately model survival times. The naive exponential assumption affects both trials that should and should not stop, and the model predictions are reduced to a binary decision.

# A subtle way to over-fit

If you train a model on a set of data, it should fit that data well. The hope, however, is that it will fit a new set of data well. So in machine learning and statistics, people split their data into two parts. They train the model on one half, and see how well it fits on the other half. This is called cross validation, and it helps prevent over-fitting, fitting a model too closely to the peculiarities of a data set.

For example, suppose you have measured the value of a function at 100 points. Unbeknownst to you, the data come from a cubic polynomial plus some noise. You can fit these 100 points exactly with a 99th degree polynomial, but this gives you the illusion that you’ve learned more than you really have. But if you divide your data into test and training sets of 50 points each, overfitting on the training set will result in a terrible fit on the test set. If you fit a cubic polynomial to the training data, you should do well on the test set. If you fit a 49th degree polynomial to the training data, you’ll fit it perfectly, but do a horrible job with the test data.
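Here’s a sketch of that experiment, assuming NumPy and made-up cubic data; the Chebyshev basis is used only to keep the high-degree fit numerically well behaved.

    import numpy as np

    np.random.seed(0)
    x = np.linspace(-1, 1, 100)
    y = 2*x**3 - x + 0.1*np.random.randn(100)   # cubic plus noise

    x_train, y_train = x[::2], y[::2]           # 50 training points
    x_test,  y_test  = x[1::2], y[1::2]         # 50 test points

    for degree in (3, 49):
        fit = np.polynomial.Chebyshev.fit(x_train, y_train, degree)
        train_err = np.mean((fit(x_train) - y_train)**2)
        test_err  = np.mean((fit(x_test)  - y_test)**2)
        print(degree, train_err, test_err)

The cubic fit does about equally well on both halves, while the 49th degree fit nearly interpolates the training points and typically does far worse on the test points.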

Now suppose we have two kinds of models to fit. We train each on the training set, and pick the one that does better on the test set. We’re not over-fitting because we haven’t used the test data to fit our model. Except we really are: we used the test set to select a model, though we didn’t use the test set to fit the parameters in the two models. Think of a larger model as a tree. The top of the tree tells you which model to pick, and under that are the parameters for each model. When we think of this new hierarchical model as “the model,” then we’ve used our test data to fit part of the model, namely to fit the bit at the top.

With only two models under consideration, this isn’t much of a problem. But if you have a machine learning package that tries millions of models, you can be over-fitting in a subtle way, and this can give you more confidence in your final result than is warranted.

The distinction between parameters and models is fuzzy. That’s why “Bayesian model averaging” is ultimately just Bayesian modeling. You could think of the model selection as simply another parameter. Or you could go the other way around and think of each parameter value as an index for a family of models. So if you say you’re only using the test data to select models, not parameters, you could be fooling yourself.

For example, suppose you want to fit a linear regression to a data set. That is, you want to pick m and b so that y = mx + b is a good fit to the data. But now I tell you that you are only allowed to fit models with one degree of freedom. You’re allowed to do cross validation, but you’re only allowed to use the test data for model selection, not model fitting.

So here’s what you could do. Pick a constant value of b, call it b0. Now fit the one-parameter model y = mx + b0 on your training data, selecting the value of m only to minimize the error in fitting the training set. Now pick another value of b, call it b1, and see how well it does on the test set. Repeat until you’ve found the best value of b. You’ve essentially used the training and test data to fit a two-parameter model, albeit awkwardly.
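Here’s a sketch of that awkward procedure on made-up data, assuming NumPy; the grid of candidate intercepts b is arbitrary.

    import numpy as np

    np.random.seed(1)
    x = np.linspace(0, 1, 100)
    y = 3*x + 2 + 0.1*np.random.randn(100)

    x_train, y_train = x[::2], y[::2]
    x_test,  y_test  = x[1::2], y[1::2]

    best = None
    for b in np.linspace(0, 4, 41):
        # "Model fitting": choose m on the training data for this fixed b
        m = np.sum((y_train - b) * x_train) / np.sum(x_train**2)
        # "Model selection": evaluate this value of b on the test data
        test_err = np.mean((m*x_test + b - y_test)**2)
        if best is None or test_err < best[0]:
            best = (test_err, m, b)

    print(best)  # effectively a two-parameter fit that used both data sets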

Related: Probability modeling

# Machine learning and magic

When I first heard about a lie detector as a child, I was puzzled. How could a machine detect lies? If it could, why couldn’t you use it to predict the future? For example, you could say “IBM stock will go up tomorrow” and let the machine tell you whether you’re lying.

Of course lie detectors can’t tell whether someone is lying. They can only tell whether someone is exhibiting physiological behavior believed to be associated with lying. How well the latter predicts the former is a matter of debate.

I saw a presentation of a machine learning package the other day. Some of the questions implied that the audience had a magical understanding of machine learning, as if an algorithm could extract answers from data that do not contain the answer. The software simply searches for patterns in data by seeing how well various possible patterns fit, but there may be no pattern to be found. Machine learning algorithms cannot generate information that isn’t there any more than a polygraph machine can predict the future.

# Robust in one sense, sensitive in another

When you sort data and look at which sample falls in a particular position, that’s called order statistics. For example, you might want to know the smallest, largest, or middle value.

Order statistics are robust in a sense. The median of a sample, for example, is a very robust measure of central tendency. If Bill Gates walks into a room with a large number of people, the mean wealth jumps tremendously but the median hardly budges.

But order statistics are not robust in this sense: the identity of the sample in any given position can be very sensitive to perturbation. Suppose a room has an odd number of people so that someone has the median wealth. When Bill Gates and Warren Buffett walk into the room later, the value of the median wealth may not change much, but the person corresponding to that wealth will change.
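Here’s a small simulation of that scenario, with made-up lognormal wealth data, showing the median value barely moving while the median person changes.

    import numpy as np

    np.random.seed(3)
    wealth = np.random.lognormal(mean=11, sigma=1, size=101)  # a room of 101 people

    value_before  = np.median(wealth)
    person_before = np.argsort(wealth)[len(wealth)//2]        # index of the median person

    richer = np.append(wealth, [1.0e11, 7.0e10])              # two billionaires walk in
    value_after  = np.median(richer)
    person_after = np.argsort(richer)[len(richer)//2]

    print(value_before, value_after)    # the median value barely moves
    print(person_before, person_after)  # but a different person now holds it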

One way to evaluate machine learning algorithms is by how often they pick the right winner in some sense. For example, dose-finding algorithms are often evaluated on how often they pick the best dose from a set of doses being tested. This can be a terrible criterion, causing researchers to be misled by a particular set of simulation scenarios. It’s more important how often an algorithm makes a good choice than how often it makes the best choice.

Suppose five drugs are being tested. Two are nearly equally effective, and three are much less effective. A good experimental design will lead to picking one of the two good drugs most of the time. But if the best drug is only slightly better than the next best, it’s too much to expect any design to pick the best drug with high probability. In this case it’s better to measure the expected utility of a decision rather than how often a design makes the best decision.

# Independent decision making

Suppose a large number of people each have a slightly better than 50% chance of correctly answering a yes/no question. If they answered independently, the majority would very likely be correct.

For example, suppose there are 10,000 people, each with a 51% chance of answering a question correctly. The probability that more than 5,000 people will be right is about 98%. [1]
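Here’s a quick check of that number, assuming the binomial model implicit in the example above.

    from scipy.stats import binom

    # P(more than 5,000 of 10,000 people are right, each right with probability 0.51)
    print(binom.sf(5000, 10000, 0.51))  # about 0.98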

The key assumption here is independence, which is not realistic in most cases. But as people move in the direction of independence, the quality of the majority vote improves. Another assumption is that people are what machine learning calls “weak learners,” i.e. that they perform slightly better than chance. This holds more often than independence, but on some subjects people tend to do worse than chance, particularly experts.

You could call this the wisdom of crowds, but it’s closer to the wisdom of markets. As James Surowiecki points out in his book The Wisdom of Crowds, crowds (as in mobs) aren’t wise; large groups of independent decision makers are wise. Markets are wiser than crowds because they aggregate more independent opinions. Markets are subject to group-think as well, but not to the same extent as mobs.

* * *

[1] Suppose there are N people, each with independent probability p of being correct. Suppose N is large and p is near 1/2. Then the probability of a majority answering correctly is approximately

Prob( Z > (1 – 2p) sqrt(N) )

where Z is a standard normal random variable. You could calculate this in Python by

    from scipy.stats import norm
    from math import sqrt

    N, p = 10000, 0.51  # number of people, probability each is correct
    print( norm.sf( (1 - 2*p)*sqrt(N) ) )


This post is an elaboration of something I first posted on Google+.

# Book review: Practical Data Analysis

Many people have drawn Venn diagrams to locate machine learning and related ideas in the intellectual landscape. Drew Conway’s diagram may have been the first. It has at least been frequently referenced.

By this classification, Hector Cuesta’s new book Practical Data Analysis is located toward the “hacking skills” corner of the diagram. No single book can cover everything, and this one emphasizes practical software knowledge more than mathematical theory or details of a particular problem domain.

The biggest strength of the book may be that it brings together in one place information on tools that are used together but whose documentation is scattered. The book is a great source for sample code. The source code is available on GitHub, though it’s more understandable in the context of the book.

Much of the book uses Python and related modules and tools including:

• NumPy
• mlpy
• PIL
• twython
• Pandas
• NLTK
• IPython
• Wakari

It also uses D3.js (with JSON, CSS, HTML, …), MongoDB (with MapReduce, Mongo Shell, PyMongo, …), and miscellaneous other tools and APIs.

There’s a lot of material here in 360 pages, making it a useful reference.
