The cold start problem

How do you operate a data-driven application before you have any data? This is known as the cold start problem.

We faced this problem all the time when I designed clinical trials at MD Anderson Cancer Center. We uses Bayesian methods to design adaptive clinical trial designs, such as clinical trials for determining chemotherapy dose levels. Each patient’s treatment assignment would be informed by data from all patients treated previously.

But what about the first patient in a trial? You’ve got to treat a first patient, and treat them as well as you know how. They’re struggling with cancer, so it matters a great deal what treatment they are assigned. So you treat them according to expert opinion. What else could you do?

Thanks to the magic of Bayes theorem, you don’t have to have an ad hoc rule that says “Treat the first patient this way, then turn on the Bayesian machine to determine how to treat the next patient.” No, you use Bayes theorem from beginning to end. There’s no need to handle the first patient differently because expert opinion is already there, captured in the form of prior distributions (and the structure of the probability model).

Each patient is treated according to all information available at the time. At first all available information is prior information. After you have data on one patient, most of the information you have is still prior information, but Bayes’ theorem updates this prior information with your lone observation. As more data becomes available, the Bayesian machine incorporates it all, automatically shifting weight away from the prior and toward the data.

The cold start problem for business applications is easier than the cold start problem for clinical trials. First of all, most business applications don’t have the potential to cost people their lives. Second, business applications typically have fewer competing criteria to balance.

What if you’re not sure where to draw your prior information? Bayes can handle that too. You can use Bayesian model selection or Bayesian model averaging to determine which source (or weighting of sources) best fits the new data as it comes in.

Once you’ve decided to use a Bayesian approach, there’s still plenty of work to do, but the Bayesian approach provides scaffolding for that work, a framework for moving forward.

Something that bothers me about deep neural nets

Overfitting happens when a model does too good a job of matching a particular data set and so does a poor job on new data. The way traditional statistical models address the danger of overfitting is to limit the number of parameters. For example, you might fit a straight line (two parameters) to 100 data points, rather than using a 99-degree polynomial that could match the input data exactly and probably do a terrible job on new data. You find the best fit you can to a model that doesn’t have enough flexibility to match the data too closely.

Deep neural networks have enough parameters to overfit the data, but there are various strategies to keep this from happening. A common way to avoid overfitting is to deliberately do a mediocre job of fitting the model.

When it works well, the shortcomings of the optimization procedure yield a solution that differs from the optimal solution in a beneficial way. But the solution could fail to be useful in several ways. It might be too far from optimal, or deviate from the optimal solution in an unhelpful way, or the optimization method might accidentally do too good a job.

It a nutshell, the disturbing thing is that you have a negative criteria for what constitutes a good solution: one that’s not too close to optimal. But there are a lot of ways to not be too close to optimal. In practice, you experiment until you find an optimally suboptimal solution, i.e. the intentionally suboptimal fit that performs the best in validation.

 

The big deal about neural networks

In their book Computer Age Statistical Inference, Brad Efron and Trevor Hastie give a nice description of neutral networks and deep learning.

The knee-jerk response [to neural networks] from statisticians was “What’s the big deal? A neural network is just a nonlinear model, not too different from many other generalizations of linear models.”

While this may be true, neural networks brought a new energy to the field. They could be scaled up and generalized in a variety of ways … And most importantly, they were able to solve problems on a scale far exceeding what the statistics community was used to. This was part computing scale expertise, part liberated thinking and creativity on the part of this computer science community.

After enjoying considerable popularity for a number of years, neural networks were somewhat sidelined by new inventions in the mid 1990’s. … Neural networks were passé. But then they re-emerged with a vengeance after 2010 … the reincarnation now being called deep learning.

Quantifying uncertainty

The primary way to quantify uncertainty is to use probability. Subject to certain axioms that aim to capture common-sense rules for quantifying uncertainty, probability theory is essentially the only way. (This is Cox’s theorem.)

Other methods, such as fuzzy logic, may be useful, though they must violate common sense (at least as defined by Cox’s theorem) under some circumstances. They may be still useful when they provide approximately the results that probability would have provided and at less effort and stay away from edge cases that deviate too far from common sense.

There are various kinds of uncertainty, principally epistemic uncertainty (lack of knowledge) and aleatory uncertainty (randomness), and various philosophies for how to apply probability. One advantage to the Bayesian approach is that it handles epistemic and aleatory uncertainty in a unified way.

Blog posts related to quantifying uncertainty:

Pros and cons of the term “data science”

I’ve resisted using the term “data science,” and enjoy poking fun at it now and then, but I’ve decided it’s not such a bad label after all.

Here are some of the pros and cons of the term. (Listing “cons” first seems backward, but I’m currently leaning toward the pro side, so I thought I should conclude with it.)

Cons

The term “data scientist” is sometimes used to imply more novelty than is there. There’s not a great deal of difference between data science and statistics, though the new term is more fashionable. (Someone quipped that data science is statistics on a Mac.)

Similarly, the term data scientist is sometimes used as an excuse for ignorance, as in “I don’t understand probability and all that stuff, but I don’t need to because I’m a data scientist, not a statistician.”

The big deal about data science isn’t data but the science of drawing inferences from the data. Inference science would be a better term, in my opinion, but that term hasn’t taken off.

Pros

Data science could be a useful umbrella term for statistics, machine learning, decision theory, etc. Also, the title data scientist is rightfully associated with people who have better computational skills than statisticians typically have.

While the term data science isn’t perfect, there’s little to recommend the term statistics other than that it is well established. The root of statistics is state, as in a government. This is because statistics was first applied to the concerns of bureaucracies. The term statistics would be equivalent to governmentistics, a historically accurate but otherwise useless term.

Adaptive clinical trials and machine learning

Arguments over the difference between statistics and machine learning are often pointless. There is a huge overlap between the two approaches to analyzing data, sometimes obscured by differences in vocabulary. However, there is one distinction that is helpful. Statistics aims to build accurate models of phenomena, implicitly leaving the exploitation of these models to others. Machine learning aims to solve problems more directly, and sees its models as intermediate artifacts; if an unrealistic model leads to good solutions, it’s good enough.

This distinction is valid in broad strokes, though things are fuzzier than it admits. Some statisticians are content with constructing models, while others look further down the road to how the models are used. And machine learning experts vary in their interest in creating accurate models.

Clinical trial design usually comes under the heading of statistics, though in spirit it’s more like machine learning. The goal of a clinical trial is to answer some question, such as whether a treatment is safe or effective, while also having safeguards in place to stop the trial early if necessary. There is an underlying model—implicit in some methods, more often explicit in newer methods—that guides the conduct of the trial, but the accuracy of this model per se is not the primary goal. Some designs have been shown to be fairly robust, leading to good decisions even when the underlying probability model does not fit well. For example, I’ve done some work with clinical trial methods that model survival times with an exponential distribution. No one believes that an exponential distribution, i.e. one with constant hazard, accurately models survival times in cancer trials, and yet methods using these models do a good job of stopping trials early that should stop early and letting trials continue that should be allowed to continue.

Experts in machine learning are more accustomed to the idea of inaccurate models sometimes producing good results. The best example may be naive Bayes classifiers. The word “naive” in the name is a frank admission that these classifiers model as independent events known not to be independent. These methods can do well at their ultimate goal, such as distinguishing spam from legitimate email, even though they make a drastic simplifying assumption.

There have been papers that look at why naive Bayes works surprisingly well. Naive Bayes classifiers work well when the errors due to wrongly assuming independence effect positive and negative examples roughly equally. The inaccuracies of the model sort of wash out when the model is reduced to a binary decision, classifying as positive or negative. Something similar happens with the clinical trial methods mentioned above. The ultimate goal is to make correct go/no-go decisions, not to accurately model survival times. The naive exponential assumption effects both trials that should and should not stop, and the model predictions are reduced to a binary decision.

Related: Adaptive clinical trial design

 

A subtle way to over-fit

If you train a model on a set of data, it should fit that data well. The hope, however, is that it will fit a new set of data well. So in machine learning and statistics, people split their data into two parts. They train the model on one half, and see how well it fits on the other half. This is called cross validation, and it helps prevent over-fitting, fitting a model too closely to the peculiarities of a data set.

For example, suppose you have measured the value of a function at 100 points. Unbeknownst to you, the data come from a cubic polynomial plus some noise. You can fit these 100 points exactly with a 99th degree polynomial, but this gives you the illusion that you’ve learned more than you really have. But if you divide your data into test and training sets of 50 points each, overfitting on the training set will result in a terrible fit on the test set. If you fit a cubic polynomial to the training data, you should do well on the test set. If you fit a 49th degree polynomial to the training data, you’ll fit it perfectly, but do a horrible job with the test data.

Now suppose we have two kinds of models to fit. We train each on the training set, and pick the one that does better on the test set. We’re not over-fitting because we haven’t used the test data to fit our model. Except we really are: we used the test set to select a model, though we didn’t use the test set to fit the parameters in the two models. Think of a larger model as a tree. The top of the tree tells you which model to pick, and under that are the parameters for each model. When we think of this new hierarchical model as “the model,” then we’ve used our test data to fit part of the model, namely to fit the bit at the top.

With only two models under consideration, this isn’t much of a problem. But if you have a machine learning package that tries millions of models, you can be over-fitting in a subtle way, and this can give you more confidence in your final result than is warranted.

The distinction between parameters and models is fuzzy. That’s why “Bayesian model averaging” is ultimately just Bayesian modeling. You could think of the model selection as simply another parameter. Or you could go the other way around and think of each parameter value as an index for a family of models. So if you say you’re only using the test data to select models, not parameters, you could be fooling yourself.

For example, suppose you want to fit a linear regression to a data set. That is, you want to pick m and b so that y = mx + b is a good fit to the data. But now I tell you that you are only allowed to fit models with one degree of freedom. You’re allowed to do cross validation, but you’re only allowed to use the test data for model selection, not model fitting.

So here’s what you could do. Pick a constant value of b, call it b0. Now fit the one-parameter model y = mx + b0 on your training data, selecting the value of m only to minimize the error in fitting the training set. Now pick another value of b, call it b1, and see how well it does on the test set. Repeat until you’ve found the best value of b. You’ve essentially used the training and test data to fit a two-parameter model, albeit awkwardly.

Related: Probability modeling

Machine learning and magic

When I first heard about a lie detector as a child, I was puzzled. How could a machine detect lies? If it could, why couldn’t you use it to predict the future? For example, you could say “IBM stock will go up tomorrow” and let the machine tell you whether you’re lying.

Of course lie detectors can’t tell whether someone is lying. They can only tell whether someone is exhibiting physiological behavior believed to be associated with lying. How well the latter predicts the former is a matter of debate.

I saw a presentation of a machine learning package the other day. Some of the questions implied that the audience had a magical understanding of machine learning, as if an algorithm could extract answers from data that do not contain the answer. The software simply searches for patterns in data by seeing how well various possible patterns fit, but there may be no pattern to be found. Machine learning algorithms cannot generate information that isn’t there any more than a polygraph machine can predict the future.

Robust in one sense, sensitive in another

When you sort data and look at which sample falls in a particular position, that’s called order statistics. For example, you might want to know the smallest, largest, or middle value.

Order statistics are robust in a sense. The median of a sample, for example, is a very robust measure of central tendency. If Bill Gates walks into a room with a large number of people, the mean wealth jumps tremendously but the median hardly budges.

But order statistics are not robust in this sense: the identity of the sample in any given position can be very sensitive to perturbation. Suppose a room has an odd number of people so that someone has the median wealth. When Bill Gates and Warren Buffett walk into the room later, the value of the median income may not change much, but the person corresponding to that income will change.

One way to evaluate machine learning algorithms is by how often they pick the right winner in some sense. For example, dose-finding algorithms are often evaluated on how often they pick the best dose from a set of doses being tested. This can be a terrible criteria, causing researchers to be mislead by a particular set of simulation scenarios. It’s more important how often an algorithm makes a good choice than how often it makes the best choice.

Suppose five drugs are being tested. Two are nearly equally effective, and three are much less effective. A good experimental design will lead to picking one of the two good drugs most of the time. But if the best drug is only slightly better than the next best, it’s too much to expect any design to pick the best drug with high probability. In this case it’s better to measure the expected utility of a decision rather than how often a design makes the best decision.

Book review: Practical Data Analysis

Many people have drawn Venn diagrams to locate machine learning and related ideas in the intellectual landscape. Drew Conway’s diagram may have been the first. It has at least been frequently referenced.

By this classification, Hector Cuesta’s new book Practical Data Analysis is located toward the “hacking skills” corner of the diagram. No single book can cover everything, and this one emphasizes practical software knowledge more than mathematical theory or details of a particular problem domain.

The biggest strength of the book may be that it brings together in one place information on tools that are used together but whose documentation is scattered. The book is great source for sample code. The source code  is available on GitHub, though it’s more understandable in the context of the book.

Much of the book uses Python and related modules and tools including:

  • NumPy
  • mlpy
  • PIL
  • twython
  • Pandas
  • NLTK
  • IPython
  • Wakari

It also uses D3.js (with JSON, CSS, HTML, …), MongoDB (with MapReduce, Mongo Shell, PyMongo, …), and miscellaneous other tools and APIs.

There’s a lot of material here in 360 pages, making it a useful reference.