New data, not just bigger data

The Insight 2015 conference highlighted some impressive applications of big data: predicting the path of hurricanes more accurately (as we saw with Hurricane Patricia), improving the performance of athletes, making cars safer, etc.

These applications involve large amounts of data. But more importantly they involve new data, not simply greater quantities of data we’ve had before. Cheap sensors make it possible to measure things more directly and in higher resolution than before. We have sources of data, such as social media, that are qualitatively different from what we’ve had in the past.

Simply saying we have more data than before obscures what’s happening. For example, we don’t know more about consumer behavior than a generation ago because we do more phone surveys and have more customer satisfaction postcards to fill out. We know more because we can observe things we couldn’t observe before.

Clever analysis deserves some credit for the successes of big data, but more credit goes to new sources of data and the technologies that make these sources possible.

Balancing profit and learning in A/B testing

A/B testing, or split testing, is commonly used in web marketing to decide which of two design options performs better. If you have so many visitors to a site that the number of visitors used in a test is negligible, conventional randomization schemes are the way to go. They’re simple and effective.

But if you have less traffic so that the number of visitors involved in a test is appreciable, you might be concerned with possible lost revenue during the test itself. The point of A/B testing is to improve profitability after the test, not during the test. If you also want to consider profitability during the test, you might want to consider more alternatives.

My experience with testing comes from a context where the stakes are higher than improving conversion on websites: treating cancer patients. You want to find out which treatments perform better for the sake of future patients, those who will be treated after the randomized trial. But you also want to treat the participants in the clinical trial effectively. Two ways we would do that are early stopping rules and adaptive randomization. Both practices are applicable to A/B testing web pages.

A conventional clinical trial might take a few hundred patients and randomize half to one treatment and half to another. But if one treatment appears to be much more effective, at some point it becomes unconscionable to keep assigning the less effective treatment. So you stop the experiment early. You might want to do the same with web designs. If you planned to show two variations of a page to 500 visitors each, but after 100 visitors it’s obvious which version is performing better, you’d like to stop the test and show everyone the better page. On the other hand, if you have so many visitors that you’re not concerned with what happens to the 1000 visitors in the test, just let the test run to completion.
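One way to make "obvious which version is performing better" concrete is a Bayesian stopping rule. The sketch below is only an illustration, not the rule used in any particular trial: it assumes uniform Beta(1, 1) priors on each conversion rate, and the function names, the 0.99 threshold, and the visitor counts are all hypothetical choices.

```python
import random

def prob_a_beats_b(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_A > rate_B | data) under uniform priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a uniform prior, each posterior is Beta(successes + 1, failures + 1).
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += rate_a > rate_b
    return wins / draws

def should_stop(conv_a, n_a, conv_b, n_b, threshold=0.99):
    """Stop early when either version is very likely to be the better one."""
    p = prob_a_beats_b(conv_a, n_a, conv_b, n_b)
    return p > threshold or p < 1 - threshold

# Hypothetical interim results after 100 visitors per page.
print(should_stop(30, 100, 10, 100))  # lopsided results
print(should_stop(12, 100, 10, 100))  # close results
```

The threshold is a judgment call. A stricter threshold stops less often and guards against being fooled by an early lucky streak.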

Another approach is to compromise between equal randomization and early stopping. Suppose A is performing better than B, but not so much better that you’re willing to stop and declare A the winner. You might keep randomizing, but increase the probability that the test will assign A. If A really is better, more visitors will see the better page. But if you’re wrong and B is really better, you may still discover this because some visitors are still seeing B. If B keeps performing better, the tide will turn and the test will prefer it. This is called adaptive randomization. The more evidence there is that one version is better, the higher the probability that you’ll show people that version.
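Here is a minimal sketch of one adaptive randomization scheme, often called Thompson sampling. It is one possible scheme among several, and the uniform priors, function names, and "true" conversion rates in the simulation are assumptions for illustration. Each visitor sees the version whose posterior draw is larger, so a version is shown with probability equal to the current posterior probability that it is the better one.

```python
import random

def choose_version(stats, rng):
    """Pick a version with probability equal to the posterior chance it is best."""
    draws = {}
    for name, (conversions, visitors) in stats.items():
        # Uniform prior, so the posterior is Beta(conversions + 1, misses + 1).
        draws[name] = rng.betavariate(conversions + 1, visitors - conversions + 1)
    return max(draws, key=draws.get)

def simulate(true_rates, n_visitors, seed=0):
    """Run an adaptively randomized test against hypothetical true rates."""
    rng = random.Random(seed)
    stats = {name: (0, 0) for name in true_rates}  # (conversions, visitors)
    for _ in range(n_visitors):
        name = choose_version(stats, rng)
        conversions, visitors = stats[name]
        converted = rng.random() < true_rates[name]
        stats[name] = (conversions + converted, visitors + 1)
    return stats

# Hypothetical example: A truly converts at 10%, B at 5%.
print(simulate({"A": 0.10, "B": 0.05}, n_visitors=2000))
```

Early on the two versions are shown about equally often. As evidence accumulates, the assignment probabilities drift toward whichever version is performing better, but the other version never drops to probability zero.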

One way to use adaptive randomization is variable experiment sizes. Instead of deciding a test size in advance, you test until you’re satisfied that you’ve found a winner. That may require fewer visitors than a conventional A/B test. It may also require more, but only when there’s a good reason to. The test may go into overtime, so to speak, because the two versions are performing similarly, in which case you’d like to keep testing longer to find which is better.

It’s easy to fall into thinking that the winner of a test will be used forever, whether you’re testing web pages or cancer treatments. But this isn’t the case. The winner will eventually be tested against something else, maybe very soon. This means that you might want to put a little more emphasis on the performance during the test and not just performance after the test, because there may not be much opportunity for performance after the test.

Insight 2015

A few weeks ago I got a message on Twitter saying that IBM’s Watson had identified me as an “influencer” and invited me to the company’s Insight 2015 conference. So that’s where I am this week.

I had a brief interview last night. Someone took this photo as we were setting up.

Impulse response

You may expect that a burst of input will cause a burst of output. Sometimes that’s the case, but often a burst of input results in a long, smoothly decreasing succession of output. You may not get immediate results, but long-term results. This is true of life in general, but it’s also true in a precise sense of differential equations.

One of the surprises from differential equations is that an infinitely concentrated input usually results in a diffuse output. A fundamental solution to a differential equation is a solution to the equation with a Dirac delta as the forcing function. In a sense, your input is so concentrated that it’s not actually a function. And yet the output may be a nice continuous function, and one that is not particularly concentrated.

The situation is analogous to striking a bell. The input, the hammer blow to the bell, is extremely short, but the response of the bell is long and smooth. Solving a differential equation with a delta function as input is like learning about a bell by listening to how it rings when you strike it. A better analogy would be striking the bell in many places; a fundamental solution actually solves for a delta function with a position argument, not just a single delta function.
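
As a concrete sketch (an illustrative choice of equation, not something the analogy depends on), model the bell crudely as a damped oscillator with damping γ and natural frequency ω_0, assuming 0 < γ < ω_0, at rest until it is struck at time zero:

y''(t) + 2\gamma\, y'(t) + \omega_0^2\, y(t) = \delta(t), \qquad y(t) = 0 \mbox{ for } t < 0

The solution is

y(t) = H(t)\, \frac{e^{-\gamma t} \sin(\omega t)}{\omega}, \qquad \omega = \sqrt{\omega_0^2 - \gamma^2}

where H is the Heaviside step function. The input lasts no time at all, yet the output is continuous and keeps ringing under an exponentially decaying envelope.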

If you’re curious how this informal talk of “infinitely concentrated” input and delta “functions” can be made rigorous, start by reading this post.

Related post: Life lessons from differential equations

Permutations and tests

Suppose a test asks you to place 10 events in chronological order. Label these events A through J so that chronological order is also alphabetical order.

If a student answers BACDEFGHIJ, then did they make two mistakes or just one? Two events are in the wrong position, but they made one transposition error. The simplest way to grade such a test would be to count the number of events that are in the correct position. Is this the most fair way to grade?

If you decide to count how many transpositions are needed to correct a student’s answer, do you count any transposition or only adjacent transpositions? For example, if someone answered JBCDEFGHIA, then transposing the A and the J is enough to put the results in order. But reversing the first and last event seems like a bigger mistake than reversing the first two events. Counting only adjacent transpositions would penalize this mistake more. You would have to swap the J past each of the nine other letters to reach the end, then swap the A back past the eight letters B through I, for 17 adjacent transpositions in all. But it hardly seems that answering JBCDEFGHIA is 17 times worse than answering BACDEFGHIJ.

Maybe counting transpositions is too much work. So we just go back to counting how many events are in the right place. But then suppose someone answers JABCDEFGHI. This is completely wrong since every event is in the wrong position. But the student obviously knows something, since the relative order of nearly all of the events is correct. From one perspective there was only one mistake: J comes last, not first.

What is the worst possible answer? Maybe getting the order exactly backward? If you have an odd number of events, then getting the order backward means one event is in the right place, and so that doesn’t receive the lowest possible score.
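
The three grading schemes discussed above can be compared side by side. This is a rough sketch with hypothetical function names. It relies on two standard facts: the minimum number of arbitrary transpositions needed to sort a permutation is its length minus its number of cycles, and the minimum number of adjacent transpositions is its number of inversions.

```python
def as_permutation(answer, key):
    """Map each event in the answer to its index in the correct order."""
    position = {event: i for i, event in enumerate(key)}
    return [position[event] for event in answer]

def correct_positions(answer, key):
    """Number of events placed in exactly the right slot."""
    return sum(a == k for a, k in zip(answer, key))

def min_transpositions(answer, key):
    """Fewest swaps of any two events needed to fix the answer."""
    perm = as_permutation(answer, key)
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:  # walk around the cycle containing start
                seen[j] = True
                j = perm[j]
    return len(perm) - cycles

def min_adjacent_transpositions(answer, key):
    """Fewest swaps of neighboring events, i.e. the number of inversions."""
    perm = as_permutation(answer, key)
    n = len(perm)
    return sum(perm[i] > perm[j] for i in range(n) for j in range(i + 1, n))

KEY = "ABCDEFGHIJ"
for answer in ["BACDEFGHIJ", "JBCDEFGHIA", "JABCDEFGHI", "JIHGFEDCBA"]:
    print(answer,
          correct_positions(answer, KEY),
          min_transpositions(answer, KEY),
          min_adjacent_transpositions(answer, KEY))
```

No single column ranks the four answers the way intuition does, which is the point: each measure encodes a different idea of what counts as a serious mistake.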

This is an interesting problem beyond grading exams. (As for grading exams, I’d suggest simply not using questions of this type on an exam.) In manufacturing, how serious a mistake is it to reverse two consecutive components versus two distant components? You could also ask the same question when comparing DNA sequences or other digital signals. The best way to assign a distance between the actual and desired sequence would depend entirely on context.

How did our ancestors sleep?

Electric lighting has changed the way we sleep, encouraging us to lose sleep by staying awake much longer after dark than we otherwise would.

Or maybe not. A new study of three contemporary hunter-gatherer tribes found that they stay awake long after dark and sleep an average of 6.5 hours a night. They also don’t nap much [1]. This suggests the way we sleep may not be that different from our ancient forebears.

Historian A. Roger Ekirch suggested that before electric lighting it was common to sleep in two four-hour segments with an hour or so of wakefulness in between. His theory was based primarily on medieval English texts that refer to “first sleep” and “second sleep” and has other literary support as well. A small study found that subjects settled into the sleep pattern Ekirch predicted when they were in a dark room for 14 hours each night for a month. But the hunter-gatherers don’t sleep this way.

Maybe latitude is an important factor. The hunter-gatherers mentioned above live between 2 and 20 degrees south of the equator whereas England is 52 degrees north of the equator. Maybe two-phase sleep was more common at high latitudes with long winter nights. Of course there are many differences between modern/ancient [2] hunter-gatherers and medieval Western Europeans besides latitude.

Two studies have found two patterns of how people sleep without electric lights. Maybe electric lights don’t have as much impact on how people sleep as other factors.

Related post: Paleolithic nonsense

* * *

[1] The study participants were given something like a Fitbit to wear. The article said that naps less than 15 minutes would be below the resolution of the monitors, so we don’t know how often the participants took cat naps. We only know that they rarely took longer naps.

[2] There is an implicit assumption that the contemporary hunter-gatherers live and, in particular, sleep like their ancient ancestors. This seems reasonable, though we can’t be certain. There is also the bigger assumption that the tribesmen represent not only their ancestors but all paleolithic humans. Maybe they do, and we don’t have much else to go on, but we don’t know. I suspect there was more diversity in the paleolithic era than we assume.

Fibonacci formula for pi

Here’s an unusual formula for pi based on the product and least common multiple of the first m Fibonacci numbers.

 

\pi = \lim_{m\to\infty} \sqrt{\frac{6 \log F_1 \cdots F_m}{\log \mbox{lcm}( F_1, \ldots, F_m )}}

Unlike the formula I wrote about a few days ago relating Fibonacci numbers and pi, this one is not as simple to prove. The numerator inside the root is easy enough to estimate asymptotically, but estimating the denominator depends on the distribution of primes.
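
For the curious, here is a small sketch that evaluates the expression inside the limit for finite m. The function name is my own, and the code sums logarithms rather than forming the product directly. Convergence is slow, so expect only rough agreement with π for moderate m.

```python
import math

def fibonacci_pi_estimate(m):
    """Evaluate sqrt(6 log(F_1 ... F_m) / log lcm(F_1, ..., F_m)) for finite m."""
    if m < 3:
        raise ValueError("need m >= 3 so that the lcm exceeds 1")
    log_product = 0.0
    lcm = 1
    a, b = 1, 1  # F_1, F_2
    for _ in range(m):
        log_product += math.log(a)         # log of the running product
        lcm = lcm * a // math.gcd(lcm, a)  # running least common multiple
        a, b = b, a + b
    return math.sqrt(6 * log_product / math.log(lcm))

for m in (10, 50, 200):
    print(m, fibonacci_pi_estimate(m))
```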

Source: Yuri V. Matiyasevich and Richard K. Guy, A new formula for π, American Mathematical Monthly, Vol 93, No. 8 (October 1986), pp. 631-635.

 

PACE: Property Assessed Clean Energy

Energy efficiency improvements can pay for themselves in the long run. Financing can make the improvements immediately cash-flow positive, but only if the loan tenor can match the useful life of the equipment. This enables the payments to be low enough that the projected energy savings exceed the payments.
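
To see why the tenor matters, here is a back-of-the-envelope sketch using the standard level-payment loan formula. All the numbers are hypothetical (a $100,000 upgrade, 6% annual interest, $12,000 in projected annual savings) and are not drawn from any actual PACE project.

```python
def annual_payment(principal, rate, years):
    """Level annual payment that repays the principal over the given term."""
    return principal * rate / (1 - (1 + rate) ** -years)

# Hypothetical project figures, for illustration only.
principal = 100_000
rate = 0.06
annual_savings = 12_000

for years in (5, 20):
    payment = annual_payment(principal, rate, years)
    net = annual_savings - payment
    print(f"{years:2d}-year term: payment {payment:9.2f}, net cash flow {net:10.2f}")
```

With these made-up figures the short loan costs more per year than the upgrade saves, while the long loan leaves money on the table from the first year.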

PACE, which stands for Property Assessed Clean Energy, is a nation-wide program that makes long-term financing available for energy upgrades, repaid through an annual assessment added to your property tax bill. Though it is a national initiative, each state must create its own PACE program, and so there is some variety in the forms PACE can take. Texas passed legislation in June 2013 authorizing local governments to implement PACE programs. Yesterday the Houston City Council unanimously passed a Resolution of Intent to adopt PACE.

I have been working with PACE Houston, a private company that develops PACE projects by providing strategic advice to property owners. If you’re in the Houston area and are interested in help with PACE financing, you could contact me, or go to the contact page on the PACE Houston website.

A rose by any other name: Data science etc.

I help people make decisions in the face of uncertainty. Sounds interesting.

I’m a data scientist. Not sure what that means, but it sounds cool.

I study machine learning. Hmm. Maybe interesting, maybe a little ominous.

I’m into big data. Exciting or passé, depending on how many times you’ve heard the term.

Even though each of these descriptions makes a different impression, they’re all essentially the same thing. You could throw in a few more terms too, like artificial intelligence, inferential science, decision theory, or inverse probability.

There are distinctions. These terms don’t entirely overlap, but the overlap is huge. They all have to do with taking data and making an inference.

“Decision-making under uncertainty” emphasizes that you never have complete data, and yet you need to make decisions anyway. “Decision theory” emphasizes that the whole point of analyzing data is to do something as a result, and suggests that focusing directly on the decision itself, rather than proxies along the way, is the best way to do this.

“Data science” stresses that there is more to the process of making inferences than what falls under the traditional heading of “statistics.” Statistics has never been only about “the grotesque phenomenon generally known as mathematical statistics,” as Francis Anscombe described it. Things like data cleaning and visualization have always been part of the practice of statistics, though not the theory of statistics. Data science also emphasizes the role of computation. Some say a data scientist is a statistician who can program. Some say data science is statistics on a Mac.

Despite the hype around the term data science, it’s growing on me. It has its drawbacks, but so does every other name.

Machine learning, like decision theory, emphasizes the ultimate goal of doing something with data rather than creating an accurate model of the process that generates the data. If you can create such a model, so much the better. But it may not be necessary to have a great model in order to accomplish what you originally set out to do. “Naive Bayes,” for example, is a classification algorithm that is admittedly naive. It knowingly makes a gross simplification, treating as independent events that we know are certainly not independent, and yet it often works well enough.

“Big data” is a big can of worms. It is often concerned with data sets that are indeed big, but it also implies other things, such as the way the data become available, as a real time stream rather than as a complete static set. See Erik Meijer’s Big data cube. And that’s just when the term “big data” is used in some fairly meaningful way. It’s also used so broadly as to be meaningless.

The term “statistics” literally means the mathematics of the interests of states, as in governments, because these were the first applications of statistics. So while “statistics” may be the most established and perhaps most respectable term discussed here, it’s not great. As I remarked here, “The term statistics would be equivalent to governmentistics, a historically accurate but otherwise useless term.” Statistics emphasizes probability models and mathematical rigor more than other variations on data analysis do. Statisticians criticize machine learning folks for being sloppy. Machine learning folks criticize statisticians for being too conservative, or for being too focused on description and not focused enough on prediction.

Bayesian statistics is much older than what is now sometimes called “classical” statistics. It was essentially dormant during the first half of the 20th century before experiencing a renaissance in the second half of the century. Bayesian statistics was originally called “inverse probability” for good reason. Probability theory takes the probabilities of events as given and makes inferences about possible outcomes. Bayesian statistics does the inverse, taking data as given and inferring the probabilities that lead to the data. All statistics does something like this, but Bayesian statistics is consistent in forming all inference directly as probabilities. Frequentist (“classical”) statistics also infers probabilities, but the results, things like p-values and confidence intervals, are not the probabilities of what most people think they are. See Anthony O’Hagan’s description here.
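
A toy example may make the "inverse" direction concrete. The coin-flip data, the uniform prior, and the function names below are all assumptions chosen for illustration. The forward calculation takes the probability of heads as given and asks about outcomes. The inverse calculation takes the outcome as given and asks about the probability of heads.

```python
from math import comb

def binomial_pmf(heads, flips, p):
    """Forward direction: probability of the data given a known p."""
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)

def posterior_prob_above(heads, flips, cutoff, grid=10000):
    """Inverse direction: P(p > cutoff | data) under a uniform prior.

    Uses a midpoint-rule approximation of the posterior on a grid.
    """
    width = 1.0 / grid
    total = above = 0.0
    for i in range(grid):
        p = (i + 0.5) * width
        weight = p**heads * (1 - p) ** (flips - heads)  # likelihood times flat prior
        total += weight
        if p > cutoff:
            above += weight
    return above / total

# Hypothetical data: 7 heads in 10 flips.
print(binomial_pmf(7, 10, 0.5))          # chance of this data if the coin is fair
print(posterior_prob_above(7, 10, 0.5))  # chance the coin favors heads, given the data
```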

Data analysis has gone by many names over time, sometimes with meaningful distinctions and sometimes not. Often people make a distinction without a difference.

Fibonacci numbers, arctangents, and pi

Here’s an unusual formula for π. Let F_n be the nth Fibonacci number. Then

\pi = 4 \sum_{n=1}^\infty \arctan\left( \frac{1}{F_{2n+1}} \right)

As mysterious as this equation may seem, it’s not hard to prove. The arctangent identity

\arctan\left(\frac{1}{F_{2n+1}}\right) = \arctan\left(\frac{1}{F_{2n}}\right) - \arctan\left(\frac{1}{F_{2n+2}}\right)

shows that the sum telescopes, leaving only the first term, arctan(1) = π/4. To prove the arctangent identity, take the tangent of both sides, use the addition law for tangents, and use the Fibonacci identity

F_{n+1} F_{n-1} - F_n^2 = (-1)^n
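
If you would like numerical reassurance before working through the proof, the sketch below checks the series, the arctangent identity, and the Fibonacci identity in floating point and integer arithmetic. The function names are hypothetical, and Fibonacci numbers are indexed so that F_1 = F_2 = 1.

```python
import math

def fibonacci(n):
    """Return F_n with F_0 = 0 and F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def pi_partial_sum(terms):
    """Four times the sum of arctan(1/F_{2n+1}) for n = 1..terms."""
    return 4 * sum(math.atan(1 / fibonacci(2 * n + 1)) for n in range(1, terms + 1))

def identity_gap(n):
    """Absolute difference between the two sides of the arctangent identity."""
    lhs = math.atan(1 / fibonacci(2 * n + 1))
    rhs = math.atan(1 / fibonacci(2 * n)) - math.atan(1 / fibonacci(2 * n + 2))
    return abs(lhs - rhs)

def cassini_holds(n):
    """Check F_{n+1} F_{n-1} - F_n^2 = (-1)^n exactly, using integers."""
    return fibonacci(n + 1) * fibonacci(n - 1) - fibonacci(n) ** 2 == (-1) ** n

print(pi_partial_sum(30), math.pi)
print(max(identity_gap(n) for n in range(1, 16)))
print(all(cassini_holds(n) for n in range(1, 50)))
```

Because the sum telescopes, the error after N terms is 4 arctan(1/F_{2N+2}), which shrinks geometrically.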

See this post for an even more remarkable formula relating Fibonacci numbers and π.