Beta inequalities in R

Someone asked me yesterday for R code to compute the probability P(X > Y + δ) where X and Y are independent beta random variables. I’m posting the solution here in case it benefits anyone else.

For an example of why you might want to compute this probability, see A Bayesian view of Amazon resellers.

Let X be a Beta(a, b) random variable and Y be a Beta(c, d) random variable. Denote PDFs by f and CDFs by F. Then the probability we need is

P(X > Y + \delta) = \int_\delta^1 \int_0^{x-\delta} f_X(x)\, f_Y(y)\, dy\, dx = \int_\delta^1 f_X(x)\, F_Y(x-\delta)\, dx

If you just need to compute this probability a few times, here is a desktop application to compute random inequalities.

But if you need to do this computation repeatedly inside R code, you could use the following.

beta.ineq <- function(a, b, c, d, delta)
{
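    # P(X > Y + delta) where X ~ Beta(a, b) and Y ~ Beta(c, d); see the integral above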
    integrand <- function(x) { dbeta(x, a, b)*pbeta(x-delta, c, d) }
    integrate(integrand, delta, 1, rel.tol=1e-4)$value
}
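
For example, calling it with some arbitrary illustrative parameters:

beta.ineq(3, 7, 2, 10, 0.1)   # P(X > Y + 0.1) for X ~ Beta(3, 7) and Y ~ Beta(2, 10)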

The code is as good or as bad as R’s integrate function. It’s probably accurate enough as long as none of the parameters a, b, c, or d are near zero. When one or more of these parameters is small, the integral is harder to compute numerically.

There is no error checking in the code above. A more robust version would verify that all parameters are positive and that delta is less than 1.
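
Here is a sketch of what that more robust version might look like (the function name and the particular checks are just illustrative):

beta.ineq.checked <- function(a, b, c, d, delta)
{
    # verify the parameters are positive and delta is less than 1
    stopifnot(a > 0, b > 0, c > 0, d > 0, delta < 1)
    integrand <- function(x) { dbeta(x, a, b)*pbeta(x-delta, c, d) }
    integrate(integrand, delta, 1, rel.tol=1e-4)$value
}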

Here’s the solution to the corresponding problem for gamma random variables, provided delta is zero: A support one-liner.

And here is a series of blog posts on random inequalities.

Personalized medicine

When I hear someone say “personalized medicine” I want to ask “as opposed to what?”

All medicine is personalized. If you are in an emergency room with a broken leg and the person next to you is lapsing into a diabetic coma, the two of you will be treated differently.

The aim of personalized medicine is to increase the degree of personalization, not to introduce personalization. In particular, there is the popular notion that it will become routine to sequence your DNA any time you receive medical attention, and that this sequence data will enable treatment uniquely customized for you. All we have to do is collect a lot of data and let computers sift through it. There are numerous reasons why this is incredibly naive. Here are three to start with.

  • Maybe the information relevant to treating your malady is in how DNA is expressed, not in the DNA per se, in which case a sequence of your genome would be useless. Or maybe the most important information is not genetic at all. The data may not contain the answer.
  • Maybe the information a doctor needs is not in one gene but in the interaction of 50 genes or 100 genes. Unless a small number of genes are involved, there is no way to explore the combinations by brute force. For example, the number of ways to select 5 genes out of 20,000 is 26,653,335,666,500,004,000. The number of ways to select 32 genes is over a googol, and there isn’t a googol of anything in the universe. (See the quick check after this list.) Moore’s law will not get us around this impasse.
  • Most clinical trials use no biomarker information at all. It is exceptional to incorporate information from one biomarker. Investigating a handful of biomarkers in a single trial is statistically dubious. Blindly exploring tens of thousands of biomarkers is out of the question, at least with current approaches.
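
If you want to verify those counts, here is a quick check using R's choose and lchoose functions:

choose(20000, 5)              # about 2.67e19 ways to pick 5 genes out of 20,000
lchoose(20000, 32) / log(10)  # log10 of the count for 32 genes; above 100, i.e. more than a googol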

Genetic technology has the potential to incrementally increase the degree of personalization in medicine. But these discoveries will require new insight, and not simply more data and more computing power.

Big data cube

Erik Meijer’s paper Your Mouse is a Database has an interesting illustration of “The Big Data Cube” using three axes to classify databases.

The volume axis is big vs. small, or perhaps better, open vs. closed. Relational databases can be large, and non-relational databases can be small. But the relational database model is closed in the sense that “it assumes a closed world that is under full control by the database.”

The velocity axis is (synchronous) pull vs. (asynchronous) push. The variety axis captures whether data is stored by foreign-key/primary-key relations or key-value pairs.

Here are the corners identified by the paper:

  • Traditional RDBMS (small, pull, fk/pk)
  • Hadoop HBase (big, pull, fk/pk)
  • Object/relational mappers (small, pull, k/v)
  • LINQ to Objects (big, pull, k/v)
  • Reactive Extensions (big, push, k/v)

How would you fill in the three corners not listed above?

The anti-JavaScript

The problems with JavaScript come from premature standardization. The language’s author Brendan Eich said

I had to be done in ten days or something worse than JS would have happened.

For a programming language designed in 10 days, he did an amazing job. Maybe he did too good a job: his first draft was good enough to use, and so he never got a chance to fix the language’s flaws.

The opposite of JavaScript may be Perl 6. The language has been in the works for 12 years and is still in development, though there are compilers you can use today. An awful lot of thought has gone into the language’s design. Importantly, some early design decisions were overturned after the community had time to think, a luxury JavaScript never had.

Perl 6 has gotten a lot of ridicule for being so slow to come out, but it may have the last laugh. Someone learning Perl 6 in the future will not care how long the language was in development, but they will appreciate that the language was very thoughtfully designed.

* * *

Another contrast between JavaScript and Perl 6 is their names. Netscape gave JavaScript a deliberately misleading name to imply a connection to the Java language. The Perl 6 name honestly positions the new language as a successor to Perl 5.

Perl 6 really is a new language, compatible in spirit with earlier versions of Perl though not always in syntax. Damian Conway has suggested that perhaps Perl 6 should have been developed under a completely different name. Then after it was completed, the developers could announce, “Oh, and by the way, this language is the upgrade path for Perl.”

If you think of Perl 6 as a new language, your expectations are quite different than if you think of it as an upgrade. If it’s a new language, it doesn’t matter so much how long it was in development. Perl programmers would be pleased with how similar the new language is to their familiar one, rather than upset about the differences. And people would evaluate the new language on its merits rather than being prejudiced by previous experience with Perl.

Probability question

Suppose you have a large number of buckets and an equal number of balls. One at a time, you put each ball into a bucket chosen at random. When you’re done, about what proportion of the buckets will be empty?

One line of reasoning says that since you have as many balls as buckets, each bucket gets one ball on average, so nearly all the buckets get a ball.

Another line of reasoning says that randomness is clumpier than you think. Some buckets will have several balls. Maybe most of the balls will end up in buckets with more than one ball, and so nearly all the buckets will be empty.

Is either extreme correct or is the answer in the middle? Does the answer depend on the number n of buckets and balls? (If you look at the cases n = 1 and 2, obviously the answer depends on n. But how much does it depend on n if n is large?) Hint: There is a fairly simple solution.
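
If you want to check your eventual answer empirically, here is a small simulation sketch in R (the value of n is just illustrative):

n <- 10^6                                  # number of buckets and of balls
balls <- sample.int(n, n, replace = TRUE)  # bucket chosen for each ball
counts <- tabulate(balls, nbins = n)       # how many balls each bucket received
mean(counts == 0)                          # proportion of empty buckets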

What applications can you imagine for the result?

Pipelines and whirlpools

Walter Bright made an interesting point in his talk at GOTO Aarhus this afternoon. He says that software developers like to think of their code as pipelines. Data comes in, goes through an assembly line-like series of processing steps, and comes out the end.

input -> step1 -> step2 -> step3 -> output

And yet code hardly ever looks like that. Most software looks more like a whirlpool than a pipeline. Data swirls around in loops before going down the drain.

Arthur Rackham’s illustration for Poe’s “A Descent into the Maelström.”

These loops make it hard to identify parts that can be extracted into encapsulated components that can be snapped together or swapped out. You can draw a box around a piece of an assembly line and identify it as a component. But you can’t draw a box around a portion of a whirlpool and identify it as something that can be swapped out like a black box.

It is possible to write pipe-and-filter style apps, but the vast majority of programmers find loops more natural to write. And according to Walter Bright, it takes the combination of a fair number of programming language features to pull it off well, features which his D language has. I recommend watching his talk when the conference videos are posted. (Update: This article contains much of the same material as the conference talk.)

I’ve been thinking about pipeline-style programming lately, and wondering why it’s so much harder than it looks. It’s easy to find examples of Unix shell scripts that solve some problem by snapping a handful of utilities together with pipes. And yet when I try to write my own software like that, especially outside the context of a shell, it’s not as easy as it looks. Brian Marick has an extended example in his book that takes a problem which appears to require loops and branching logic and turns it into a pipeline. I haven’t grokked his example yet; I want to go back and look at it in light of today’s talk.
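
As a small illustration of pipeline style, here is a sketch in R (assuming R 4.1 or later for the native |> pipe, and using a made-up word-counting task). Each step consumes the previous step’s output, and there are no explicit loops.

c("the quick brown fox", "jumps over the lazy dog") |>
  strsplit(" ") |>     # split each line into words
  unlist() |>          # flatten into one vector of words
  table() |>           # count occurrences of each word
  sort(decreasing = TRUE)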

Open question turned into exercise

G. H. Hardy tells the following story about visiting his colleague Ramanujan.

I remember once going to see him when he was ill at Putney. I had ridden in taxi cab number 1729 and remarked that the number seemed to me rather a dull one, and that I hoped it was not an unfavorable omen. “No,” he replied, “it is a very interesting number; it is the smallest number expressible as the sum of two cubes in two different ways.”

This story has become famous, but the rest of the conversation isn’t as well known. Hardy followed up by asking Ramanujan what the corresponding number would be for 4th powers. Ramanujan replied that he did not know, but that such a number must be very large.

Hardy tells this story in his 1937 paper “The Indian Mathematician Ramanujan.” He gives a footnote saying that Euler discovered 635318657 = 158^4 + 59^4 = 134^4 + 133^4 and that this was the smallest number known to be the sum of two fourth powers in two ways. It seems odd now to think of such questions being unresolved. Today we’d ask Hardy “What do you mean 635318657 is the smallest known example? Why didn’t you write a little program to find out whether it really is the smallest?”

Surely someone has since written such a program and settled the question. But as an exercise, imagine the question is still open. Write a program to either come up with a smaller number that answer’s Hardy’s question, or prove that Euler’s number is the smallest one. To make the task more interesting, you might see whether you could do a little pencil-and-paper math up front to reduce the amount searching needed. Also, you might try writing the program in different styles: imperative, functional, etc.