Server utilization: Joel on queuing

Joel Spolsky of Joel on Software fame mentions queuing theory in the latest StackOverflow podcast. He mentions a rule of thumb that wait times go up quickly as server utilization exceeds 80%. The same principle applies whether you’re talking about computer servers or human servers. I hadn’t heard of that rule of thumb, though I knew that you don’t want anywhere near 100% utilization. Here’s the graph that justifies Joel’s remark.

graph of wait time as a function of utilization, service time fixed

The vertical axis is wait time. The horizontal access is utilization.

Here are the details. Basic queuing theory assumes customer arrivals are Poisson distributed with rate λ. Service times are exponentially distributed with rate μ. The ratio λ/μ is called utilization ρ. If this ratio is greater than 1, that says customers are arriving faster than they can be served, and so the line will grow without bound. If the ratio is less than 1, the line will reach some steady state on average.

The average waiting time is W = 1/(μ-λ). Now assume the service time μ is fixed and the arrival rate λ = ρ μ. Then W = 1/μ(1-ρ) and so the wait time is proportional to 1/(1-ρ). As the utilization ρ approaches 1, the wait time goes to infinity. The graph above plots 1/(1-ρ). As Joel said, the curve does go up quickly when ρ exceeds 0.8.

Related post: What happens when you add a new teller?

For daily posts on probability, follow @ProbFact on Twitter.

ProbFact twitter icon

Whatever happened to system administration?

You hardly hear anyone speak of “system administration” anymore. Now everything is “IT” (information technology). System administrators are now “IT professionals.” Why the change? Has it become politically incorrect to call the people who administer computer systems “system administrators”?

I’ve never liked the term “information technology.” For one thing, it has a confusing abbreviation. For another, it’s terribly vague. It would seem to include programming, but I think  programming is about the only thing it doesn’t include.

Exascale computing

In this podcast, Peter Kogge explains what it would take to increase our computing capacity by a factor of 1000. We can’t just do more of what we’re doing now and scale linearly. For example, one limitation is power consumption. Data centers now consume 1.5% of the electricity on the U.S. power grid. Nevertheless, Kogge outlines ways we could achieve a 1000-fold increase in computing capacity by 2015.

Six degrees of Paul Erdős

The goal of the game Six Degrees of Kevin Bacon is to connect actors to Kevin Bacon in as few steps as possible. For example, Elvis Presley can be linked to Kevin Bacon in two steps: Elvis Presley made a movie with Ed Asner, and Ed Asner made a movie with Kevin Bacon. The game could be played with other actors, but the advantage of using Kevin Bacon is that he’s made movies with a lot of other actors.

Mathematicians have a similar and older game, connecting each other to Paul Erdős via a chain of papers. Like Kevin Bacon, Erdős was a social hub. He wrote more math papers than anyone else and usually collaborated with other authors.

The length of the smallest chain connecting someone to Paul Erdős is called that person’s  Erdős number. Erdős has Erdős number 0. Anyone who wrote a paper with Erdős has Erdős number 1, etc. My Erdős number is 4 because I wrote a paper with Ralph Showalter, who wrote a paper with Charles Radin, who wrote a paper with John Horton Conway, who wrote a paper with Erdős. I’ve written papers with other folks with Erdős number 3, but not with anyone with a lower number, so my number is 4. I calculated my Erdős number after learning about this  web site from Dave Richeson’s six degrees of separation blog post.

Paul Erdős was a pure mathematician. I wonder who might be the Paul Erdős of applied math, someone who has written a large number of applied math articles, especially with many collaborators. It would be hard to say since applied math is a fuzzy classification, blending into other disciplines. Still, any nominations?

Related post: Publish or perish

Click to learn more about consulting for complex networks


Cost-benefit analysis versus benefit-only analysis

Hardly anyone cares about statistics directly. People more often care about decisions they need to make with the help of statistics. This suggests that the statistics and decision-making process should be explicitly integrated. The name for this integrated approach is “decision theory.” Problems in decision theory are set up with the goal of maximizing “utility,” the benefit you expect to get from a decision. Equivalently, problems are set up to minimize expected cost. Cost may be a literal monetary cost, but it could be some other measure of something you want to avoid.

I was at a conference this morning where David Draper gave an excellent talk entitled Bayesian Decision Theory in Biostatistics: the Utility of Utility.  Draper presented an example of selecting variables for a statistical model. But instead of just selecting the most important variables in a purely statistical sense, he factored in the cost of collecting each variable. So if two variables make nearly equal contributions to a model, for example, the procedure would give preference to the variable that is cheaper to collect. In short, Draper recommended a cost-benefit analysis rather than the typical (statistical) benefit-only analysis. Very reasonable.

Why don’t people always take this approach? One reason is that it’s hard to assign utilities to outcomes. Dollar costs are often easy to account for, but it can be much harder to assign values to benefits. For example, you have to ask “Benefit for whom?” In a medical context, do you want to maximize the benefit to patients? Doctors? Insurance companies? Tax payers? Regulators? Statisticians? If you want to maximize some combination of these factors, how do you weight the interests of the various parties?

Assigning utilities is hard work, and you can never make everyone happy. No matter how good of a job you do, someone will criticize you. Nearly everyone agrees in the abstract that considering utilities is the way to go, but in practice it is hardly ever done. Anyone who proposes a way to quantify utility is immediately shot down by people who have a better idea. The net result is that rather than using a reasonable but  imperfect idea of utility, no utility is used at all. Or rather no explicit definition of utility is used. There is usually some implicit idea of utility, chosen for mathematical convenience, and that one wins by default. In general, people much prefer to leave utilities implicit.

In the Q&A after his talk, Draper said something to the effect that the status quo persists for a very good reason: thinking is hard work, and it opens you up to criticism.


Click to learn more about Bayesian statistics consulting


Talent alone won’t pay the bills

Here’s a Washington Post article about an experiment. The newspaper asked world-famous violinist Joshua Bell to dress down and perform as a street musician at a DC metro stop. Hardly anyone paid attention. He earned $32.17 in tips. People routinely pay over $100 for a chance to listen to Joshua Bell in a concert hall, but he earned an average of about $0.03 per passer-by that day.

Suppose Bell were trying to support himself as a street musician. What might he say to himself after a disappointing day at the metro? That if only he had practiced a little harder he would have made more tips?

Two myths I learned in college: bathtub drains and airplane wings

Here are two things I was taught in college that were absolutely wrong.

  • The Coriolis effect explains why bathtubs drain clockwise in the northern hemisphere.
  • The Bernoulli effect explains how planes fly.

These are not things that scientists believed at the time but were later disproved. They’re simply myths. (No doubt I’ve passed on a few myths to students over the years. If I can think of any, I’ll retract them here.)

The Coriolis effect does explain why cyclones rotate one way in the northern hemisphere and the opposite way in the southern hemisphere. The rotation of the earth influences the rotation of large bodies of fluid, like weather systems. However, a bathtub would need to be maybe a thousand miles in diameter before the Coriolis effect would determine how it drains. Bathtubs drain clockwise and counterclockwise in both hemispheres. Random forces such as sound in the air have more influence than the Coriolis effect on such small bodies of water.

The explanation  I learned in college for how airplanes fly involves the Bernoulli principle. The shape of a wing is such that air has to travel further over the top of the wing than under the bottom of the wing.

Since air particles going across the top and bottom of the wing arrive at the trailing edge at the same time, the particles going over the top travel further and so are spread further apart. This results in lower pressure over the top of the wing and lifts the airplane.

There are a couple problems with this explanation. When the air particles split at the leading edge, why should the ones that go over the top and the ones that go under the bottom arrive at the trailing edge at the same time? In fact, they don’t. Also, while the Bernoulli effect does explain part of the lift on an airplane wing, the effect is an order of magnitude too small to account for why airplanes fly.

If the Bernoulli principle explains lift, how can an airplane fly upside-down? Wouldn’t the Bernoulli effect suck the plane down rather than lifting it up when you turn the wing over?

Here’s a  set of recordings from a lecture that debunks the Bernoulli effect myth and explains the real reason airplanes fly. Why airplanes fly: a modern myth. Part I, Part II.  There will be a Part III that hasn’t been released yet.

Update: Apparently these links have gone away.

Apple Mac turns 25 today

The Mac came out 25 years ago today. Here’s the article. Hat tip @divbyzero.

I sold Macs for a little while. I worked at the UT computer store one summer. At the time, that store was the biggest distributor of Macs in the country.

I’ve never owned a Mac. I think it would be great fun, but it’s hard for me to justify the expense. If someone wanted to give me a Mac, I’d start blogging about Mac stuff. 🙂

Splitting a convex set through its center

Three days ago I raised the question How unevenly can you split a convex set through its center? Scott had asked in the comments what kind of sets split most unevenly through their centers. I didn’t have an answer at the time, but now I have something.

Start with a triangle. Draw a line from the middle of the base up to the top. The center of mass is located 1/3 of the way up that line.

Now draw a horizontal line through the center of mass. The triangle above this line is similar to the entire triangle, but will have 2/3 of the original height. By similarity, it also has 2/3 of the base, and so it has 4/9 of the area of the entire triangle. The portion below the center of mass has 5/9 of the area.

Now let’s go up a dimension and imagine a tetrahedron sitting on a table. Draw a line from the center of the base up to the top. The center of mass for the tetrahedron will be located along that line, 1/4 of the way up. Pass a plane through the center of mass parallel to the table. This forms a new tetrahedron above the cutting plane, similar to the original tetrahedron. The new tetrahedron has 3/4 of the height, and by similarity, it has (3/4)3 = 27/64 of the original volume. Of course the solid below the cutting plane then has 37/64 of the original volume.

Now consider the general case. The center of mass of an n-simplex is located 1/(n+1) of the way up along a line running from the center of the base to the top. Cut the n-simplex through the center of mass with a hyperplane parallel to the base. The n-simplex above the hyperplane has n/(n+1) of the original height and (n/(n+1))n of the original volume. As n goes to infinity, this expression converges to 1/e.

In summary, we’ve shown that a convex set in Rn can have at least 1 – (n/(n+1))n of its volume on one side of a hyperplane through the center of mass by constructing such a set, namely any n-simplex. And as n goes to infinity, this volume approaches 1 – 1/e. We have not shown that an n-simplex is the worst case in each dimension n, though I suspect this is true.  If you take a ball in Rn, any plane through the center divides the ball exactly in half. In some sense, an n-simplex is as far as a convex set can get from a ball. It’s the least-round convex shape.

Update: Here’s a paper on the topic. B. Grunbaum, “Partitions of mass-distributions and convex bodies by hyperplanes,” Pacific Journal of Mathematics. 10, 1257–1261, 1960.

Twitter is not micro-blogging

Twitter is often described as a micro-blogging platform. Twitter posts and like blog posts, except they’re limited to 140 characters (so they fit in a cell phone text message). You subscribe to Twitter posts (called “tweets”) sorta like you subscribe to a blog. Some people, like Kathy Sierra, do use Twitter for micro-blogging. Her tweets are little self-contained messages, often one sentence. Here’s a recent example:

Much as I liked Outliers, makes me cringe to see people focus on “it’s all luck/chance” rather than the “it takes 10,000 hours” part.

Nice observation, all in 133 characters. (She’s talking about Malcolm Gladwell’s book Outliers. He argues that success is a matter of accumulated lucky advantages plus around 10,000 hours of deliberate practice.)

Maybe the most common form of micro-blogging is link sharing. Here’s an example, again from Kathy Sierra.

…and for those whose head literally explodes over misuse of the word “literally”

Some people use Twitter as a question-and-answer forum. This is sorta like blogging and inviting comments, and it can be very powerful.

But a lot of traffic on Twitter is not what I’d consider micro-blogging.  It’s more like a public form of instant messaging. I found this disorienting when I first started using Twitter. Who are they talking to? Context?! Why would I want everyone to see my instant messages? I suppose it’s an acquired taste.

Everyone on Twitter has some mixture of micro-blogging, Q&A, and instant messaging. Some people love the instant messaging-style conversations. To each his own. My preferred mix is weighted toward micro-blogging and Q&A.

I’m on Twitter at @johndcook.

Update (29 March 2010): It’s been more than a year since I first wrote this post. I now use the instant messaging aspect of Twitter a little more than I did then, though I still prefer the micro-blogging aspect. And I’ve created several daily tip accounts that are pure microblogs.

Probability distributions and object oriented programming

This post looks at applying object oriented programming ideas to probability distribution software. It explains the Liskov Substitution Principle and shows how it can keep you from falling into a subtle trap.

One of the big ideas in object-oriented programming is to organize software into units of data and related functions that represent things in your problem domain. These units are called classes. Particular instances of classes are called objects. For example, a business application could have a customer class. Particular customers are represented by customer objects.

The C++ numerical library that we developed at MDACC has classes that represent probability distributions. A probability distribution class contains methods (functions) such as PDF, CDF, Mean, Mode, etc.  For example, the NormalDistribution class represents normal distributions. Particular NormalDistribution objects each have their own mean and variance parameters.

Another big idea of object-oriented programming is inheritance, a way to organize classes into a hierarchy. This is where things get more interesting.

Inheritance is commonly described as an “is a” relationship. That explanation is often helpful, but sometimes it can get you into trouble. (Listen to this interview with Robert Martin for an explanation.) Probability distributions illustrate when “is a” should be represented by inheritance and when it should not be.

A beta distribution is a continuous distribution. So is a normal distribution. The BetaDistribution and NormalDistribution classes representing these probability distributions both derive from ContinuousDistribution class. This makes it possible to write generic code that operates on continuous distributions. Later we could pass in a particular type of  continuous distribution rather than having to write special code for every kind of continuous distribution.

Now think about a chi square distribution. A chi square distribution with ν degrees of freedom is a gamma distribution with shape ν/2 and scale 2. So in a mathematical sense, a chi square distribution “is a” gamma distribution. But should a class representing a chi square distribution inherit from a class representing a gamma distribution? The surprising answer is “no.” A rule called the “Lyskov Substitution Principle” (LSP) says this is a bad idea.

When a class X inherits from a class Y, we say X is the derived class and Y is the base class. The LSP says code should work without surprises when an instance of a derived class is passed into a function written to receive instances of the base class.  Deriving a BetaDistribution class from a ContinuousDistribution class should not lead to any surprises. A function that handles continuous distributions in general should work just fine when you give it a specific distribution such as a beta, normal distribution, etc.

Now suppose we derive our ChiSquareDistribution class from the GammaDistribution class. Suppose also we have a function that expects a GammaDistribution object. What happens if we pass it a ChiSquareDistribution? Maybe the function works with no surprises. If the function calls methods like PDF or CDF there’s no problem. But what if the function calls a SetParameters method specifies the shape and scale of the distribution? Now we have a problem. You can’t set the shape and scale independently for a chi square distribution.

If you try to make this work, you’re going to dig yourself into a hole. The code can’t be intuitive: two people could have reasonable but different expectations for how the code should behave. And attempts to patch the situation are only going to make things worse, introducing awkward dependencies and generally entangling the code. The LSP says don’t go there. From an object oriented programming view point, the gamma and chi square distributions are simply unrelated. Neither derives from the other.

The canonical explanation of the LSP uses squares and rectangles. Geometrically, a square is a special type of rectangle. But should a Square class derive from a Rectangle class? The LSP says no. You can’t set the length and width of a square independently. What should a Square class do when someone tries to set its length and width? Ignore one of them? Which one? Suppose you just use the length input and set the width equal to the length. Now you’ve got a surprise: setting the length changes the width, not something you’d expect of rectangles. Robert Martin does a good job of explaining this example in the interview mentioned above. On the other hand, if a Square class and a Rectangle class both derive from a Shape class, code written to act on Shape objects will work just fine when passed either Square objects or Rectangle objects.

Programmers will argue till they’re blue in the face over whether a Square “is a” Rectangle, or vice versa, or neither. The resolution to the argument is that inheritance does not mean “is a.” The idea of “is a” is often useful when thinking about inheritance, but not always. Of course a square is a rectangle, but that does not mean it’s wise to derive a Square class from a Rectangle class. Inheritance actually has to do with interface contracts. The pioneers of object oriented programming did not use the term “is a” for inheritance. That terminology came later.

So although a chi square distribution is a gamma distribution, a ChiSquareDistribution class should not inherit from a GammaDistribution class, just as a square is a rectangle but a Square class should not inherit from a Rectangle class. On the other hand, chi square and gamma distributions are continuous distribution, and it’s fine for ChiSquareDistribution and GammaDistribution classes to inherit from a ContinuousDistribution class, just as it’s fine for Square and Rectangle classes to derive from a Shape class. The difference is a matter of software interface functionality and not philosophical classification.

Click to find out more about consulting for statistical computing


Free optimization software from Microsoft

This morning I stumbled across Microsoft Solver Foundation, optimization software developed in C#. The site only mentions a free “express edition.” Sounds like they’re releasing a free version first and may sell an upgrade in the future.

Here are the details of the algorithms supported.

  • Revised, Simplex Linear and Mixed Integer Programming (Primal and Dual Simplex)
  • Interior Point Method Linear and Quadratic Programming
  • Constraint Programming with Exhaustive Tree Search, Local Search, and Metaheuristic Techniques
  • Compact, Quasi-Newton (L-BFGS), Unconstrained Nonlinear Programming

Microsoft is inconsistent in its support for numerical computing. For example, Visual Studio’s math.h implementation does not include the mathematical functions that are included in POSIX systems. And their implementation of C++ TR1 does not yet include the specified mathematical functions. On the other hand, they have produced products like Windows HPC Server and Solver Foundation. Clearly some people at Microsoft care about numerical computing.

Click to find out more about consulting for numerical computing


Distribution of time customers spend in coffee shops

How would you model the time customers spend in a coffee shop?

This post is pure speculation based on no hard data whatsoever, which makes things considerably easier! If anyone has data or suggestions, please leave a comment. Here goes a first attempt.

The time people spend in a coffee shop depends on why they are there.

  1. Some grab their coffee and go.
  2. Some are there to visit with a friend.
  3. Some drink their coffee (alone) and leave.
  4. Some are there to work.

Each group would have its own time distribution, and the overall distribution would be a mixture of these distributions. Since I’m doing this for fun, I’ll ignore (1) and (2) and just concentrate on (3) and (4). I’ll also ignore complications such as how patterns change throughout the day and how they change according to the day of the week.

Say someone comes in alone to have a cup of coffee. Maybe they stay an average of 15 minutes. I’ll assume the time these folks spend in a coffee shop is normally distributed. Not many stay more than 30 minutes, so let’s say the standard deviation is 5 minutes. That would put only about 0.4% staying longer than 30 minutes. It would be more realistic to truncate the distribution at zero to eliminate the small probability of spending negative time in the coffee shop (!) and  skew the distribution a little to the  right, giving more probability to people staying more than 30 minutes.

The people who come to the coffee shop to work stay considerably longer than the folks who are just there to drink a cup of coffee. And their time distribution would be heavily skewed. These folks are unlikely to stay less than 30 minutes, so the distribution would drop off sharply on the left. There’s a wide variety of how long people might work, so I’d expect a long tail to the right. The inverse gamma distribution fits this description. Say there’s a 5% chance that a worker will stay less than 30 minutes, and a 5% chance they’ll stay more than two hours. Using this software to solve for parameters, we find a shape parameter of 6.047 and a scale parameter of 317.3 fits the time distribution in minutes. This distribution has a mean of about 63 minutes, which I suppose is reasonable.

Here’s what the graphs of the two distributions would look like: a symmetric distribution centered at 15 minutes for the drinkers and a skewed distribution centered around 63 minutes for the workers.

Now suppose 70% of customers are drinkers and 30% are workers. Then the mixture distribution would look like this.

As the percentage of workers goes down, so does the second hump in the graph. If a coffee shop had about 20% drinkers and 80% workers, the two humps would be about the same height.

How would you include people who come to a coffee shop with a friend?