Random sampling to save money

I was stunned when my client said that a database query I had asked them to run would cost the company $100,000 per year. I had framed my question in the most natural way, not realizing that at the company’s scale it would be worth spending some time thinking about the query.

Things have somewhat come full circle. When computers were new, computer time was expensive and programmer time was dirt cheap by comparison. Someone might be scolded for using computer time to do what a programmer could do. Now of course programmer time is expensive and computer time is cheap. Usually.

But when dealing with gargantuan data sets, the cost of computer time might matter a great deal, especially when renting services from a cloud provider. Maybe you can find a more clever way to run an enormous query. Or even better, maybe you can find a way to avoid running an enormous query.

One way to save time and money is to base decisions on random samples rather than exhaustive queries. This requires a little preparation. How big a sample do you need? How exactly are you going to take a sample? What kind of uncertainty do you have in your result when you’re done? You can afford to think about these questions a long time if it saves tens of thousands of dollars per year.
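To make those three questions concrete, here is a minimal sketch for one common case: estimating what fraction of rows satisfy some condition. It assumes a simple random sample and the usual normal approximation for a proportion. The function names, the in-memory `population` list, and the `predicate` argument are hypothetical stand-ins for illustration; a real system would draw the sample inside the database rather than in Python.

```python
import math
import random
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p=0.5):
    """Rows needed so the estimate is within +/- margin at the given confidence.

    Uses n = z^2 * p * (1 - p) / margin^2. The default p = 0.5 is the
    worst case, so the result is conservative when p is unknown.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * p * (1 - p) / margin ** 2)


def estimate_proportion(population, n, predicate, confidence=0.95, seed=None):
    """Draw a simple random sample of n rows without replacement.

    Returns (estimate, half_width), where the approximate confidence
    interval is estimate +/- half_width.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    hits = sum(1 for row in sample if predicate(row))
    p_hat = hits / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, half_width


if __name__ == "__main__":
    # Hypothetical data: one million row ids, 10% of which match.
    rows = list(range(1_000_000))
    n = sample_size_for_proportion(margin=0.01)
    estimate, half_width = estimate_proportion(
        rows, n, lambda r: r % 10 == 0, seed=0
    )
    print(f"sampled {n} rows: {estimate:.3f} +/- {half_width:.3f}")
```

Under these assumptions the required sample size does not depend on the size of the table: roughly ten thousand rows give a margin of about one percentage point whether the table holds a million rows or a trillion. That is where the savings come from.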

One thought on “Random sampling to save money”

Ross

12 January 2022 at 14:21

Great post and I’d be happy to know more about this one!

It may interest you to know that a lot of the popular monitoring and metrics collection systems (if you like, “determining the cost of current in-flight operations”) work by sampling only. Further, they sample dynamically: they’re sensitive to the current busyness level of the system and reduce sampling so as not to present additional load.

Maybe there is a cost analog to this dynamic sampling that might be of use to you?

… interesting things in every corner!
