How to calculate percentiles in memory-bound applications

I just published a new article on CodeProject: Calculating percentiles in memory-bound applications.

Finding the percentiles of a list of samples is trivial if you can read all the samples into memory and sort the list. If the list is too big to fit into memory, it’s still not terribly difficult, but there are a couple subtleties if you haven’t thought about it before.

The application that motivated the code presented in the article generates a large amount of data. You could think of the software as generating an enormous matrix one row at a time, then computing percentiles of each column. For example, in one problem the matrix had 800,000 columns and 2,000 rows of floating point numbers.

Random number generator controversy

I submitted an article to Code Project yesterday, Simple Random Number Generation, describing a small C# class called SimpleRNG that uses George Marsaglia’s WMC algorithm. The article was posted around 5 PM (central US time) and comments started pouring in right away. I didn’t expect any feedback on a Friday afternoon or Saturday morning. But as I write this post, there have been 580 page views and 11 comments.

There have been three basic questions raised in the comments.

  1. Why not just use the random number generator that comes with .NET?
  2. Is this code suitable for cryptography?
  3. Is this code suitable for Monte Carlo applications?

Why not use the built-in generator? For many applications, the simplest thing would be to use the .NET random number generator. But there are instances where this might not be best. There are questions about the statistical quality of the .NET generator; I’ll get to that in a minute. The primary advantages I see to the SimpleRNG class are transparency and portability.

By transparency I mean that the internal state of the generator is simple and easy to access. When you’re trying to reproduce a result, say while debugging, it’s convenient to have full access to the internal state of the random generator. If you’re using your own generator, you can see everything. You can even temporarily change it: for debugging, it may be convenient to temporarily have the “random” generator return a very regular, predictable sequence.

By portability I do not necessarily mean moving the code between operating systems. The primary application I have in mind is moving the algorithm between languages. For example, in my work we often have prototype code written in R that needs to be rewritten in C++ for efficiency. If the code involves random number generation, the output of the prototype and the rewrite cannot be directly compared, only compared on average. Then you have to judge whether the differences are to be expected or whether they indicate a bug. But if both the R and the C++ code use the same RNG algorithm and the same seed, the results may be directly comparable. (They still may not be directly comparable due to other factors, but at least this way the results are often comparable.)

As for cryptography, no, SimpleRNG is not appropriate for cryptography.

As for Monte Carlo applications, not all Monte Carlo applications are created equal. Some applications do not require high quality random number generators. Or more accurately, different applications require different kinds of quality. Some random number generators break down when used for high-dimensional integration. I suspect SimpleRNG is appropriate for moderate dimensions. I use the Mersenne Twister generator for numerical integration. However, SimpleRNG is faster and much simpler; the MT generator has a very large internal state.

Someone commented on the CodeProject article that the random number generator in .NET is not appropriate for Monte Carlo simulation because it does not pass Marsaglia’s DIEHARD tests while SimpleRNG does. I don’t know what algorithm the .NET generator uses, so I can’t comment on its quality. Before I’d use it in statistical applications, I’d want to find out.

Text reviews for software

When users find spelling and grammar errors in your software, your credibility takes a hit. But apparently very few software projects review the text their software displays. I imagine the ones that do review their text use a combination of two leaky methods: asking execution testers to take note of prose errors, and requiring that all text displayed to users be stored in a string table.

There are a couple problems with asking execution testers to be copy editors. First, they’re not copy editors. They may not recognize a grammatical error when they see it. Second, they only see the text that their path through the software exposes. Messages displayed to the user under unusual circumstances slip through testing.

String tables are a good idea. They can be reviewed by a professional editor. (Or translator, if you’re application is internationalized.) But it’s difficult to make sure that every string the user might see is in the string table. When you need to add a few quick lines of error-handling code, it’s so easy to just include the text right there in the code rather than adding an entry to the string table. After all, you say to yourself, the code’s probably not going to run anyway.

My solution was to write a script that extracts all the quoted text from a source tree so it can be reviewed separately. The script tries to only pick out strings that a user could see, filtering out, for example, code quoted inside code. Doing this perfectly would be very hard, but by tolerating a small error rate, the problem can be solved quickly in a few lines of code. I’ve used this script for years. Nearly every time I run it I discover potentially embarrassing errors.

In addition to helping with copy editing, an extract of all the string literals in a project gives an interesting perspective on the source code. For example, it could help uncover security risks such as SQL injection vulnerabilities.

I’ve posted an article on CodeProject along with the script I wrote.

PowerShell Script for Reviewing Text Shown to Users

The script on CodeProject is written for Microsoft’s PowerShell. If anyone would like a Perl version of the script, just let me know. I first wrote the script in Perl, but then moved it to PowerShell as my team was moving to PowerShell for all administrative scripting.