Data file character frequencies

I have a little script that will print the frequency of the most common characters in a file and the number of lines. All numbers are displayed along with their factorizations. It also prints the number of non-ASCII characters.
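
The post doesn’t include the script itself, so here is a minimal sketch of what it might look like, assuming Python with sympy’s factorint for the factorizations; the command-line interface and output format are my guesses.

    import sys
    from collections import Counter
    from sympy import factorint

    def fmt(n):
        # format n alongside its prime factorization, e.g. 72 = 2^3 3^2
        factors = " ".join(f"{p}^{e}" if e > 1 else str(p)
                           for p, e in factorint(n).items())
        return f"{n}  =  {factors}"

    text = open(sys.argv[1], encoding="utf-8").read()
    counts = Counter(ch for ch in text if ch != "\n")

    # five most common characters, with factored counts
    for ch, n in counts.most_common(5):
        print(f"{ch}   {fmt(n)}")

    print("\nNum lines: ", fmt(text.count("\n")))

    # report non-ASCII characters, if any
    non_ascii = sum(n for ch, n in counts.items() if ord(ch) > 127)
    print("\nAll ASCII characters" if non_ascii == 0
          else f"\n{non_ascii} non-ASCII characters")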

CSV files

These simple statistics are surprisingly useful. For example, when I ran it on a CSV file that I downloaded recently I got the following.

    ,   397907424  =  2^5 3^3 19 24239 
    0   58200944  =  2^4 1699 2141 
    2   52955465  =  5 467 22679 
    1   46413310  =  2 5 23 201797 
    3   34811225  =  5^2 1392449 

    Num lines:  1745208  =  2^3 3^2 24239 

    All ASCII characters

This strongly implies that the CSV file really is a CSV (comma-separated value) file. Sometimes you’ll get a file with a .csv extension but the separator is a tab, a pipe, or some other character.

The number of commas is a multiple of the number of lines. That’s a good sign. Apparently this is a CSV file with 12 × 19 = 228 columns and 1,745,208 rows [1]. If the number of separators is not a multiple of the number of lines, maybe some lines are incomplete. Or maybe your field separator appears inside a quoted string. This is not necessarily a problem, but it means the most naive parsing won’t work.
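
The divisibility check is one line of arithmetic. A sketch, using the counts above:

    # the separator count should be an exact multiple of the line count
    n_commas, n_lines = 397907424, 1745208
    print(n_commas % n_lines == 0)   # True
    print(n_commas // n_lines)       # 228 commas per line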

In the file above, the most common characters, other than commas, are digits, so the file probably contains mostly numeric data.

If your file contains quotation marks, you’d better hope it contains an even number of them. Even better, an even multiple of the number of lines. If not, you have some troubleshooting to do.

Incidentally, whenever the subject of CSV files comes up, someone will say “Why are you using CSV files?! Don’t you know there are better formats?” My reply is that I’m a consultant and I take data in whatever format I can get it, and that most often means a delimiter-separated text file. That works fine, except, of course, when it doesn’t.

Unicode characters

A file with lots of non-ASCII characters is not a problem. A file with one non-ASCII character very often is a problem.

A single non-ASCII character could be an invisible character that will gum up parsing. This can be maddening to find if you’re relying on visual inspection. But if you know there’s a non-ASCII character where it shouldn’t be, such as in a file of digits and commas, then you can simply delete it.
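
Locating the offender is easier in bytes than by eye. A sketch, assuming Python and a hypothetical file name:

    # report the position and context of any non-ASCII bytes
    data = open("data.csv", "rb").read()
    for i, b in enumerate(data):
        if b > 127:
            print(f"byte {i}: 0x{b:02x}, context {data[max(0, i-10):i+10]!r}")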

JSON

If you’re inspecting a JSON file, you’d expect to see lots of braces. Hopefully you have an equal number of open and close braces. But if not, you know where to begin troubleshooting. You should also expect a lot of colons. Knowing the number of braces and colons gives you a clue to the structure of the file.
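
A sketch of that check, again assuming Python and a hypothetical file name:

    # rough structural check: brace balance and colon count
    text = open("data.json", encoding="utf-8").read()
    opens, closes = text.count("{"), text.count("}")
    if opens != closes:
        print(f"unbalanced braces: {opens} open, {closes} close")
    print(f"{text.count(':')} colons")   # rough: also counts colons inside strings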

Troubleshooting and guessing

When a file has no complications, the stats above tell you things you’d know from looking at the first line or two of the file. However, when there are complications, the stats can be useful.

The stats could also be useful in a context where it’s OK to make guesses. For example, you might have a script that guesses the structure of a file and proceeds accordingly. That’s fine when wrong guesses lead to obviously wrong output. It’s hard to imagine, for example, that mistaking an XML file for a CSV file would produce a subtle error.


[1] The number of fields is one more than the number of separators. Or maybe not: there may be a trailing separator after the last field. So in this case there may be 228 or 229 columns.

5 thoughts on “Data file character frequencies”

  1. Lovely post. I’m frequently in the same boat with software clients. Folks seem astonished at what command-line tools can accomplish.

    Factorization of counts is fun but outside my client remit.

  2. That’s pretty cool. Me, I’d add a feature where it reports any character that occurs fewer than 5 times, or maybe less than 0.001% of the time (see the sketch below).

    It would have helped me with the time a consultant decided to write a tool to re-write all of our .java files, and they ended up replacing the final newline in every file with a NULL byte. This was undetectable by almost every tool, except, of course, the Java compiler.
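
    A sketch of that rare-character report, assuming Python and a hypothetical file name:

        # flag characters that occur fewer than 5 times or under 0.001% of the file
        from collections import Counter
        counts = Counter(open("data.csv", encoding="utf-8").read())
        total = sum(counts.values())
        for ch, n in sorted(counts.items(), key=lambda kv: kv[1]):
            if n < 5 or n / total < 1e-5:   # 0.001% = 1e-5
                print(f"{ch!r}: {n}")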

  3. Some CSV variants allow line breaks to appear in quoted fields (e.g. RFC 4180), so “troubleshooting” might just be upgrading your parser. :-)

  4. This is very cool, and I believe I will write a similar script for my own repertoire.

    I often find myself presented with files which have no indication as to format. I’ve gotten surprisingly good at visually inspecting the hexdump of the first ~1kb of a file and guessing what sort of data it might be. Most people can identify base64-encoded data by sight, of course, and with a little practice it’s not that hard to recognize the headers for most common compression algorithms.

    What’s especially interesting to me, however, is encrypted data. Every now and again I’m provided a binary blob (or, more frequently, a base64-encoded file which decodes to a binary blob) which has no recognizable format to it. It does not appear to be compressed, but it has two very interesting characteristics: first, the data do not appear structured in any way; and second, if you map the distribution of each byte in the file, you get a surprisingly even distribution across 0x00 – 0xff. There are two common systems I know of which tend to write data in formats like this: compression algorithms (which write headers) and encryption algorithms (which do not).

    Therefore, a modified version of your program which reports the per-byte distribution would be a good first-order approximation as to whether the file you are looking at is encrypted or not.
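
    A sketch of that per-byte report, assuming Python and a hypothetical file name:

        # byte-value histogram; a near-uniform spread over 0x00-0xff with no
        # recognizable header suggests encrypted rather than compressed data
        from collections import Counter
        data = open("blob.bin", "rb").read()
        counts = Counter(data)
        print(f"{len(counts)} distinct byte values out of 256")
        for b, n in counts.most_common(5):
            print(f"0x{b:02x}: {n / len(data):.4%}")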
