File character counts

Once in a while I need to know what characters are in a file and how often each appears. One reason I might do this is to look for statistical anomalies. Another reason might be to see whether a file has any characters it’s not supposed to have, which is often the case.

A few days ago Fatih Karakurt left an elegant solution to this problem in a comment:

    fold -w1 file | sort | uniq -c

The fold command breaks the content of a file into lines 80 characters long by default, but you can specify the line width with the -w option. Setting the width to 1 puts each character on its own line. Then sort brings identical lines together, which uniq requires, and the -c option causes uniq to prefix each distinct line with its count.
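
If it helps to see the same three steps outside the shell, here is a rough Python sketch of the idea. It is only an illustration, not a drop-in replacement: the helper name `pipeline_counts` is made up, it works on a string already in memory rather than a file, it skips empty lines (which the shell pipeline would report as a blank entry), and it sorts by code point, whereas sort's order depends on your locale.

```python
from collections import Counter

def pipeline_counts(text):
    """Roughly mimic `fold -w1 | sort | uniq -c` on a string (hypothetical helper)."""
    # fold -w1: one character per line. Newlines end a line,
    # so they never show up as characters of their own.
    chars = [c for line in text.split("\n") for c in line]
    # sort | uniq -c: group identical characters and count each group.
    counts = Counter(chars)
    return sorted(counts.items())

# Print in the same "count character" layout that uniq -c uses.
for ch, n in pipeline_counts("hello\nworld"):
    print(n, ch)
```

For files of any size the one-liner is still the more convenient tool; the sketch is just meant to make each stage of the pipeline explicit.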

This works on ASCII files but not Unicode files. For a Unicode file, you might do something like the following Python code.

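    import collections

    count = collections.Counter()
    # Decode as UTF-8 so each item counted is a code point, not a byte.
    with open("myfile", "r", encoding="utf8") as file:
        for line in file:
            # Drop the line terminator so newlines are not counted.
            for c in line.rstrip("\n"):
                count[ord(c)] += 1

    # Print each character with its code point and count, in code point order.
    for p in sorted(count):
        print(chr(p), hex(p), count[p])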
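
One way to see why a byte-oriented tool can stumble on Unicode text is to compare counts of bytes with counts of code points. The sketch below assumes UTF-8 and uses a short sample string rather than a file. Whether a particular fold splits on bytes or on characters depends on the implementation and the locale, so treat this as an illustration of the failure mode rather than a statement about your system.

```python
from collections import Counter

text = "你好\nasdf\n"          # sample text, chosen only for illustration
data = text.encode("utf-8")    # the bytes as they would sit in a UTF-8 file

# Counting code points: each Chinese character is a single item.
by_char = Counter(text.replace("\n", ""))

# Counting bytes: each Chinese character contributes three separate bytes,
# none of which is a valid character on its own.
by_byte = Counter(data.replace(b"\n", b""))

print(len(by_char), "distinct characters")
print(len(by_byte), "distinct byte values")
for b, n in sorted(by_byte.items()):
    print(hex(b), n)
```

A tool that puts each byte on its own line would report the byte values in the second tally, not the characters in the first.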

2 thoughts on “File character counts”

  1. “This works on ASCII files but not Unicode files.”

    It actually works okay for me, but I suspect my LANG settings have something to do with that:
    0] joshmyer@traceability:~ $ echo $LANG
    en_US.UTF-8
    0] joshmyer@traceability:~ $ cat foo
    你好
    asdf
    0] joshmyer@traceability:~ $ LANG=C cat foo | fold -w1 | sort | uniq -c
    1
    1 a
    1 d
    1 f
    1 s
    1 你
    1 好
