Once in a while I need to know what characters are in a file and how often each appears. One reason I might do this is to look for statistical anomalies. Another reason might be to see whether a file has any characters it’s not supposed to have, which is often the case.
A few days ago Fatih Karakurt left an elegant solution to this problem in a comment:
fold -w1 file | sort | uniq -c
The fold utility breaks the content of a file into lines 80 characters long by default, but you can specify the line width with the -w option. Setting that to 1 makes each character its own line. Then sort prepares the input for uniq, and the -c option causes uniq to display counts.
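For example, given a hypothetical file letters.txt containing the single word abacus, the pipeline tallies each distinct character:

$ fold -w1 letters.txt | sort | uniq -c
      2 a
      1 b
      1 c
      1 s
      1 u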
This works on ASCII files but not Unicode files. For a Unicode file, you might do something like the following Python code.
import collections

count = collections.Counter()
file = open("myfile", "r", encoding="utf8")
for line in file.readlines():
    for c in line.strip("\n"):
        count[ord(c)] += 1
for p in sorted(list(count)):
    print(chr(p), hex(p), count[p])
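If myfile were, say, a two-line UTF-8 file containing 你好 and asdf (like the foo file in the comment below), the script would print each character, its code point in hex, and its count, ordered by code point:

a 0x61 1
d 0x64 1
f 0x66 1
s 0x73 1
你 0x4f60 1
好 0x597d 1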
“This works on ASCII files but not Unicode files.”
It actually works okay for me, but I suspect my LANG settings have something to do with that:
0] joshmyer@traceability:~ $ echo $LANG
en_US.UTF-8
0] joshmyer@traceability:~ $ cat foo
你好
asdf
0] joshmyer@traceability:~ $ LANG=C cat foo | fold -w1 | sort | uniq -c
1
1 a
1 d
1 f
1 s
1 你
1 好
collections.Counter(open(filename).read()).most_common()
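That last one-liner reads the whole file into memory and lets Counter tally the characters; most_common() returns (character, count) pairs sorted from most to least frequent. A minimal expanded sketch, assuming a UTF-8 file and the placeholder name myfile:

import collections

# Read the whole file at once; fine for small files. Assumes UTF-8.
with open("myfile", encoding="utf8") as f:
    counts = collections.Counter(f.read())

# most_common() yields (character, count) pairs, most frequent first.
for c, n in counts.most_common():
    print(repr(c), hex(ord(c)), n)

Unlike the script above, this also counts newline characters; repr makes them visible in the output.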