I started this post by wanting to look at the frequency of LaTeX commands, but then thought that some people might find the code to find the frequencies more interesting than the frequencies themselves.
So I’m splitting this into two posts. This post will look at the shell one-liner to find command frequencies, and the next post will look at the actual frequencies.
I want to explore LaTeX files, so I’ll start by using find to find such files.
find . -name "*.tex"
This searches for files ending in .tex, starting with the current directory (hence the .) and searching recursively into subdirectories. The find command explores subdirectories by default; you have to tell it not to if that’s not what you want.
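For example, if for some reason you wanted to restrict the search to the current directory alone, GNU and BSD find support a -maxdepth option:

find . -maxdepth 1 -name "*.tex"   # current directory only, no recursion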
Next, I want to use grep to search the LaTeX files. If I pipe the output of find to grep, it will search the file names, but I want it to search the file contents. The xargs command takes care of this, receiving the file names and passing them along as file names, i.e. not as text input.
find . -name "*.tex" | xargs grep ...
LaTeX commands have the form of a backslash followed by letters, so the regular expression I’ll pass is \\[a-z]+. This says to look for a literal backslash followed by one or more letters.
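As a quick sanity check, you can try the pattern on a sample line:

echo '\documentclass{article}' | grep -oE '\\[a-z]+'   # prints \documentclass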
I’ll give grep four option flags. I’ll use -i to ask it to use case-insensitive matching, because LaTeX commands can contain capital letters. I’ll use -E to tell it I want to use extended regular expressions [1].
I’m after just the commands, not the lines containing commands, and so I use the -o option to tell grep to return just the commands, one per line. But that’s not enough. It would be enough if we were only searching one file, but since we’re searching multiple files, the default behavior is for grep to return the file name as well. The -h option tells it to return only the matches, with no file names.
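To see what -h does, here’s a small sketch using two throwaway files (the file names are just for illustration):

printf '\\alpha\n' > a.tex
printf '\\beta\n'  > b.tex
grep -oiE  '\\[a-z]+' a.tex b.tex   # prints a.tex:\alpha and b.tex:\beta
grep -oihE '\\[a-z]+' a.tex b.tex   # prints just \alpha and \beta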
So now we’re up to this:
find . -name "*.tex" | xargs grep -oihE '\\[a-z]+'
Next I want to count how many times each command occurs, and I need to sort the output first so that uniq will count correctly.
find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c
And finally I want to sort the output by frequency, in descending order. The -n option tells sort to sort numerically, and -r says to sort in descending rather than the default ascending order. This produces a lot of output, so I pipe everything to less to view it one screen at a time.
find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c | sort -rn | less
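If you only want the most frequent commands, a small variation is to replace less with head:

find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c | sort -rn | head -20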
That’s my one-liner. In the next post I’ll look at the results.
[1] I learned regular expressions from writing Perl long ago. What I think of as simply a regular expression is what grep calls an “extended” regular expression, so adding the -E option keeps me out of trouble in case I use a feature that grep considers an extension. You could use egrep instead, which is essentially the same as grep -E.
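With egrep the one-liner becomes:

find . -name "*.tex" | xargs egrep -oih '\\[a-z]+' | sort | uniq -c | sort -rn | less

Note that recent versions of GNU grep consider egrep obsolescent and may print a warning, so grep -E is the safer spelling.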
You’ll want to add a -print0 to the find and a -0 to the xargs to handle filenames with spaces in them.
find . -name "*.tex" -print0 | xargs -P 0 -0 egrep -oih '\\[a-z]+' | sort | uniq -c | sort -rn | wc -l
That’s one reason why I don’t put spaces in file names. :)
Yes, that would make the code more general.
You may know this, but you can use the exec action in find to do many of the things for which you might use find … | xargs. This can have better performance, and it can also avoid issues with unusual characters such as whitespace in the file names without needing -print0.
find . -name '*.tex' -exec egrep -oih '\\[a-z]+' {} + | …
It’s also less to type than the safe -print0 | xargs -0 version, which I like.
The xargs version has the advantage that it can be used to start multiple jobs in parallel, which is handy sometimes. Gordon’s example using -P 0 demonstrates this.
Came here to mention the -exec option. I use it all the time, it’s easier for me to remember that syntax than xargs. I always screw up xargs.
Looks good, I do this all the time. This is a classic pattern straight from Doug McIlroy’s response to Knuth (the author of TeX!). I would also add that sort is O(n log n), so for large data sets or weak laptops, you’ll be happier with awk.
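A sketch of that awk approach, counting occurrences in a single pass and only sorting the distinct commands at the end (assuming the same pipeline as above):

find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' |
  awk '{count[$0]++} END {for (c in count) print count[c], c}' |
  sort -rn | less

The final sort is still O(n log n), but n is now the number of distinct commands rather than the total number of occurrences.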