LaTeX command frequencies

In the previous post I present a bash one-liner to search directories for LaTeX files and count the commands used.

College files

I first tried this out on a directory that included some old files from grad school. I chose this directory because I knew it had a lot of LaTeX files, but I was surprised at the results. Here were the top 10 results:

  1. \Omega
  2. \partial
  3. \bf
  4. \in
  5. \mu
  6. \real
  7. \int
  8. \item
  9. \alpha
  10. \end

I was very surprised that the top command was \Omega. I expected maybe the integral command \int would come out on top.

The notes contain a lot of integrals, but these integrals were often over a domain Ω. The set inclusion command \in also appears frequently, probably in the context of saying x ∈ Ω.

The \partial came up frequently because I used it in two contexts. First, I was studying partial differential equations, so I used the symbol for partial derivative a lot. Second, I used ∂ to denote the boundary of a domain, as in ∂Ω.

You might notice that \end made the list above but \begin didn’t. Sounds like an error if LaTeX files have more \end statements than \begin statements. The reason is that I used to have an include file that had lots of macros and ended with \begin{document}. That saved a few keystrokes, but now I think such asymmetry is bad form. In the search described below, there are exactly the same number of \begin and \end statements.

Client files

When I looked at the command frequencies in a directory containing some client work, I got very different frequencies. Here were the top commands in that directory.

  1. \hline
  2. \item
  3. \end
  4. \begin
  5. \frac
  6. \xi
  7. \phi
  8. \lambda
  9. \texttt
  10. \partial

I suppose \hline is at the top because the files contained a lot of tables. It makes sense that \item, \begin, \end and \frac were near the top because those are common LaTeX commands. I don’t remember what I was working on that used the symbol ξ so much.

When I first thought about this post I thought I could get a feel for what commands are used frequently in LaTeX in general. I started with my own files because they’re at hand, but the results say more about my usage of LaTeX than about LaTeX in general.

Other collections

I imagine if you were to look at the frequency statistics for a large corpus, such as the articles submitted to a given math journal, the results would still depend somewhat on the journal: you’re going to see \int for integral more often in an analysis journal than in an algebra journal, etc.

If you run the code from the previous post on some collection of LaTeX files and get some interesting results, leave a comment describing what you found.

More LaTeX posts

Typesetting zodiac symbols in LaTeX

Typesetting zodiac symbols in LaTeX is admittedly an unusual thing to do. LaTeX is mostly used for scientific publication, and zodiac symbols are commonly associated with astrology. But occasionally zodiac symbols are used in more respectable contexts.

The wasysym package for LaTeX includes miscellaneous symbols, including zodiac symbols. Here are the symbols, their LaTeX commands, and their corresponding Unicode code points.

The only surprise here is that the command for Capricorn is based on the Latin form of the name: \capricornus.

Each zodiac sign is used to denote a 30° region of the sky. Since the Unicode symbols are consecutive, you can compute the code point of a symbol from the longitude angle θ in degrees:

9800 + \left\lfloor \frac{\theta}{30} \right\rfloor

Here 9800 is the decimal form of 0x2648, and the half brackets are the floor symbol, i.e. round down to the nearest integer.

Here’s the LaTeX code that produced the table.

\documentclass{article}
\usepackage{wasysym}
\begin{document}

\begin{table}
\begin{tabular}{lll}
\aries       & \verb|\aries       | & U+2648 \\
\taurus      & \verb|\taurus      | & U+2649 \\
\gemini      & \verb|\gemini      | & U+264A \\
\cancer      & \verb|\cancer      | & U+264B \\
\leo         & \verb|\leo         | & U+264C \\
\virgo       & \verb|\virgo       | & U+264D \\
\libra       & \verb|\libra       | & U+264E \\
\scorpio     & \verb|\scorpio     | & U+264F \\
\sagittarius & \verb|\sagittarius | & U+2650 \\
\capricornus & \verb|\capricornus | & U+2651 \\
\aquarius    & \verb|\aquarius    | & U+2652 \\
\pisces      & \verb|\pisces      | & U+2653 \\
\end{tabular}
\end{table}
\end{document}

By the way, you can use the Unicode values in HTML by replacing U+ with &#x and adding a semicolon on the end.

More LaTeX posts

Regular expressions and special characters

Special characters make text processing more complicated because you have to pay close attention to context. If you’re looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular expressions, or vice versa.

This post goes through an example in detail that shows how to manage special characters in several different contexts.

Escaping special TeX characters

I recently needed to write a regular expression [1] to escape TeX special characters. I’m reading in text like ICD9_CODE and need to make that ICD9\_CODE so that TeX will understand the underscore to be a literal underscore, and a subscript instruction.

Underscore isn’t the only special character in TeX. It has ten special characters:

    \ { } $ & # ^ _ % ~

The two that people most commonly stumble over are probably $ and % because these are fairly common in ordinary prose. Since % begins a comment in TeX, importing a percent sign without escaping it will fail silently. The result is syntactically valid. It just effectively cuts off the remainder of the line.

So whenever my script sees a TeX special character that isn’t already escaped, I’d like it to escape it.

Raw strings

First I need to tell Python what the special characters are for TeX:

    special = r"\\{}$&#^_%~"

There’s something interesting going on here. Most of the characters that are special to TeX are not special to Python. But backslash is special to both. Backslash is also special to regular expressions. The r prefix in front of the quotes tells Python this is a “raw” string and that it should not interpret backslashes as special. It’s saying “I literally want a string that begins with two backslashes.”

Why two backslashes? Wouldn’t one do? We’re about to use this string inside a regular expression, and backslashes are special there too. More on that shortly.

Lookbehind

Here’s my regular expression:

    re.sub(r"(?<!\\)([" + special + "])", r"\\\1", line)

I want special characters that have not already been escaped, so I’m using a negative lookbehind pattern. Negative lookbehind expressions begin with (?<! and end with ). So if, for example, I wanted to look for the string “ball” but only if it’s not preceded by “charity” I could use the regular expression

    (?<!charity )ball

This expression would match “foot ball” or “foosball” but not “charity ball”.

Our lookbehind expression is complicated by the fact that the thing we’re looking back for is a special character. We’re looking for a backslash, which is a special character for regular expressions [2].

After looking behind for a backslash and making sure there isn’t one, we look for our special characters. The reason we used two backslashes in defining the variable special is so the regular expression engine would see two backslashes and interpret that as one literal backslash.

Captures

The second argument to re.sub tells it what to replace its match with. We put parentheses around the character class listing TeX special characters because we want to capture it to refer to later. Captures are referred to by position, so the first capture is \1, the second is \2, etc.

We want to tell re.sub to put a backslash in front of the first capture. Since backslashes are special to the regular expression engine, we send it \\ to represent a literal backslash. When we follow this with \1 for the first capture, the result is \\\1 as above.

Testing

We can test our code above on with the following.

    line = r"a_b $200 {x} %5 x\y"

and get

    a\_b \$200 \{x\} \%5 x\\y

which would cause TeX to produce output that looks like

a_b $200 {x} %5 x\y.

Note that we used a raw string for our test case. That was only necessary for the backslash near the end of the string. Without that we could have dropped the r in front of the opening quote.

P.S. on raw strings

Note that you don’t have to use raw strings. You could just escape your special characters with backslashes. But we’ve already got a lot of backslashes here. Without raw strings we’d need even more. Without raw strings we’d have to say

    special = "\\\\{}$&#^_%~"

starting with four backslashes to send Python two to send the regular expression engine one.

Related posts

[1] Whenever I write about using regular expressions someone will complain that my solution isn’t completely general and that they can create input that will break my code. I understand that, but it works for me in my circumstances. I’m just writing scripts to get my work done, not claiming to have written hardened production software for anyone else to use.

[2] Keep context in mind. We have three languages in play: TeX, Python, and regular expressions. One of the keys to understanding regular expressions is to see them as a small language embedded inside other languages like Python. So whenever you hear a character is special, ask yourself “Special to whom?”. It’s especially confusing here because backslash is special to all three languages.

Trademark symbol, LaTeX, and Unicode

Earlier this year I was a coauthor on a paper about the Cap Score™ test for male fertility from Androvia Life Sciences [1]. I just noticed today that when I added the publication to my CV, it caused some garbled text to appear in the PDF.

garbled LaTeX output

Here is the corresponding LaTeX source code.

LaTeX source

Fixing the LaTeX problem

There were two problems: the trademark symbol and the non-printing symbol denoted by a red underscore in the source file. The trademark was a non-ASCII character (Unicode U+2122) and the underscore represented a non-printing (U+00A0). At first I only noticed the trademark symbol, and I fixed it by including a LaTeX package to allow Unicode characters:

    \usepackage[utf8x]{inputenc}

An alternative fix, one that doesn’t require including a new package, would be to replace the trademark Unicode character with \texttrademark\. Note the trailing backslash. Without the backslash there would be no space after the trademark symbol. The problem with the unprintable character would remain, but the character could just be deleted.

Trademark and Unicode

I found out there are two Unicode code points render the trademark glyph, U+0099 and U+2122. The former is in the Latin 1 Supplement section and is officially a control character. The correct code point for the trademark symbol is the latter. Unicode files U+2122 under Letterlike Symbols and gives it the official name TRADE MARK SIGN.

Related posts

[1] Jay Schinfeld, Fady Sharara, Randy Morris, Gianpiero D. Palermo, Zev Rosenwaks, Eric Seaman, Steve Hirshberg, John Cook, Cristina Cardona, G. Charles Ostermeier, and Alexander J. Travis. Cap-Score™ Prospectively Predicts Probability of Pregnancy, Molecular Reproduction and Development. To appear.

Typesetting modal logic

Modal logic extends propositional logic with two new operators, □ (“box”) and ◇ (“diamond”). There are many interpretations of these two symbols, the most common being necessity and possibility respectively. That is, □p means the proposition p is necessary, and ◇p means that p is possible. Another interpretation is using the symbols to represent things a person knows to be true and things that may be true as far as that person knows.

There are also many axiom systems for inference concerning these operators. For example, some axiom systems include the rule

\Box p \rightarrow \Box \Box p

and some do not. If you interpret □ as saying a proposition is provable, this axiom says whatever is provable is provably provable, which makes sense. But if you take □ to be a statement about what an agent knows, you may not want to say that if an agent knows something, it knows that it knows it.

See the next post for an example of applying logic to security, a logic with lots of modal operators and axioms. But for now, we’ll focus on how to typeset the box and diamond operators.

LaTeX

In LaTeX, the most obvious commands would be \box and \diamond, but that doesn’t work. There is no \box command, though there is a \square command. And although there is a \diamond command, it produces a symbol much smaller than \square and so the two look odd together. The two operators are dual in the sense that

\begin{align*} \Box p &= \neg \Diamond \neg p \\ \Diamond p &= \neg \Box \neg p \end{align*}

and so they should have symbols of similar size. A better approach is to use \Box and \Diamond. Those were used in the displayed equations above. These symbols are in the amsfonts package.

Unicode

There are many box-like and diamond-like symbols in Unicode. It seems reasonable to use U+25A1 for box and U+25C7 for diamond. I don’t know of any more semantically appropriate characters. There are no Unicode characters with “modal” in their name, for example.

HTML

You can always insert Unicode characters into HTML by using &#x, followed by the hexadecimal value of the codepoint, followed by a semicolon. For example, I typed &#x25a1; and &#x25c7; to enter the box and diamond symbols above.

If you want to stick to HTML entities because they’re easier to remember, you’re mostly out of luck. There is no HTML entity for the box operator. There is an entity &loz; for “lozenge,” the typographical term for a diamond. This HTML entity corresponds to U+25CA and is smaller than U+25C7 recommended above. As discussed in the context of LaTeX, you want the box and diamond operators to have a similar size.

Related posts

Fraktur symbols in mathematics

When mathematicians run out of symbols, they turn to other alphabets. Most math symbols are Latin or Greek letters, but occasionally you’ll run into Russian or Hebrew letters.

Sometimes math uses a new font rather than a new alphabet, such as Fraktur. This is common in Lie groups when you want to associate related symbols to a Lie group and its Lie algebra. By convention a Lie group is denoted by an ordinary Latin letter and its associated Lie algebra is denoted by the same letter in Fraktur font.

lower case alphabet in Fraktur

LaTeX

To produce Fraktur letters in LaTeX, load the amssymb package and use the command \mathfrak{}.

Symbols such as \mathfrak{A} are math symbols and can only be used in math mode. They are not intended to be a substitute for setting text in Fraktur font. This is consistent with the semantic distinction in Unicode described below.

Unicode

The Unicode standard tries to distinguish the appearance of a symbol from its semantics, though there are compromises. For example, the Greek letter Ω has Unicode code point U+03A9 but the symbol Ω for electrical resistance in Ohms is U+2621 even though they are rendered the same [1].

The letters a through z, rendered in Fraktur font and used as mathematical symbols, have Unicode values U+1D51E through U+1D537. These values are in the “Supplementary Multilingual Plane” and do not commonly have font support [2].

The corresponding letters A through Z are encoded as U+1D504 through U+1D51C, though interestingly a few letters are missing. The code point U+1D506, which you’d expect to be Fraktur C, is reserved. The spots corresponding to H, I, and R are also reserved. Presumably these are reserved because they are not commonly used as mathematical symbols. However, the corresponding bold versions U+1D56C through U+ID585 have no such gaps [3].

Footnotes

[1] At least they usually are. A font designer could choose provide different glyphs for the two symbols. I used the same character for both because some I thought some readers might not see the Ohm symbol properly rendered.

[2] If you have the necessary fonts installed you should see the alphabet in Fraktur below:
𝔞 𝔟 𝔠 𝔡 𝔢 𝔣 𝔤 𝔥 𝔦 𝔧 𝔨 𝔩 𝔪 𝔫 𝔬 𝔭 𝔮 𝔯 𝔰 𝔱 𝔲 𝔳 𝔴 𝔵 𝔶 𝔷

I can see these symbols from my desktop and from my iPhone, but not from my Android tablet. Same with the symbols below.

[3] Here are the bold upper case and lower case Fraktur letters in Unicode:
𝕬 𝕭 𝕮 𝕯 𝕰 𝕱 𝕲 𝕳 𝕴 𝕵 𝕶 𝕷 𝕸 𝕹 𝕺 𝕻 𝕼 𝕽 𝕾 𝕿 𝖀 𝖁 𝖂 𝖃 𝖄 𝖅
𝖆 𝖇 𝖈 𝖉 𝖊 𝖋 𝖌 𝖍 𝖎 𝖏 𝖐 𝖑 𝖒 𝖓 𝖔 𝖕 𝖖 𝖗 𝖘 𝖙 𝖚 𝖛 𝖜 𝖝 𝖞 𝖟

Putting a brace under something in LaTeX

Here’s a useful LaTeX command that I learned about recently: \underbrace.

It does what it sounds like it does. It puts a brace under its argument.

I used this a few days ago in the post on the new prime record when I wanted to show that the record prime is written in hexadecimal as a 1 followed by a long string of Fs.

1\underbrace{\mbox{FFF \ldots FFF}}_\mbox{{\normalsize 9,308.229 F's}}

The code that produced is

1\underbrace{\mbox{FFF \ldots FFF}}_\mbox{{\normalsize 9,308.229 F's}}

The sizing is a little confusing. Without \normalsize the text under the brace would be as large as the text above.

Why don’t you simply use XeTeX?

From an FAQ post I wrote a few years ago:

This may seem like an odd question, but it’s actually one I get very often. On my TeXtip twitter account, I include tips on how to create non-English characters such as using \AA to produce Å. Every time someone will ask “Why not use XeTeX and just enter these characters?”

If you can “just enter” non-English characters, then you don’t need a tip. But a lot of people either don’t know how to do this or don’t have a convenient way to do so. Most English speakers only need to type foreign characters occasionally, and will find it easier, for example, to type \AA or \ss than to learn how to produce Å or ß from a keyboard. If you frequently need to enter Unicode characters, and know how to do so, then XeTeX is great.

One does not simply type Unicode characters.

More Unicode posts