Five lemma, ASCII art, and Unicode

A few days ago I wrote about creating ASCII art in Emacs using ditaa. Out of curiosity, I wanted to try making the Five Lemma diagram. [1]

The examples on the ditaa site all have arrows between boxes, but you don’t have to have boxes.

Here’s the ditaa source:

A₀ ---> A₁ ---> A₂ ---> A₃ ---> A₄
|       |       |       |       |
| f₀    | f₁    | f₂    | f₃    | f₄
|       |       |       |       |
v       v       v       v       v
B₀ ---> B₁ ---> B₂ ---> B₃ ---> B₄


and here’s the image it produces:

It’s not pretty. You could make a nicer image with LaTeX. But as the old saying goes, the remarkable thing about a dancing bear is not that it dances well but that it dances at all.

The trick to getting the subscripts is to use Unicode characters 0x208n for subscript n. As I noted at the bottom of this post, ditaa isn’t strictly limited to ASCII art. You can use Unicode characters as well. You may or may not be able to see the subscripts in the source code because they are not part of the most widely supported set of characters.
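As a small illustration of the 0x208n rule, here is a Python sketch that builds the top row of the diagram. The function name `subscript` is my own for this example, and the sketch only handles non-negative integers.

```python
def subscript(n: int) -> str:
    """Return the decimal digits of a non-negative integer as Unicode subscripts."""
    if n < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    # U+2080 is SUBSCRIPT ZERO; the digits 0-9 occupy U+2080 through U+2089.
    return "".join(chr(0x2080 + int(d)) for d in str(n))


# Top row of the Five Lemma diagram, in the same layout as the ditaa source.
top_row = " ---> ".join("A" + subscript(i) for i in range(5))
print(top_row)
```

The same call with "B" in place of "A" gives the bottom row.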

* * *

[1]  The Five Lemma is a diagram-chasing result from homological algebra. It lets you infer properties of the middle function f from properties of the other f’s.

Graphemes

Here’s something amusing I ran across in the glossary of Programming Perl:

grapheme A graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. Grapheme, or more fully, a grapheme cluster string is a single user-visible character, which in turn may be several characters (codepoints) long. For example … a “ȫ” is a single grapheme but one, two, or even three characters, depending on normalization.

In case the character ȫ doesn’t display correctly for you, here it is:

First, graphene has little to do with grapheme, but it’s geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word graphene comes from graphite, the “lead” in pencils. The origin of grapheme has nothing to do with graphene but was an analogy to phoneme.)

Second, the example shows how complicated the details of Unicode can get. The Perl code below expands on the details of the comment about ways to represent ȫ.

This demonstrates that the character . in regular expressions matches any single character, but \X matches any single grapheme. (Well, almost. The character . usually matches any character except a newline, though this can be modified via optional switches. But \X matches any grapheme including newline characters.)


use feature 'say';
binmode STDOUT, ':encoding(UTF-8)'; # print non-ASCII output cleanly

# U+022B, o with diaeresis and macron
my $a = "\x{22B}";

# U+00F6 U+0304, (o with diaeresis) + macron
my $b = "\x{F6}\x{304}";

# o U+0308 U+0304, o + diaeresis + macron
my $c = "o\x{308}\x{304}";

my @versions = ($a, $b, $c);

# All versions display the same.
say @versions;

# The versions have length 1, 2, and 3.
# Only $a contains one character and so matches .
say map {length $_ if /^.$/} @versions;

# All versions consist of one grapheme.
say map {length $_ if /^\X$/} @versions;

For daily tips on regular expressions, follow @RegexTip on Twitter.

Which Unicode characters can you depend on?

Unicode is supported everywhere, but font support for Unicode characters is sparse. When you use any slightly uncommon character, you have no guarantee someone else will be able to see it.

I’m starting a Twitter account @MusicTheoryTip and so I wanted to know whether I could count on followers seeing music symbols. I asked whether people could see ♭ (flat, U+266D), ♮ (natural, U+266E), and ♯ (sharp, U+266F). Most people could see all three symbols, from desktop or phone, browser or Twitter app. However, several were unable to see the natural sign from an Android phone, whether using a browser or a Twitter app. One person said none of the symbols show up on his Blackberry.

I also asked @diff_eq followers whether they could see the math symbols ∂ (partial, U+2202), Δ (Delta, U+0394), and ∇ (gradient, U+2207). One person said he couldn’t see the gradient symbol, but the rest of the feedback was positive.

So what characters can you count on nearly everyone being able to see? To answer this question, I looked at the characters in the intersection of several common fonts: Verdana, Georgia, Times New Roman, Arial, Courier New, and Droid Sans. My thought was that this would make a very conservative set of characters.

There are 585 characters supported by all the fonts listed above. Most of the characters with code points up to U+01FF are included. This range includes the code blocks for Basic Latin, Latin-1 Supplement, Latin Extended-A, and some of Latin Extended-B. The rest of the characters in the intersection are Greek and Cyrillic letters and a few scattered symbols. Flat, natural, sharp, and gradient didn’t make the cut.

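The intersection idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the font names and coverage sets below are toy data I made up, not real font coverage, and `common_code_points` and `to_ranges` are hypothetical helper names, not part of any library. It is not necessarily how the list in this post was produced. Real coverage would have to be read from the font files themselves.

```python
from typing import Dict, List, Set, Tuple


def common_code_points(coverage: Dict[str, Set[int]]) -> List[int]:
    """Return the sorted code points present in every font's coverage set."""
    if not coverage:
        return []
    return sorted(set.intersection(*coverage.values()))


def to_ranges(code_points: List[int]) -> List[Tuple[int, int]]:
    """Collapse a sorted list of code points into inclusive (start, end) runs."""
    ranges: List[Tuple[int, int]] = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            # cp extends the current run by one
            ranges[-1] = (ranges[-1][0], cp)
        else:
            # cp starts a new run
            ranges.append((cp, cp))
    return ranges


# Toy data, NOT real font coverage: three made-up fonts.
toy_coverage = {
    "FontA": {0x20, 0x21, 0x22, 0x2202, 0x266D, 0x266E, 0x266F},
    "FontB": {0x20, 0x21, 0x22, 0x2202, 0x2207},
    "FontC": {0x20, 0x21, 0x22, 0x2202, 0x266D},
}

for start, end in to_ranges(common_code_points(toy_coverage)):
    if start == end:
        print(f"0x{start:04x}")
    else:
        print(f"0x{start:04x} - 0x{end:04x}")
```

With the toy data, only the three ASCII characters and ∂ survive the intersection. The music symbols and the gradient drop out because at least one of the made-up fonts lacks them.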
There are a dozen math symbols included:

0x2202 ∂
0x2206 ∆
0x220F ∏
0x2211 ∑
0x2212 −
0x221A √
0x221E ∞
0x222B ∫
0x2248 ≈
0x2260 ≠
0x2264 ≤
0x2265 ≥

Interestingly, even in such a conservative set of characters, there are three characters included for semantic distinction: the minus sign (i.e. not a hyphen), the difference operator (i.e. not the Greek letter Delta), and the summation operator (i.e. not the Greek letter Sigma).

And in case you’re interested, here’s the complete list of the Unicode characters in the intersection of the fonts listed here. (Update: Added notes to indicate the start of a new code block and listed some of the isolated characters.)

0x0009            Basic Latin
0x000d
0x0020 - 0x007e
0x00a0 - 0x017f   Latin-1 supplement
0x0192
0x01fa - 0x01ff
0x0218 - 0x0219
0x02c6 - 0x02c7
0x02c9
0x02d8 - 0x02dd
0x0300 - 0x0301
0x0384 - 0x038a   Greek and Coptic
0x038c
0x038e - 0x03a1
0x03a3 - 0x03ce
0x0401 - 0x040c   Cyrillic
0x040e - 0x044f
0x0451 - 0x045c
0x045e - 0x045f
0x0490 - 0x0491
0x1e80 - 0x1e85   Latin extended additional
0x1ef2 - 0x1ef3
0x200c - 0x200f   General punctuation
0x2013 - 0x2015
0x2017 - 0x201e
0x2020 - 0x2022
0x2026
0x2028 - 0x202e
0x2030
0x2032 - 0x2033
0x2039 - 0x203a
0x203c
0x2044
0x206a - 0x206f
0x207f
0x20a3 - 0x20a4   Currency symbols ₣ ₤
0x20a7            ₧
0x20ac            €
0x2105            Letterlike symbols ℅
0x2116            №
0x2122            ™
0x2126            Ω
0x212e            ℮
0x215b - 0x215e   ⅛ ⅜ ⅝ ⅞
0x2202            Mathematical operators ∂
0x2206            ∆
0x220f            ∏
0x2211 - 0x2212   ∑ −
0x221a            √
0x221e            ∞
0x222b            ∫
0x2248            ≈
0x2260            ≠
0x2264 - 0x2265   ≤ ≥
0x25ca            Geometric shapes ◊
0xfb01 - 0xfb02   Alphabetic presentation forms ﬁ ﬂ

A web built on LaTeX

The other day on TeXtip, I threw this out:

Imagine if the web had been built on LaTeX instead of HTML …

Here are some of the responses I got:

• It would have been more pretty looking.
• Frightening.
• Single tear down the cheek.
• No crap amateurish content because of the steep learning curve, and beautiful rendering … What a dream!
• Shiny math, crappy picture placement: glad it did not!
• Overfull hboxes EVERYWHERE.
• LaTeX would have become bloated, and people would be tweeting about HTML being so much better.
• Noooo! LaTeX would have been “standardised”, “extended” and would by now be a useless pile of complexity.

* * *

For daily tips on LaTeX and typography, follow @TeXtip on Twitter.

Use typewriter font for code inside prose

There’s a useful tradition of using a typewriter font, or more generally some monospaced font, for bits of code sprinkled in prose. The practice is analogous to using italic to mark, for example, a French mot dropped into an English paragraph. In HTML, the code tag marks content as software code, which a browser typically will render in a typewriter font.

Here’s a sentence from a new article on Python at Netflix that could benefit from a few code tags.

These features (and more) have led to increasingly pervasive use of Python in everything from small tools using boto to talk to AWS, to storing information with python-memcached and pycassa, managing processes with Envoy, polling restful APIs to large applications with requests, providing web interfaces with CherryPy and Bottle, and crunching data with scipy.

Here’s the same sentence with some code tags.

These features (and more) have led to increasingly pervasive use of Python in everything from small tools using boto to talk to AWS, to storing information with python-memcached and pycassa, managing processes with Envoy, polling restful APIs to large applications with requests, providing web interfaces with CherryPy and Bottle, and crunching data with scipy.

It’s especially helpful to let the reader know that packages like requests are indeed packages. It helps to clarify, for example, whether Wes McKinney has been stress testing pandas or pandas. That way we know whether to inform animal protection authorities or to download a new version of a library.

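If you have a list of package names, the tagging can be done mechanically. The following Python sketch is one way to do it, using only the standard library. The function name `tag_code` and the choice of which names to tag are my assumptions for illustration.

```python
import html
import re
from typing import Iterable


def tag_code(text: str, names: Iterable[str]) -> str:
    """Escape text for HTML and wrap each whole-word occurrence of a name in <code>."""
    escaped = html.escape(text)
    # Try longer names first so a long name wins over any shorter overlapping one.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(html.escape(n)) for n in ordered)
    if not alternatives:
        return escaped
    # The lookarounds keep a name from matching inside a longer word or hyphenated name.
    pattern = re.compile(r"(?<![\w-])(" + alternatives + r")(?![\w-])")
    return pattern.sub(r"<code>\1</code>", escaped)


sentence = "storing information with python-memcached and pycassa, and crunching data with scipy."
print(tag_code(sentence, ["python-memcached", "pycassa", "scipy"]))
```

A script like this tags every occurrence of a name, so it cannot tell the library pandas from the animals. For a word like requests, which is also ordinary English, the writer still has to decide case by case.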
Unicode to LaTeX

I’ve run across a couple web sites that let you enter a LaTeX symbol and get back its Unicode value. But I didn’t find a site that does the reverse, going from Unicode to LaTeX, so I wrote my own.

Unicode / LaTeX Conversion

If you enter Unicode, it will return LaTeX. If you enter LaTeX, it will return Unicode. It interprets a string starting with “U+” as a Unicode code point, and a string starting with a backslash as a LaTeX command.

For example, the screenshot above shows what happens if you enter U+221E and click “convert.” You could also enter \infty and get back U+221E.

However, if you go from Unicode to LaTeX to Unicode, you won’t always end up where you started. There may be multiple Unicode values that map to a single LaTeX symbol. This is because Unicode is semantic and LaTeX is not. For example, Unicode distinguishes between the Greek letter Ω and the symbol Ω for ohms, the unit of electrical resistance, but LaTeX does not.

Letters that fell out of the alphabet

Mental Floss had an interesting article called 12 letters that didn’t make the alphabet. A more accurate title might be 12 letters that fell out of the modern English alphabet.

I thought it would have been better if the article had included the Unicode values of the letters, so I did a little research and created the following table.

Name                Capital   Small
Thorn               U+00DE    U+00FE
Wynn                U+01F7    U+01BF
Yogh                U+021C    U+021D
Ash                 U+00C6    U+00E6
Eth                 U+00D0    U+00F0
Ampersand           U+0026
Insular g           U+A77D    U+1D79
Thorn with stroke   U+A764    U+A765
Ethel               U+0152    U+0153
Tironian ond        U+204A
Long s                        U+017F
Eng                 U+014A    U+014B

Once you know the Unicode code point for a symbol, you can look it up to find out more about it.

Related posts: Entering Unicode characters in Windows and Linux.

To enter a Unicode character in Emacs, you can type C-x 8 <return>, then enter the value.

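One quick way to find out more about a code point is to ask a Unicode database for its official name. Here is a Python sketch using the standard `unicodedata` module on a few of the letters from the table. The `describe` helper is a name I made up for this example, and the names printed depend on the Unicode version bundled with your Python.

```python
import unicodedata

# (name used in the table, capital code point or None, small code point or None)
letters = [
    ("Thorn", 0x00DE, 0x00FE),
    ("Ash", 0x00C6, 0x00E6),
    ("Eth", 0x00D0, 0x00F0),
    ("Ethel", 0x0152, 0x0153),
    ("Long s", None, 0x017F),
    ("Eng", 0x014A, 0x014B),
]


def describe(code_point: int) -> str:
    """Format one code point as 'U+XXXX <character> <official Unicode name>'."""
    ch = chr(code_point)
    # The fallback covers code points missing from this Python's Unicode database.
    name = unicodedata.name(ch, "<name not in this Unicode database>")
    return f"U+{code_point:04X} {ch} {name}"


for label, capital, small in letters:
    for cp in (capital, small):
        if cp is not None:
            print(f"{label:8} {describe(cp)}")
```

Whether the characters themselves display is, of course, a matter of font support.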
Automatic delimiter sizes in LaTeX

I recently read a math book in which delimiters never adjusted to the size of their content or the level of nesting. This isn’t unusual in articles, but books usually pay more attention to typography. Here’s a part of an equation from the book:

Larger outer parentheses make the equation much easier to read, especially as part of a complex equation.

It’s clear at a glance that the function φ⁻¹ applies to the result of the integral. The first equation was typeset using

\varphi^{-1} ( \int \varphi(f+g) \,d\mu )

The latter used \left and \right to tell LaTeX that the parentheses should grow to match the size of the content between them.

\varphi^{-1} \left( \int \varphi(f+g) \,d\mu \right)

You can use \left and \right with more delimiters than just parentheses: braces, brackets, ceiling, floor, etc. And the left and right delimiters do not need to match. You could make a half-open interval, for example, with \left( on one side and \right] on the other.

For every \left delimiter there must be a corresponding \right delimiter. However, you can make one of the pair empty by using a period as its mate. For example, you could start an expression with \left[ and end it with \right. which would create a left bracket as tall as the tallest thing between that bracket and the corresponding \right. command. Note that \right. causes nothing to be displayed, not even a period.

The most common example of a delimiter with no mate may be a curly brace on the left with no matching brace on the right. In that case you’d need to open with \left\{. The backslash in front of the brace is necessary to tell LaTeX that you want a literal brace and that you’re not just using the brace for grouping.

The paper is too big

In response to the question “Why are default LaTeX margins so big?” Paul Stanley answers

It’s not that the margins are too wide. It’s that the paper is too big!

This sounds flippant, but he gives a compelling argument that paper really is too big for how it is now used.

As is surely by now well-known, the real question is the size of the text block. That is a really important factor in legibility. As others have noted, the optimum line length is broadly somewhere between 60 characters and 75 characters. Given reasonable sizes of font which are comfortable for reading at the distance we want to read at (roughly 9 to 12 point), there are only so many line lengths that make sense. If you take a book off your shelf, especially a book that you would actually read for a prolonged period of time, and compare it to a LaTeX document in one of the standard classes, you’ll probably notice that the line length is pretty similar.

The real problem is with paper size. As it happens, we have ended up with paper sizes that were never designed or adapted for printing with 10-12 point proportionally spaced type. They were designed for handwriting (which is usually much bigger) or for typewriters. Typewriters produced 10 or 12 characters per inch: so on (say) 8.5 inch wide paper, with 1 inch margins, you had 6.5 inches of type, giving … around 65 to 78 characters: in other words something pretty close to ideal. But if you type in a standard proportionally spaced font (worse, in Times, which is rather condensed because it was designed to be used in narrow columns) at 12 point, you will get about 90 to 100 characters in the line.

He then gives six suggestions for what to do about this. You can see his answer for a full explanation. Here I’ll just summarize his points.

1. Use smaller paper.
2. Use long lines of text but extra space between lines.
3. Use wide margins.
4. Use margins for notes and illustrations.
5. Use a two column format.
6. Use large type.

Given these options, wide margins (as in #3 and #4) sound reasonable.

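The arithmetic in the quoted answer is easy to check. Here is a small Python sketch. The function names are mine, and the points-per-inch constant is an assumption: I use 72 points per inch, whereas TeX’s own pt is 72.27 per inch, which changes the results only slightly.

```python
POINTS_PER_INCH = 72  # assumption: DTP points; TeX's pt is 72.27 per inch


def typewriter_line_length(paper_width_in: float, margin_in: float, chars_per_inch: int) -> int:
    """Characters per line for monospaced type: text width times pitch."""
    text_width_in = paper_width_in - 2 * margin_in
    return round(text_width_in * chars_per_inch)


def implied_char_width_pt(text_width_in: float, chars_per_line: int) -> float:
    """Average character width in points implied by fitting chars_per_line on a line."""
    return text_width_in * POINTS_PER_INCH / chars_per_line


# Typewriter pitches of 10 and 12 characters per inch on 8.5 inch paper, 1 inch margins.
for pitch in (10, 12):
    print(pitch, "cpi:", typewriter_line_length(8.5, 1.0, pitch), "characters per line")

# The same 6.5 inch line holding 90 to 100 proportionally spaced characters.
for chars in (90, 100):
    print(chars, "characters:", f"{implied_char_width_pt(6.5, chars):.2f}", "pt average width")
```

The first loop reproduces the 65 and 78 characters per line in the quote. The second shows that 90 to 100 characters on the same line corresponds to an average character width of roughly 4.7 to 5.2 points.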
Gutenberg + Readability

Here’s a very simple idea: Use Project Gutenberg for content and Readability for style.

Project Gutenberg has a large collection of public domain books in digital form. The books are available in several formats, none of which are ideal for reading. Project Gutenberg provides text without much styling in order to make it easier for people to use the content as they please.

You can go to the HTML version of a book on Gutenberg and use Readability (or Instapaper) to format it for easier reading. Importing the HTML page to a Kindle similarly improves the formatting.

* * *

Has anyone made a style sheet to approximate the look of Readability or Instapaper? I’d like to use something like that to improve the appearance of the static HTML pages on my site.

Readability

The Readability bookmarklet lets you reformat any web page to make it easier to read. It strips out flashing ads and other distractions. It uses black text on a white background, wide margins, a moderate-sized font, etc. I use Readability fairly often. (Instapaper is a similar service. I discuss it at the end of this post.)

Yesterday I used it to reformat an article on literate programming. For some inexplicable reason, the author chose to use a lemon yellow background. It’s ironic that the article is about making source code easier to read. The content of the article is easy to read, but the format is not.

Readability to the rescue! Here are before and after screen shots.

Before:

After:

I recommend the article, Example of Literate Programming in HTML, and I also recommend reformatting the page unless you enjoy reading black text on a yellow background.

Readability did a good job until about half way through the article. The article has C and HTML code examples, and perhaps these confused Readability. (Readability usually handles code samples well. It correctly formats the first few code samples in this article.)
The last half of the article renders like source code, and the font gets smaller and smaller.

I ran the page through an HTML validator to see whether some malformed HTML could be the source of the problem. The validator found numerous problems, so perhaps that was the issue.

I haven’t seen Readability fail like this before. I’ve been surprised how well it has handled some pages I thought might trip it up.

I ended up saving the article and editing its source, changing the bgcolor value to white. It’s a nice article on literate programming once you get past the formatting. The best part of the article is the first section, and that much Readability formats correctly.

Instapaper

Instapaper reformats web pages similarly. It produces a narrower column of text, but otherwise the output looks quite similar.

Instapaper did not discover the title of the literate programming article. (The title of the article was not in an <h1> tag as software might expect but was only in a <title> tag in the page header.) However, it did format the entire body of the article correctly.

I find it slightly more convenient to use the Readability bookmarklet than to submit a link to Instapaper. I imagine there are browser plug-ins that make Instapaper just as easy to use, though I haven’t looked into this because I’m usually satisfied with Readability.

Draw a symbol, look it up

LaTeX users may know about Detexify, a web site that lets you draw a character then looks up its TeX command. Now there’s a new site Shapecatcher that does the same thing for Unicode. According to the site, “Currently, there are 10,007 Unicode character glyphs in the database.” It does not yet support Chinese, Japanese, or Korean.

For example, I drew a treble clef on the page:

The site came back with a list of possible matches, and the first one was what I was hoping for:

Interestingly, the sixth possible match on the list was a symbol for contour integration:

Notice the treble clef response has a funny little box on the right side. That’s because my browser did not have a glyph to display that Unicode character. The browser did have a glyph for the contour integration symbol and displayed it.

Another Unicode resource I recommend is this Unicode Codepoint Chart. It is organized by code point value, in blocks of 256. If you were looking for the contour integration symbol above, for example, you could click on a link “U+2200 to U+22FF: Mathematical Operators” and see a grid of 256 symbols and click on the one you’re looking for. This site gives more detail about each character than does Shapecatcher. So you might use Shapecatcher to find where to start looking, then go to the Unicode Codepoint Chart to find related symbols or more details.

Typesetting “C#” in LaTeX

How do you refer to the C# programming language in LaTeX? Simply typing C# doesn’t work because # is a special character in LaTeX. You could type C\#. That works, but it looks a little odd. The number sign is too big and too low.

What about using a musical sharp sign, i.e. C$\sharp$? That also looks a little odd. Even though the language is pronounced “C sharp,” it’s usually written with a number sign, not a sharp.

Let’s look at recommended ways of typesetting C++ to see whether that helps. The top answer to this question on TeX Stack Exchange is to define a new command as follows:

\newcommand{\CC}{C\nolinebreak\hspace{-.05em}\raisebox{.4ex}{\tiny\bf +}\nolinebreak\hspace{-.10em}\raisebox{.4ex}{\tiny\bf +}}

This does several things. First, it prevents line breaks between the constituent characters.
It also does several things to the plus signs:

• Draws them in closer
• Makes them smaller
• Raises them
• Makes them bold

The result is what we’re subconsciously accustomed to seeing in print. Here’s an analogous command for C#.

\newcommand{\CS}{C\nolinebreak\hspace{-.05em}\raisebox{.6ex}{\tiny\bf \#}}

And here’s the output.

The number sign is a little too small. To make the number sign a little larger, replace \tiny with \scriptsize.

Typesetting chemistry in LaTeX

Yesterday I gave the following tip on TeXtip:

Set chemical formulas with math Roman. Example: sulfate is $\mathrm{SO_4^{2-}}$

TorbjoernT and scmbradley let me know there’s a better way: use Martin Hensel’s package mhchem. The package is simpler to use and it correctly handles subtle typographical details.

Using the mhchem package, sulfate would be written \ce{SO4^2-}. In addition to chemical compounds, mhchem has support for bonds, arrows, and related chemical notation.

Example:

Source:

\documentclass{article}
\usepackage[version=3]{mhchem}
\parskip=0.1in
\begin{document}

\ce{SO4^2-}

\ce{^{227}_{90}Th+}

\ce{A\bond{-}B\bond{=}C\bond{#}D}

\ce{CO2 + C -> 2CO}

\ce{SO4^2- + Ba^2+ -> BaSO4 v}

\end{document}

For more information, see the mhchem package documentation.


Complexity of HTML and LaTeX

Sometime around 1994, my office mate introduced me to HTML by saying it was 10 times simpler than LaTeX. At the time I thought he was right. Now I’m not so sure. Maybe he was right in 1994 when the expectations for HTML were very low.

It is easier to bang out a simple, ugly HTML page than to write your first LaTeX document. When you compare the time required to make an attractive document, the effort becomes more comparable. The more sophisticated you get, the simpler LaTeX becomes by comparison.

Of course the two languages are not exactly comparable. HTML targets a web browser while LaTeX targets paper. HTML would be much simpler if people only used it to create documents to print out on their own printer. A major challenge with HTML is not knowing how someone else will use your document. You don’t know what browser they will view it with, at what resolution, etc. For that matter, you don’t know whether they’re even going to view it at all — they may use a screen reader to listen to the document.

Writing HTML is much more complicated than writing LaTeX if you take a broad view of all that is required to do it well: learning about accessibility and internationalization, keeping track of browser capabilities and market shares, adapting to evolving standards, etc. The closer you look into it, the less HTML has in common with LaTeX. The two languages are not simply two systems of markup; they address different problems.
