HTML entity data

It’s surprisingly hard to find a complete list of HTML entities in the form of a data file. There are numerous sites that give lists, often incomplete, in a page formatted to be human-readable but not machine-readable.

Here’s an XML file from the W3C.

Here’s a two-column text file I created from the W3C data.

Double-struck capital letters

I’ve needed to use double-struck capital letters lately, also called blackboard bold. There are a few quirks in how they are represented in Unicode and in HTML entities, so I’m leaving some notes for myself here and for anyone else who might need to look this up.

Unicode

The double-struck capital letters are split into two blocks for historical reasons. The double-struck capital letters used most often in math — ℂ, ℍ, ℕ, ℙ, ℚ, ℝ, ℤ — are located in the U+21XX range, while the rest are in the U+1D5XX range.

Low characters

The letters in the U+21XX range were the first to be included in the Unicode standard. There’s no significance to the code points.

ℂ U+2102
ℍ U+210D 
ℕ U+2115
ℙ U+2119 
ℚ U+211A
ℝ U+211D
ℤ U+2124 

The names, however are consistent. The official name of ℂ is

DOUBLE-STRUCK CAPITAL C

and the rest are analogous.

High characters

The code point for double-struck capital A is U+1D538 and the rest are in alphabetical order: the nth letter of the English alphabet has code point

0x1D537 + n.

However, the codes corresponding to the letters in the low range are missing. For example, U+1D53A, the code that would logically be the code for ℂ, is unused so as not to duplicate the codes in the low range.

The official names of these characters are

MATHEMATICAL DOUBLE-STRUCK CAPITAL C

and so forth. Note the “MATHEMTICAL” prefix that the letters in the low range do not have.

Incidentally, X is the 24th letter, so the new logo for Twitter has code point U+1D54F.

HTML Entities

All double-struck letters have HTML shortcuts of the form *opf; where * is the letter, whether capital or lower case. For example, is 𝔸 and is 𝕒.

The letters with lower Unicode values also have semantic entities as well.

ℂ ℂ ℂ
ℍ ℍ ℍ
ℕ ℕ ℕ
ℙ ℙ ℙ
ℚ ℚ ℚ
ℝ ℝ ℝ
ℤ ℤ ℤ

LaTeX

The LaTeX command for any double-struck capital letter is \mathbb{}. This only applies to capital letters.

Python

Here’s Python code to illustrate the discussion of Unicode values above.

def doublestrike(ch):

    exceptions = {
        'C' : 0x2102,
        'H' : 0x210D,
        'N' : 0x2115,
        'P' : 0x2119,
        'Q' : 0x211A,
        'R' : 0x211D,
        'Z' : 0x2124
    }
    if ch in exceptions:
        codepoint = exceptions[ch]
    else:
        codepoint = 0x1D538 + ord(ch) - ord('A')
    print(chr(codepoint), f"U+{format(codepoint, 'X')}")

for n in range(ord('A'), ord('Z')+1):
    doublestrike(chr(n))

Russian transliteration hack

I mentioned in the previous post that I had been poking around in HTML entities and noticed symbols for Fourier transforms and such. I also noticed HTML entities for Cyrillic letters. These entities have the form

& + transliteration + cy;.

For example, the Cyrillic letter П is based on the Greek letter Π and its closest English counterpart is P, and its HTML entity is П.

The Cyrillic letter Р has HTML entity &Rpcy; and not П because although it looks like an English P, it sounds more like an English R.

Just as a hack, I decided to write code to transliterate Russian text by converting letters to their HTML entities, then chopping off the initial & and the final cy;.

I don’t speak Russian, but according to Google Translate, the Russian translation of “Hello world” is “Привет, мир.”

Here’s my hello-world program for transliterating Russian.

    from bs4.dammit import EntitySubstitution

    def transliterate(ch):
        entity = escaper.substitute_html(ch)[1:]
        return entity[:-3]
    
    a = [transliterate(c) for c in "Привет, мир."]
    print(" ".join(a))

This prints

P r i v ie t m i r

Here’s what I get trying to transliterate Chebyshev’s native name Пафну́тий Льво́вич Чебышёв.

P a f n u t i j L soft v o v i ch CH ie b y sh io v

I put a space between letters because of possible outputs like “soft v” above.

This was just a fun hack. Here’s what I’d get if I used software intended to be used for transliteration.

    import unidecode

    for x in ["Привет, мир", "Пафну́тий Льво́вич Чебышёв"]:
        print(unidecode.unidecode(x))

This produces

Privet, mir
Pafnutii L’vovich Chebyshiov

The results are similar.

Related posts

Number slang and numbered lists

Here’s a list of five numbers used as slang in various contexts.

  1. Location (CB and police radio)
  2. End of column (journalism)
  3. Best wishes (ham radio)
  4. All aircraft in area (US Navy)
  5. I love you (text messages)

The motivation for this post was an article Those HTML attributes you never use. I wanted to make a note of how to change the numbering of an ordered list in HTML, and this post is that note.

Related posts

Look up HTML entity or Unicode

Fairly often I want to find out whether there is an HTML entity for a given Unicode character, or given an HTML entity I want to look up its Unicode value.

For example, ∃ has Unicode value U+2203. I might want to look up whether there is an HTML entity for this. There is, and it’s ∃.

Another example is ○ (U+25CB). There is no HTML entity for this character.

I imagine a few other people need to do this occasionally, so I make a web page to make it easier.

The page is very similar to my page for converting back and forth between Unicode and LaTeX commands.

I’ve written a few other online tools. Here’s a full list.

Update: If you enter a single character, the page will look up the Unicode value of the input. For example, entering ∀ is the same as entering U+2200.

Fractions in Unicode

There are Unicode characters for a few fractions, such as ½. This looks a little better than 1/2, depending on the context.

Here’s the Taylor series for log(1 + x) written in pure HTML:

log(1 + x) = x – ½x² + ⅓x³ – ¼x⁴ + ⅕x⁵ – ⋯

See this post for how the exponents were made.

Notice that the three dots ⋯ on the end are centered vertically, like \cdots in LaTeX. This was done with ⋯ (U+22EF).

Available fractions

The selection of available fraction number forms is small and a little strange.

There are characters for fractions with denominator d equal to 2, 3, 4, 5, 6, and 8, with numerators 1 through d-1, except for fractions that can be reduced.

If d = 7, 9, or 10, there’s a character for 1/d but not for fractions with numerators other than 1. For example, there is a character for ⅐ but not for 2/7.

HTML Entities

For denominators 2, 3, 4, 5, 6, and 8 the HTML entity for characters is easy: they all have the form

& frac <n> <d> ;

where n is the numerator and d is the denominator. For example, &frac35; is the HTML entity for ⅗.

There are no HTML entities for 1/7, 1/9, or 1/10.

Related posts

Number sets in HTML and Unicode

When I started blogging I was very cautious about what characters I used because browsers often didn’t have font support for uncommon characters. Things have changed since then and I’ve gotten less cautious. Nobody has complained, so I assume readers are seeing the characters I intend them to see.

There are Unicode characters for sets of numbers such as the integers and the real numbers, double-struck letters similar to the blackboard bold letters \mathbb{Z} etc. in LaTeX.

\mathbb{N} \quad \mathbb{Z} \quad \mathbb{Q} \quad \mathbb{R} \quad \mathbb{C} \quad \mathbb{H}

Here’s a table of the characters, their Unicode values, and two HTML entities associated with each.

    ℕ U+2115 &Nopf; &naturals;
    ℤ U+2124 &Zopf; &integers;
    ℚ U+211A &Qopf; &rationals;
    ℝ U+211D &Ropf; &reals;
    ℂ U+2102 &Copf; &complexes;
    ℍ U+210D &Hopf; &quaternions;

If you’re going to use these symbols, you will likely also need to use ∈ (U+2208, &in;) and ∉ (U+2209, &notin;).

More letters

If you want more letters in the style of those above, you can find them starting at U+1D538 for . However, the characters corresponding to letters above are reserved.

So for example, is U+1D538, is U+1D539, but U+1D53A is reserved and you must use ℂ (U+2102) instead.

One letter not mentioned above is ℙ (U+2119). It has HTML entities &Popf; and &primes;.

So the double-struck versions of C, H, N, P, Q, R, and Z are down in the BMP (Basic Multilingual Plane) and the rest are in the SMP (Supplementary Multilingual Plane). I suspect characters in the SMP are less likely to have font support, but that may not be a problem.

Unicode superscripts and subscripts

There are alternatives to using <sup> and <sub> tags for superscripts and subscripts in HTML. These alternatives may look better, depending on context, and they can be used in plain (Unicode) text files where HTML markup isn’t available.

Superscripts

When I started blogging I would use <sup>2</sup> and <sup>3</sup> for squares and cubes. Then somewhere along the way someone told me about &sup2; (U+00B2) and &sup3; (U+00B3) and I started using these. The superscript characters generally produce slightly smaller subscripts and look nicer in my opinion.

Example with sup tags:

a2 + b2 = c2

Example with superscript characters:

a² + b² = c²

But there are no characters for exponents larger than 3. Or so I thought until recently.

There are no HTML entities for exponents larger than 3, nothing written in notation analogous to &sup2; and &sup3;. There are Unicode characters for other superscripts up to 9, but they don’t have corresponding HTML entities.

The Unicode code point for superscript n is

2070hex + n

except for n = 2 or 3. For example, U+2075 is a superscript 5. So you could write x⁵ as

<em>x</em>&#x2075;.

Subscripts

There are also Unicode symbols for subscripts, though they don’t have corresponding HTML entities. The Unicode code point for superscript n = 0, 1, 2, … 9 is

2080hex + n

For example, U+2087 is a subscript 7.

The subscript characters don’t extend below the baseline as far as subscripts in <sub> tags do. Here are x‘s with subsubcripts in <sub> tags.

x0, x1, x2, x3, x4, x5, x6, x7, x8, x9

And here are the single character subscripts.

x₀, x₁, x₂, x₃, x₄, x₅, x₆, x₇, x₈, x

I think the former looks better, but subscripts in HTML paragraphs may increase the vertical spacing between lines. If consistent line spacing is more important than conventional subscripts, you might prefer the subscript characters.

Related posts