Visualizing correlations with graphs

Yesterday I found a statistics textbook for geologists [1] for $1 at a library book sale. When I thumbed through the book an image similar to the one below caught my eye.

This image approximates Figure 15.2 in [1],

The nodes represent six factors of the thickness of rock formations and the edges are labeled with the correlations between factors. Only large correlations are shown. For example, in theory everything is correlated with “total” but carbonates are not significantly correlated with the total. Nonclastics divide into evaporates and carbonates; apparently nearly all the nonclastics in this data set were evaporites.

Notice that this example illustrates that correlation is not transitive. That is, if A is correlated with B and B is correlated with C, it does not follow that A is necessarily correlated with C.

Making the graph

I made the graph above with GraphViz using the following code.

    graph G {
    
    layout=neato
    
    T [label="Total"      , pos="2.50, 5.00!"]
    S [label="Sand"       , pos="4.66, 3.75!"]
    C [label="Carbonates" , pos="4.66, 1.25!"]
    E [label="Evaporites" , pos="2.50, 0.00!"]
    N [label="Nonclastics", pos="0.39, 1.25!"]
    H [label="Shale"      , pos="0.39, 3.75!"]
    
    T -- S [label=" 0.24 "]
    T -- H [label=" 0.89 "]
    T -- N [label=" 0.84 "]
    T -- E [label=" 0.82 "]
    H -- N [label=" 0.69 "]
    H -- E [label=" 0.70 "]
    S -- C [label=" 0.45 "]
    N -- E [label=" 0.99 "]

    }

I’ve mostly used GraphViz to make graphs when I didn’t care much about the layout. I’ve experimented with a few layout engines, but I hadn’t tried specifying the node positions before.

The nodes in the original graph were arranged in a circle, so I tried the circo layout engine. This did not position the nodes in a circle. I also tried specifying the positions without the bang on the end, giving the positions as layout hints. GraphViz did not appreciate my suggestions and was certain that it knew better how to layout the graph. But when I added the exclamation marks GraphViz acquiesced to my wishes.

GraphViz will create output in a variety of formats. I tried PNG and SVG. The SVG image above was 11 times smaller than the PNG output. One reason I starting using SVG images more often is that they often result in smaller files. They also look very nice at multiple resolutions, i.e. on a desktop and on a mobile device.

Related posts

[1] Krumbein and Graybill. An Introduction to Statistical Models in Geology. McGraw-Hill, 1965.

Katakana, Hiragana, and Unicode

I figured out something that I wasn’t able to find by searching, so I’m posting it here in case other people have the same question and the same difficulty finding an answer.

I’m sure other people have written about this, but I couldn’t find it. Maybe lots of people have written about this in Japanese but not many in English.

Japanese kana consists of two syllabaries, hiragana and katakana, that are like phonetic alphabets. Each has 46 basic characters, and each corresponds to a block of 96 Unicode characters. I had two simple questions:

  1. How do the 46 characters map into the 90 characters?
  2. Do they map the same way for both hiragana and katakana?

Hiragana / katakana correspondence

I’ll start with the second question because it’s easier. Hiragana and katakana are different ways of representing the same sounds, and they correspond one to one. For example, the full name of U+3047 () is

HIRAGANA LETTER SMALL E

and the full name of its katakana counterpart U+30A7 () is

KATAKANA LETTER SMALL E

The only difference as far as Unicode goes is that katakana has three code points whose hiragana counterpart is unused, but these are not part of the basic letters.

The following Python code shows that the names of all the characters are the same except for the name of the system.

    from unicodedata import name

    unused = [0, 151, 152] # not in hiragana

    for i in range(0,63):
        if i in unused:
            continue
        h = name(chr(0x3040 + i)) 
        k = name(chr(0x30a0 + i))
        assert(h == k.replace("KATAKANA", "HIRAGANA"))
    print("done")

Mapping 46 into 50 and 96

You’ll see kana written in grid with one side labeled with 5 vowels and the other labeled with 10 consonants called a gojūon (五十音). That’s 50 cells, and in fact gojūon literally means 50 sounds, so how do we get 46? Five cells are empty, and one letter doesn’t fit into the grid. The empty cells are unused or archaic, and the extra character doesn’t fit the grid structure.

In the image below, the table on the left is for hiragana and the table on the right is for katakana. HTML versions of the tables available here.

Left out of each table is in hiragana and in katakana.

So does each set of 46 characters map into its Unicode code block?

Unicode numbers the letters consecutively if you traverse the grid increasing vowels first, then consonants, and adding the straggler at the end. But the reason 46 letters expand into more code points is that each letter can have one, two, or three variations. And there are various miscellaneous other symbols in the Unicode block.

For example, there is a LETTER E as well as the SMALL LETTER E mentioned above. Other variations seem to correspond to voiced and unvoiced versions of a consonant with a phonetic marker added to the voiced version. For example, く is U+304F, HIRAGANA LETTER KU, and ぐ is U+3050, HIRAGANA LETTER GU.

Here is how hiragana maps into Unicode. Each cell should be U+3000 plus the characters show.

         a  i  u  e  o 
        42 44 46 48 4A 
     k  4B 4D 4F 51 53 
     s  55 57 59 5B 5D 
     t  5F 61 64 66 68 
     n  6A 6B 6C 6D 6E 
     h  6F 72 75 78 7B 
     m  7E 7F 80 81 82 
     y  84    86    88 
     r  89 8A 8B 8C 8D 
     w  8F          92 

The corresponding table for katakana is the previous table plus 0x60:

         a  i  u  e  o 
        A2 A4 A6 A8 AA 
     k  AB AD AF B1 B3 
     s  B5 B7 B9 BB BD 
     t  BF C1 C4 C6 C8 
     n  CA CB CC CD CE 
     h  CF D2 D5 D8 DB 
     m  DE DF E0 E1 E2 
     y  E4    E6    E8 
     r  E9 EA EB EC ED 
     w  EF          F2 

In each case, the letter missing from the table is the next consecutive value after the last in the table, i.e. is U+30F3.

Related posts

Room squares and Tournaments

A Room square is a variation on a Latin square. Room squares are named after Thomas Room, though there is an application to rooms as in compartments of a building that we’ll discuss below.

In a Latin square of size n you have to assign one of n symbols to each cell so that each symbol appears exactly once in each row and column. A Room square is sort of a Latin square with pairs.

Each cell of a Room square is either empty or contains an unordered pair of symbols. Each symbol appears exactly once in each row and column, and every possible pair of symbols appears in some cell.

The following graphic shows a 7 × 7 Room square with eight symbols, each represented by a different color.

If you have trouble seeing some of the colors, take a look at the code below that made the image.

An n × n Room square corresponds to a Round Robin tournament schedule with n + 1 players. Each row represents a location (or room), and each column represents a round. Each player plays one game in each round, and each pair of players plays in exactly one location.

Python code

Here’s the code I wrote to create the image above.

    import matplotlib.pyplot as plt
    
    colors = ["red", "orange", "yellow", "green", "blue", "black", "gray", "purple"]
    up, dn = True, False
    
    def draw_triangle(row, col, value, pos):
        if pos == up:
            x = [col, col, col+1]
            y = [row, row+1, row+1]
        else:
            x = [col, col+1, col+1]
            y = [row, row, row+1]
        plt.fill(x, y, colors[value])
    
    sq = [
        (1, 3, 0, 4),
        (1, 5, 3, 5),
        (1, 6, 1, 2),
        (1, 7, 7, 6),
        (2, 2, 6, 3),
        (2, 4, 2, 4),
        (2, 5, 0, 1),
        (2, 6, 7, 5),
        (3, 1, 5, 2),
        (3, 3, 1, 3),
        (3, 4, 6, 0),
        (3, 5, 7, 4),
        (4, 2, 0, 2),
        (4, 3, 5, 6),
        (4, 4, 7, 3),
        (4, 7, 4, 1),
        (5, 1, 6, 1),
        (5, 2, 4, 5),
        (5, 3, 7, 2),
        (5, 6, 3, 0),
        (6, 1, 3, 4),
        (6, 2, 7, 1),
        (6, 5, 2, 6),
        (6, 7, 5, 0),
        (7, 1, 7, 0),
        (7, 4, 1, 5),
        (7, 6, 4, 6),
        (7, 7, 2, 3)
    ]
        
    for t in sq:
        draw_triangle(t[0], t[1], t[2], up)
        draw_triangle(t[0], t[1], t[3], dn)    
    plt.grid()
        
    plt.gca().set_aspect("equal")
    plt.show()

Related post: Balanced tournament designs

HCPCS (“hick pics”) codes

HCPCS stands for Healthcare Common Procedure Coding System. HCPCS codes are commonly pronounced like “hick pics.” I was curious/afraid to see what DALL-E would create with the prompt “HCPCS hick pics” and was pleasantly surprised that it produced the image above. More on that latter.

Searching for medical codes

I occasionally need to search for medical codes—ICD-9 codes, ICD-10 codes, SNOMED codes, etc.—including HCPCS codes. Here’s some data that is supposed to contain a certain kind of code; does it? Or, more importantly, this text is supposed to have been scrubbed of medical codes; was it?

The most accurate way to search for HCPCS codes would be to have a complete list of codes and search for each one. There are a couple reasons why this isn’t so useful. First of all, I’m doing a quick scan, not writing data validation software. So a quick-and-dirty regular expression is fine. In fact, it’s better: I may be more interested in finding things that look like HCPCS codes than actual HCPCS codes. The former would include some typos or codes that were added after my script was written.

Copyright

Another advantage of using a regular expression that approximates HCPCS codes is that HCPCS codes are copyrighted. I don’t know whether a script using a list of HCPCS codes would be fair use, but it doesn’t matter if I’m using a regular expression.

According to Wikipedia, “CPT is a registered trademark of the American Medical Association, and its largest single source of income.” CPT is part of HCPCS, namely Level I. It is controversial that CPT codes are a commercial product whose use is required by law.

Code patterns

There are three levels of HCPCS codes. Level I is synonymous with CPT and consists of five-digit numbers. It’s easy to search for five-digit numbers, say with the regular expression \d{5}, but of course many five-digit numbers are not medical codes. But if text that has supposedly been stripped of medical codes (and zip codes) has a lot of five-digit numbers, it could be something to look into.

Level II and Level III codes are five-character codes consisting of four digits and either an F or a T. So you might search on \d{4}[FT]. While a five-digit number is likely to be a false positive when searching for HCPCS codes, four-digits following by an F or T is more likely to be a hit.

About the image

I edited the image that DALL-E produced two ways. First, I cropped it vertically; the original image had a lot more grass at the bottom.

Second, I lowered the resolution. The images DALL-E generates are very large. Even after cutting off the lower half of the image, the remaining part was over a megabyte. I used Squoosh to reduce the image to 85 kb. Curiously, when I opened the squooshed output in GIMP and lowered the resolution further, the image actually got larger, so went back to the output from Squoosh.

The image looks more natural than anything I’ve made using DALL-E. I look at the image and think it would be nice to be there. But if you look at the barn closely, it has a little of that surreal quality that DALL-E images usually have. This is not a photograph or a painting.

This discussion of AI-generated images is a bit of a rabbit trail, but it does tie in with searching for medical codes. Both are a sort of forensic criticism: is this what it’s supposed to be or is something out of place? When looking for errors or fraud, you have to look at data similarly to the way you might analyze an image.

Related posts

Dominoes in Unicode

I was spelunking around in Unicode and found that there are assigned characters for representing domino tiles and that the characters are enumerated in a convenient order. Here is the code chart.

There are codes for representing tiles horizontally or vertically. And even though, for example, the 5-3 is the same domino as the 3-5, there are separate characters for representing the orientation of the tile: one for 3 on the left and one for 5 on the left.

When you include orientation like this, a domino becomes essentially a base 7 number: the number of spots on one end is the number of 7s and the number of spots on the other end is the number of 1s. And the order of the characters corresponds to the order as base 7 numbers:

0-0, 0-1, 0-2, …, 1-0, 1-1, 1-2, … 6-6.

The horizontal dominoes start with the double blank at U+1F031 and the vertical dominoes start with U+1F063, a difference of 32 in hexadecimal or 50 in base 10. So you can rotate a domino tile by adding or subtracting 50 to its code point.

The following tiny Python function gives the codepoint for the domino with a spots on the left (or top) and b spots on the right (or bottom).

    def code(a, b, wide):
        cp = 0x1f031 if wide else 0x1f063
        return cp + 7*a + b

We can use this function to print a (3, 5) tile horizontally and a (6, 4) tile vertically.

    print( chr(code(3, 5, True )),
           chr(code(6, 4, False)) )

To my surprise, my computer had the fonts installed to display the results. This isn’t guaranteed for such high Unicode values.

horizontal 3-5 domino and vertical 6-4

Field of order 9

This post will give a detailed example of working in a field with nine elements. This is important because finite fields are not often treated concretely except for the case of prime order.

In my first post on Costas arrays I mentioned in a footnote that Lempel’s algorithm works more generally over any finite field, but I stuck to integers modulo primes to make things simpler.

In this post we’ll create a 7 × 7 Costas array by working in a field with 9 elements. Along the way we’ll explain in great detail how finite field arithmetic works, using the field with 9 elements as an example.

Continue reading

Costas arrays in Mathematica

A couple days ago I wrote about Costas arrays. In a nutshell, a Costas array of size n is a solution to the n rooks problem, with the added constraint that if you added wires between the rooks, no two wires would have the same length and slope. See the earlier post for more details.

The earlier post implemented the Lempel algorithm in Python. Here we’ll implement it in Mathematica. Lempel’s algorithm says to start with parameters p and a where p is a prime and a is a primitive root mod p [1]. Then you can create a Costas array of size n = p-2 by filling in the (i, j) square if and only if

ai + aj = 1 mod p.

We can implement this in Mathematica by

    fill[i_, j_, p_, a_] := If[Mod[a^i + a^j, p] == 1, 1, 0]
    m[p_, a_] := Table[fill[i, j, p, a], {i, 1, p - 2}, {j, 1, p - 2}]

We could visualize the Costas array generated by p = 11 and a = 2 using ArrayPlot. The code

    ArrayPlot[m[11, 2], Mesh -> All]

Generates the following image.

Note that the image is symmetric with respect to the main diagonal. That’s because our algorithm is symmetric in i and j. But Costas arrays are not always symmetrical, so this underscores the point that there is no known scalable algorithm for finding all Costas arrays.

In this example we started with knowing that 2 was a primitive root mod 11. We could have had Mathematica pick a primitive root mod p for us by calling PrimitiveRoot[p]

Just out of curiosity, let’s redo the example above, but instead of testing whether

ai + aj = 1 mod p.

we’ll just use ai + aj mod p.

The code

    ArrayPlot[Table[Mod[2^i + 2^j, 11], {i, 1, 9}, {j, 1, 9}]]

produces the following image.

[1] Lempel’s algorithm generalizes to finite fields of any order, not just integers modulo a prime.

Two-letter vs Three-letter Country Abbreviations

The ISO 3166-1 standard defines three codes for each country: a 2-letter abbreviation, a 3-letter abbreviation, and a 3-digit code.

The 2-letter abbreviations may be familiar because it is very often (but not always [1]) also the country code top-level domain (ccTLD). For example, AU is the ISO abbreviation for Australia, and .au is the ccTLD.

I was curious about the relation between the two-letter and three-letter abbreviations. Sometimes the former is a prefix of the latter, such as US and USA. Sometimes the latter adds a letter in the middle, such as going from CN to CHN. How common are these two patterns?

I wrote a script using the iso3166 Python module to find out.

Turns out that the prefix pattern is most common, and occurs 63% of the time.

The infix pattern is the next most common, occurring 29% of the time.

Suffixes are rare. There are only for instances where the 2-letter name is the last two letters of the 3-letter name: ATF, MYT, SPM, and SGS.

The remaining possibilities are miscellaneous relations, such as IL and ISR for Israel.

Here’s a table of countries and ISO codes in plain text (org-mode) format.

[1] There are four ccTLDs that are not ISO 3-166-1 alpha-2 names: uk, su, ac, and eu.

Finding similar world flags with Mathematica

A week ago I posted some pairs of similar flags on Twitter, and later I found that Mathematica’s CountryData database contains flag descriptions. So I thought I’d use the flag descriptions to see which flags Mathematica things are similar.

For example, the FlagDescription attribute for Chad in Mathematica is

Three equal vertical bands of blue (hoist side), yellow, and red; similar to the flag of Romania; also similar to the flags of Andorra and Moldova, both of which have a national coat of arms centered in the yellow band; design was based on the flag of France.

I had Mathematica output a list of countries and flag descriptions, then searched the output for the word “similar.” I then made the following groupings based on the output [1].

Chad / Romania

Chad  

Bolivia / Ghana

Bolivia   Ghana

Colombia / Ecuador

Equador   Columbia

India / Niger

India  

Ireland / Côte d’Ivoire

Ireland   Ivory Coast

El Salvador / Nicaragua / Honduras

El Salvador    

Egypt / Iraq / Syria / Yemen

  Iraq

 

Luxembourg / The Netherlands

 

Andorra / Moldova
Andorra  

Indonesia / Monaco

Indonesia   Monaco

Emoji

Each flag has an emoji, so here are the groupings above using emoji icons

  • 🇹🇩 🇷🇴
  • 🇧🇴 🇬🇭
  • 🇨🇴 🇪🇨
  • 🇮🇳 🇳🇪
  • 🇮🇪 🇨🇮
  • 🇸🇻 🇳🇮 🇭🇳
  • 🇪🇬 🇮🇶 🇸🇾 🇾🇪
  • 🇱🇺 🇳🇱
  • 🇦🇩 🇲🇩
  • 🇮🇩 🇲🇨

Related posts

[1] The groupings are based on Mathematica’s output, but I did some editing. Strictly following Mathematica’s descriptions would have been complicated. For example, Mathematica’s description might say A is similar to B, but not say B is similar to A. Or it might cluster four flags together that could better be split into two pairs.