Classroom exercise with networks

In the previous post I looked at graphs created from representing geographic regions with nodes and connecting nodes with edges if the corresponding regions share a border.

It’s an interesting exercise to recover the geographic regions from the network. For example, take a look at the graph for the continental United States.

It’s easy to identify Alaska in the graph. The node on the left represents Maine because Maine is the only state to border exactly one other state. From there you can bootstrap your way to identifying the rest of the states.

Math class

This could make a fun classroom exercise in a math class. Students will naturally come up with the idea of the degree of a node, the number of edges that meet that node, because that’s a handy way to solve the puzzle: the only possibilities for a node of degree n are states that border n other states.

This also illustrates that networks preserve topology, not geometry. That is, the connectivity information is retained, but the shape is dramatically different.

Geography class

Someone asked me on Twitter to make a corresponding graph for Brazil. Mathematica, or at least my version of Mathematica, doesn’t have data on Brazilian states, so I made an adjacency graph using GraphViz.

adjacency graph of Brazilian states

Labeling the blank nodes is much easier for Brazil than for the US because Brazil has about half as many states, and the topology of the graph gives you more to work with. Three nodes connect to only one other node, for example.

Here the exercise doesn’t involve as much logic, but the geography is less familiar, unless of course you’re more familiar with Brazil than the US. Labeling the graph will require staring at a map of Brazil and you might accidentally learn a little about Brazil.

GraphViz

The labeled version of the graph above is available here. And here are the GraphViz source files that make the labeled and unlabeled versions.

The layout of a GraphViz file is very simple. The file looks like this:

    graph G {

        layout=sfdp

        AC [label="Acre"]
        AL [label="Alagoas"]
        ...
        AC -- AM
        AC -- RO
        ...
    }

There are three parts: a layout, node labels, and connections.

GraphViz has several layout engines, and the sfdp one matched what I was looking for in this case. Other layout options lead to overlapping edges that were confusing.

The node names AC, AL, etc. do not appear in the output. They’re just variable names for your convenience. The text inside the label is what appears in the final output. I’ll give an example in the next post in which it’s very convenient for the variables to be different from the labels. The order of the labels doesn’t matter, only which variables are associated with which labels.

Finally, the lines with variables separated by dashes are the connection data. Here we’re telling GraphViz to connect node AC to nodes AM and RO. The order of these lines doesn’t matter.

Related posts

Adjacency networks

Suppose you want to color a map with no two bordering regions having the same color. If this is a map on a plane, you can do this using only four colors, but maybe you’d like to use more.

You can reduce the problem to coloring the nodes in a graph. Each node corresponds to a region, and there is an edge between two nodes if and only if their corresponding regions share a border.

Here is a sort of topologists’s or graph theorist’s view of the continental United States.

This was created using the following sample code from the Mathematica documentation.

    RelationGraph[MemberQ[#2["BorderingStates"], #1] &, 
        EntityList[
            EntityClass["AdministrativeDivision", "ContinentalUSStates"]]]

You can recognize Maine in the graph because it’s the only state that only borders one other state. Alaska is also easy to locate. Exercise for the reader: mentally add Hawaii to the graph.

The analogous graph for Texas counties took much longer to draw: there are 49 continental US states but 254 Texas counties.

This was created with the following code.

    RelationGraph[MemberQ[#2["BorderingCounties"], #1] &, 
        EntityList[EntityClass["AdministrativeDivision", "USCountiesTexas"]]]

You can find El Paso county in the top left; it only borders one county just as Maine only borders one state.

Related posts

Shortest tours of Eurasia and Oceania

This is the final post in a series of three posts about shortest tours, solutions to the so-called traveling salesmen problem.

The first was a tour of Africa. Actually two tours, one for the continent and one for islands. See this post for the Mathematica code used to create the tours.

The second was about the Americas: one tour for the North American continent, one for islands, and one for South America.

This post will look at Eurasia and Oceania. As before, I limit the tours to sovereign states, though there are disputes over which regions are independent nations. I first tried to do separate tours of Europe and Asia, but this would require arbitrarily categorizing some countries as European or Asian. The distinction between Asia and Oceania is a little fuzzy too, but not as complicated.

Oceania

Here’s a map of the tour of Oceania.

Here’s the order of the tour:

  1. Australia
  2. East Timor
  3. Indonesia
  4. Palau
  5. Papua New Guinea
  6. Micronesia
  7. Marshall Islands
  8. Nauru
  9. Solomon Islands
  10. Vanuatu
  11. Fiji
  12. Tuvalu
  13. Kiribati
  14. Samoa
  15. Tonga
  16. New Zealand

The total length of the tour is 28,528 kilometers or 17,727 miles.

Eurasia

Here’s a map of the the Eurasian tour.

Here’s the order of the tour:

  1. Iceland
  2. Norway
  3. Sweden
  4. Finland
  5. Estonia
  6. Latvia
  7. Lithuania
  8. Belarus
  9. Poland
  10. Czech Republic
  11. Slovakia
  12. Hungary
  13. Romania
  14. Moldova
  15. Ukraine
  16. Georgia
  17. Armenia
  18. Azerbaijan
  19. Turkmenistan
  20. Uzbekistan
  21. Afghanistan
  22. Pakistan
  23. Tajikistan
  24. Kyrgyzstan
  25. Kazakhstan
  26. Russia
  27. Mongolia
  28. China
  29. North Korea
  30. South Korea
  31. Japan
  32. Taiwan
  33. Philippines
  34. East Timor
  35. Indonesia
  36. Brunei
  37. Malaysia
  38. Singapore
  39. Cambodia
  40. Vietnam
  41. Laos
  42. Thailand
  43. Myanmar
  44. Bangladesh
  45. Bhutan
  46. Nepal
  47. India
  48. Sri Lanka
  49. Maldives
  50. Yemen
  51. Oman
  52. United Arab Emirates
  53. Qatar
  54. Bahrain
  55. Saudi Arabia
  56. Kuwait
  57. Iran
  58. Iraq
  59. Syria
  60. Lebanon
  61. Jordan
  62. Israel
  63. Cyprus
  64. Turkey
  65. Bulgaria
  66. North Macedonia
  67. Serbia
  68. Bosnia and Herzegovina
  69. Montenegro
  70. Albania
  71. Greece
  72. Malta
  73. Italy
  74. San Marino
  75. Croatia
  76. Slovenia
  77. Austria
  78. Liechtenstein
  79. Switzerland
  80. Monaco
  81. Andorra
  82. Spain
  83. Portugal
  84. France
  85. Belgium
  86. Luxembourg
  87. Germany
  88. Netherlands
  89. Denmark
  90. United Kingdom
  91. Algeria

The total length of the tour is 61,783 kilometers or 38,390 miles.

Three tours of the Americas

The previous post looked at an optimal tour of continental Africa. This post will give analogous tours of continental North America, North American Islands, and South America. The next post looks at Eurasia and Oceania.

North American Continent

Here’s the North American continental tour.

The order of the tour is as follows.

  1. Canada
  2. United States
  3. Mexico
  4. Guatemala
  5. El Salvador
  6. Costa Rica
  7. Panama
  8. Nicaragua
  9. Honduras
  10. Belize

North American Islands

Here’s a tour of the North American islands.

Trinidad and Tabago is about ten miles from the South American continent, but the country is classified as being part of North America, at least for some purposes.

Here is the order of the tour.

  1. Cuba
  2. Jamaica
  3. Haiti
  4. Dominican Republic
  5. Grenada
  6. Trinidad and Tobago
  7. Barbados
  8. Saint Vincent and the Grenadines
  9. Saint Lucia
  10. Dominica
  11. Antigua and Barbuda
  12. Saint Kitts and Nevis
  13. Bahamas

South American tour

Here’s the tour of South America.

Here’s the order of the tour:

  1. Venezuela
  2. Guyana
  3. Suriname
  4. French Guiana
  5. Brazil
  6. Paraguay
  7. Uruguay
  8. Falkland Islands
  9. Argentina
  10. Chile
  11. Bolivia
  12. Peru
  13. Ecuador
  14. Colombia

A traveling salesman tour of Africa

Suppose you’d like to tour Africa, visiting each country once, then returning to your starting point, minimizing the distance traveled.

Here’s my first attempt at a solution using Mathematica, based on an example in the documentation for FindShortestTour.

    africa = CountryData["Africa"]
    FindShortestTour[africa]
    GeoGraphics[{Thick, Red, GeoPath[africa[[%[[2]]]]]}]


This produced the following map:

Hmm. Maybe I should have been more specific about what I mean by “Africa.” My intention was to find a tour of continental Africa, i.e. not including islands. This means I needed to remove several items from Mathematica’s list of African countries. Also, I had in mind sovereign states, not territories of overseas states and not disputed territories.

After doing this, the map is more like what I’d expect.

The tour is then

  1. Algeria
  2. Tunisia
  3. Libya
  4. Egypt
  5. Chad
  6. Central African Republic
  7. Democratic Republic of the Congo
  8. Burundi
  9. Rwanda
  10. Uganda
  11. South Sudan
  12. Sudan
  13. Eritrea
  14. Djibouti
  15. Somalia
  16. Ethiopia
  17. Kenya
  18. Tanzania
  19. Malawi
  20. Zambia
  21. Mozambique
  22. Zimbabwe
  23. Eswatini
  24. Lesotho
  25. South Africa
  26. Botswana
  27. Namibia
  28. Angola
  29. Republic of the Congo
  30. Gabon
  31. Equatorial Guinea
  32. Cameroon
  33. Nigeria
  34. Niger
  35. Mali
  36. Burkina Faso
  37. Benin
  38. Togo
  39. Ghana
  40. Ivory Coast
  41. Liberia
  42. Sierra Leone
  43. Guinea
  44. Guinea-Bissau
  45. Gambia
  46. Senegal
  47. Mauritania
  48. Morocco

The initial tour, including islands, foreign territories, and Western Sahara, was 23,744 miles or 38,213 kilometers. The second tour was 21,074 miles or 33915 kilometers.

Here’s a tour of just the islands, excluding foreign territories.

The order of the tour is

  1. Cape Verde
  2. Seychelles
  3. Mauritius
  4. Madagascar
  5. Comoros
  6. São Tomé and Príncipe

This tour is 13,034 miles or 20,976 kilometers.

Update: See the next two posts for tours of the Americas and Eurasia and Oceania.

Related posts

Direction between two cities

This post is the third in a series of posts on spherical trigonometry. We first looked at the analog of the Pythagorean theorem on a sphere, then the analog of the law of cosines on a sphere. Now we look at the analog of the law of sines on a sphere.

As before we denote the sides of a triangle with a, b, and c and denote the angle opposite a side by the capital version of that letter. So A is the angle opposite side a and so forth.

The law of sines on a sphere says

sin a / sin A = sin b / sin B = sin c / sin C.

Small triangles

Suppose for a moment that we’re looking at a relatively small triangle, one for which the length of the sides is small relative to the radius of the sphere. Then we can apply the small angle approximation

sin xx

to each side [1] and find

a / sin Ab / sin Bc / sin C.

In other words, the law of sines from plane trigonometry holds approximately in spherical trigonometry for small triangles. This is not surprising since we know we can ignore the curvature of the earth when working over relatively small areas, but it is reassuring.

Example: LAX to IAH

In the previous post we found the air distance between LAX (Los Angeles) and IAH (Houston) airports by making a spherical triangle with vertices at LAX, IAH, and the north pole. We labeled the side from LAX to the north pole a and the side from IAH to the north pole b. We found the length of side c using the law of cosines.

Now suppose we’re at LAX and about to fly to IAH. What direction should we fly? Since Los Angeles and Houston are approximately at the same latitude, and they’re not terribly far apart relative to the size of the earth, we know that we’ll roughly fly due east. Let’s find the flight direction more precisely.

The angle B in our triangle gives us the angle between our heading and a heading of due north. We know b from the latitude of IAH, we know C from the difference in longitude of the two airports, and we found c in the previous post. Therefore we can find B from the law of sines:

sin b / sin B = sin c / sin C

and we know everything but B. So

sin B = sin b sin C / sin c = sin 60.02° sin 23.07° / sin 19.92°.

This tells us sin B = 0.9960. So what is B? If we ask software for the arcsine of 0.9960 we will get 84.88°. This would say we head 5.12° north of due east to get from Los Angeles to Houston. But Houston is at a lower latitude than Los Angeles and so that doesn’t seem right.

Flight map from LAX to IAH

In the first post in this series I mentioned that a flight between Houston and Wuhan would fly over Alaska even though the two cities are at essentially the same latitude. Wuhan is at a slightly higher latitude than Houston, but a flight leaving Wuhan for Houston would head somewhat north. So it is possible that a flight from a higher latitude to a lower latitude would head north [2].

But Los Angeles is much closer to Houston than Wuhan is, and the difference in latitude is greater. A flight from LA to Houston would head mostly east and a little south. So where did we go wrong?

The arcsine function in any software package returns a solution to the equation

sin(x) = y,

but it might not return the solution we need. The angle 84.88° is a valid solution to the equation

sin(B) = 0.9960

but it’s not the appropriate solution in our context. We need the solution 95.12°. Our flight heada 5.12° south of east. This assumes our flight takes a great circle path; airlines may have reasons not to take the shortest possible path between two airports.

***

[1] The approximation assumes x is measured in radians. If we measure angles in degrees or any other units, the right side of the approximation will need to be multiplied by a constant. But the same constant will appear in all the numerators and denominators of the law of sines and so the approximation holds in any units.

[2] Here’s another illustration of the same point. Draw a circle around the north pole, say a circle a mile in diameter, and suppose you want to walk from longitude 0 to longitude 180°. Following a path of constant latitude is walking around the circle. Heading north instead is walking across the diameter of the circle. If your destination is a little bit outside the circle on the other side, it’s still shorter to walk across the circle, i.e. head north, even though your destination is at a lower latitude, i.e. south.

Numeric distance vs geographic distance in zip codes

If two zip codes numbers are close, are the regions they represent close? How much can you tell about how far apart two regions are by comparing their zip codes? (Zip codes are US postal codes. The name is an acronym for “Zone Improvement Plan” and was introduced in 1963.)

To investigate this, I looked at some data on zip codes that I purchased last year. I selected the zip codes with latitude and longitude coordinates [1] and computed the distance between consecutive zip codes, consecutive in the sense of being the next item in the list of assigned zip codes, not necessarily consecutive integers.

The closest consecutive zip codes are 10171 and 10172 in Manhattan. These zip codes are neighboring blocks and so you could say their distance is zero. The distance using the latitude and longitude values in my data is 0.05 miles.

The furthest consecutive zip codes are 96863 in Honolulu, Hawaii and 97001 in Antelope, Oregon. These locations are 2,658 miles apart.

The mean distance between consecutive zip codes is 34.4 miles and the median distance is 24.9 miles. So consecutive zip codes are often fairly close geographically, but not always.

Next I went back and divided the geographic distance between consecutive pair of zip codes by the numerical difference between the codes, taking a sort of “derivative”.

By this measure, the furthest consecutive zip codes are 99685 and 99686, which happen to also be consecutive integers. The both are in Alaska, the form former in Unalaska in the Aleutian Islands and the latter in Valdez on the mainland.

Map of Alaska showing Unalaska and Valdez

The mean and median distance in terms of miles per integer difference is 26.4 and the median is 16.9.

The previous post shows how you could use Hilbert curves to assign zip codes to a grid so that consecutive codes always correspond to neighboring squares.

Related posts

[1] A lot of zip codes don’t have coordinates because they serve an administrative function more than strictly representing a geographic region.

Zip codes, geocodes, and Hilbert curves

You might think that if zip codes are close, then the regions they represent are close. Or that if zip codes are consecutive, then their regions touch. Neither of these are true. I explore how far they are from being true in the next post.

But these statements could have been true [1]. It’s possible to lay a grid on a region and number the squares in the grid in such a way that consecutive numbers correspond to contiguous squares, and nearby numbers correspond to nearby squares. There are geocoding systems that accomplish this using Hilbert curves. For example, Google’s S2 system is based on Hilbert curves.

A Hilbert curve is a curve that winds through a square, coming arbitrarily close to every point in the square. A Hilbert curve is a fractal, defined as the limit of an iterative process. We aren’t concerned with the limit because we only want to carry out a finite number of steps in the construction. But the fact that the limit exists tells us that with enough steps we can get any resolution we want.

To illustrate Hilbert curves and how they could be used to label grids, we will use a Hilbert curve to tour a chessboard. We will number the squares of a chess board so that consecutive numbers correspond to neighboring squares.

Here’s Python code to produce our numbering.

    from hilbertcurve.hilbertcurve import HilbertCurve

    p = 3 # 2^3 squares on a side
    n = 2 # two dimensional cube, i.e. square

    hc = HilbertCurve(p, n)

    for row in range(2**p):
        for col in range(2**p):
            d = hc.distance_from_point([row, col])
            print(format(d,"2d"), " ", end="")
        print()

This produces the following output.

     0   1  14  15  16  19  20  21  
     3   2  13  12  17  18  23  22  
     4   7   8  11  30  29  24  25  
     5   6   9  10  31  28  27  26  
    58  57  54  53  32  35  36  37  
    59  56  55  52  33  34  39  38  
    60  61  50  51  46  45  40  41  
    63  62  49  48  47  44  43  42  

Here’s Python code to draw lines rather than print numbers.

    N = (2**p)**n
    points = hc.points_from_distances(range(N))

    for i in range(1, N):
        x0, y0 = points[i-1]
        x1, y1 = points[i]
        plt.plot([x0, x1], [y0, y1])
    plt.gca().set_aspect('equal')    
    plt.show()

This produces the following image.

Hilbert curve

There’s a small problem. Following the square numberings above produces a Hilbert curve, and the plotting code produces and Hilbert curve, but they’re not the same Hilbert curve. The implicit axes created by our sequence of print statements doesn’t match the geometry in the plotting code. To get the curves to match we need to do a sort of change of coordinates. We can do this by changing the call to plt.plot to

    plt.plot([y0, y1], [8-x0, 8-x1])

This produces the following plot, matching the numbering above.

Hilbert curve

Related posts

[1] It could have been true in the sense that it was theoretically possible. It might not have been practically possible. Postal codes have different design constraints than geocodes.

Decoding a grid square

I saw a reference last night to the grid square EL29fx and wanted to figure out where that is. There are many programs that will do this for you, but I wanted to do it by hand.

I wrote about how grid squares work a year ago, but I was rusty on the details, so this morning I went back and reread my earlier post.

Parsing a locator

The system used to create grid squares in amateur radio is the Maidenhead Locator System. Last year I wrote about why the system works the way it does. Here I will recap how the system works but refer you to the earlier post for the motivation behind the design choices.

First of all, the Maidenhead system interleaves the longitude and latitude information. So in the square EL29fx, the longitude information is (E, 2, f) and the latitude information is (L, 9, x). The first two characters, EL, gives a rough location, a “field.” Then the next two characters, 29, refine the location to a “square.” The last two, fx, narrow the square down to a “subsquare.”[1]

Finding the field

The first character in the grid square specifies a range of longitude east of the International Date Line [2]. Each letter represents a range of 20° longitude: A for 0 to 20° longitude, B for 20° to 40°, etc. So E represents 80° to 100° longitude east of the antemeridian or 260° to 280° east of the prime meridian.

The second character in the grid square represents latitude north of the south pole in increments of 10°. So L represents 110° to 120° north of the south pole, or 20° to 30° north of the equator.

Together, EL tells us we’re looking at some place 260° to 280° E and 20° to 30° N, i.e. somewhere in Mexico or the US Gulf Coast.

Finding the square

The two digits in the middle of a grid square divide the field into 10 parts in both directions. Since a field is 20° wide and 10° tall, the first digit represents multiples of 2° longitude and the second digit represent multiples of 1° latitude.

So if we look at EL29, we know our location is between 264° and 266° E and between 29° and 30° N, which is somewhere around Houston, Texas

Finding the subsquare

A subsquare divides a square into 24 parts in each direction. So a subsquare is 5 minutes of longitude wide and 2.5 minutes of latitude high.

The f in EL29fx refines the longitude is 264° 25′ E (95° 35′ W) and the x refines the latitude is 29° 55′ N. This narrows the location to somewhere in northwest Houston.

Northwest Houston

[1] There’s no significance to the case of the letters other than as a reminder that the first and second pair of letters are interpreted differently.

[2] Technically, degrees east of 180th meridian, a.k.a. the antemeridian, which mostly coincides with the International Date Line. The latter zigzags a little to avoid splitting island groups.

Information theory and coordinates

The previous post explains how the Maidenhead location system works. The first character in a location code specifies the longitude in 20 degree increments and the second character specifies the latitude in 10 degree increments. Both are a letter from A through R that breaks the possible range down into 18 parts. (The longitude range is wider because longitude ranges over 360 degrees but latitude ranges over 180 degrees.)

If you divide a range into 16 equally probable parts, you get four bits of information because 24 = 16. Since we’re dividing longitude and latitude into 18 parts, we get around 4 bits of information from each. But the longitude component carries more information than the latitude component. This post will explain why.

Defining information

The Shannon entropy of a random variable with N possible values, each with probability pi, is defined to be

-\sum_{i=1}^N p_i \log_b \, p_i

where the base b is usually 2. Occasionally b might be another base, such as 2 or 10. (More on that here.) The statement above about 16 equal parts giving 4 bits of information assumes b = 2.

The expected information gain from observing the value of a random variable is measured by the random variable’s Shannon entropy.

Even vs uneven allocation

Suppose we pick a point at random on a sphere. The first component of the point’s Maidenhead location narrows the location of the point to one of 18 equally probable sectors. So each of the ps in the equation above is 1/18. That means the information content in the first component is log2(1/18) = 4.17 bits.

The second component of the location also divides the sphere into 18 parts, but not evenly. A 10 degree band of latitude near the equator covers more area than a 10 degree band of latitude near the polls.

Shannon entropy is maximized when all the ps are equal. Said another way, the amount of surprise from observing a random variable is greatest when all possibilities are equally likely.

Knowing that a random point on a sphere is within 10 degrees of the equator is not as surprising as knowing that it is within 10 degrees of the North Pole because there’s less land within 10 degrees of the North Pole.

Archimedes and Shannon

To calculate the information content in the latitude band, we need to know the area of each band.

Archimedes (c. 287 – 212 BC) discovered that if you have a sphere sitting inside a cylinder, the area of a band on the sphere is proportional to the area of its projection onto the cylinder. That means the area of a band of latitude is proportional to the difference in the z coordinates of the band.

The 18 bands of latitude run from -90° to 90° in increments of 10°. If we number the bands starting with 1 at the South Pole, the probability of landing in the nth band is

(sin(-90° + 10n°) – sin(-90° + 10(n-1)°)/2

and you can calculate from this that the Shannon entropy is 3.97.

So the first component of the location, the longitude, carries 4.17 bits of information, and the second component, the latitude, carries 3.97 bits.