Reproducible randomized controlled trials

“Reproducible” and “randomized” don’t seem to go together. If something was unpredictable the first time, shouldn’t it be unpredictable if you start over and run it again? As is often the case, we want incompatible things.

But the combination of reproducible and random can be reconciled. Why would we want a randomized controlled trial (RCT) to be random, and why would we want it to be reproducible?

One of the purposes in randomized experiments is the hope of scattering complicating factors evenly between two groups. For example, one way to test two drugs on a 1000 people would be to gather 1000 people and give the first drug to all the men and the second to all the women. But maybe a person’s sex has something to do with how the drug acts. If we randomize between two groups, it’s likely that about the same number of men and women will be in each group.

The example of sex as a factor is oversimplified because there’s reason to suspect a priori that sex might make a difference in how a drug performs. The bigger problem is that factors we can’t anticipate or control may matter, and we’d like them scattered evenly between the two treatment groups. If we knew what the factors were, we could assure that they’re evenly split between the groups. The hope is that randomization will do that for us with things we’re unaware of. For this purpose we don’t need a process that is “truly random,” whatever that means, but a process that matches our expectations of how randomness should behave. So a pseudorandom number generator (PRNG) is fine. No need, for example, to randomize using some physical source of randomness like radioactive decay.

Another purpose in randomization is for the assignments to be unpredictable. We want a physician, for example, to enroll patients on a clinical trial without knowing what treatment they will receive. Otherwise there could be a bias, presumably unconscious, against assigning patients with poor prognosis if the physicians know the next treatment be the one they hope or believe is better. Note here that the randomization only has to be unpredictable from the perspective of the people participating in and conducting the trial. The assignments could be predictable, in principle, by someone not involved in the study.

And why would you want an randomization assignments to be reproducible? One reason would be to test whether randomization software is working correctly. Another might be to satisfy a regulatory agency or some other oversight group. Still another reason might be to defend your randomization in a law suit. A physical random number generator, such as using the time down to the millisecond at which the randomization is conducted would achieve random assignments and unpredictability, but not reproducibility.

Computer algorithms for generating random numbers (technically pseudo-random numbers) can achieve reproducibility, practically random allocation, and unpredictability. The randomization outcomes are predictable, and hence reproducible, to someone with access to the random number generator and its state, but unpredictable in practice to those involved in the trial. The internal state of the random number generator has to be saved between assignments and passed back into the randomization software each time.

Random number generators such as the Mersenne Twister have good statistical properties, but they also carry a large amount of state. The random number generator described here has very small state, 64 bits, and so storing and returning the state is simple. If you needed to generate a trillion random samples, Mersenne Twitster would be preferable, but since RCTs usually have less than a trillion subjects, the RNG in the article is perfectly fine. I have run the Die Harder random number generator quality tests on this generator and it performs quite well.



Image by Ilmicrofono Oggiono, licensed under Creative Commons

Joukowsky transformation

The Joukowsky transformation, or Joukowsky map, is a simple function that comes up in aerospace and electrical engineering calculations.

f(z) = \frac{1}{2} \left( z + \frac{1}{z} \right)

(Here z is a complex variable.) Despite its simplicity, it’s interesting to look at in detail.

Mapping lines and circles

Let zr exp(iθ) and let w = uiv be its image. Writing the Joukowsky transformation in terms of its real and complex parts makes it easier to understand how it transforms lines and circles.

r\exp(i\theta) \mapsto (u,v) = \left( \frac{1}{2} \left(r + \frac{1}{r}\right) \cos\theta, \frac{1}{2} \left(r - \frac{1}{r}\right) \sin\theta \right)

We can see how circles are mapped by holding r constant and letting θ vary. The unit circle gets crushed to the interval [-1, 1] on the real axis, traversed twice. Circles of radius ρ ≠ 1 get mapped to ellipses

\frac{u^2}{a^2} + \frac{v^2}{b^2} = 1


a &=& \frac{1}{2}\left(r + \frac{1}{r}\right) \\ b &=& \frac{1}{2}\left(r - \frac{1}{r}\right)

Next we hold θ constant and let r vary. If sin θ = 0 and cos θ > 0 then the image is [1, ∞). If sin θ = 0 and cos θ < 0 then the image is (-∞, -1]. In both cases the image is traced out twice. The image of the line with cos θ = 0 is the vertical axis. Otherwise lines through the origin are mapped to hyperbolas with equation

\frac{u^2}{\cos^2\theta} - \frac{v^2}{\sin^2\theta} = 1

Inverse functions

If (z + 1/z)/2 = w then z2 -2wz + 1 = 0. The discriminant of this equation is 4(w2 – 1) and so the Joukowsky transformation is 2-to-1 unless w = ± 1, in which case z = ± 1. The product of two roots of a quadratic equation equals the constant term divided by the leading coefficient. In this case, the product is 1. This says the Joukowski transformation is 1-to-1 in any region that doesn’t contain both z and 1/z. This is the case for the interior or exterior of the unit circle, or of the upper or lower half planes. In any of those four regions, one can invert the Joukowski transformation by solving a quadratic equation and choosing the correct root.

General birthday problem

The birthday problem, sometimes called the birthday paradox, says that it’s more likely than you’d expect that two people in a group have the same birthday. Specifically, in a random sample of 23 people, there’s about a 50-50 chance that two people share the same birthday.

The birthday problem makes a nice party trick, but generalizations of the problem come up frequently in applications. I wrote in the previous post how it comes up in seeding distributed Monte Carlo simulations. In computer science, it’s a concern in hashing algorithms.

If you have a set of N things to choose from, such as N = 365 birthdays, and take r samples, the probability that all r samples are unique is

p = \frac{N!}{N^r (N-r)!}

and the probability that at least two of the samples are the same is 1 – p. (This assumes that N is at least as big as r. Otherwise the denominator is undefined, but in that case we know p is 0.)

With moderately large values of N and r the formula is likely to overflow if implemented directly. So as usual the trick is to use logarithms to avoid overflow or underflow. Here’s how you could compute the probability above in Python. SciPy doesn’t have a log factorial function, but does have a log gamma function, so we use that instead.

    from scipy import exp, log
    from scipy.special import gammaln

    def prob_unique(N, r):
        return exp( gammaln(N+1) - gammaln(N-r+1) - r*log(N) )

Related: How to calculate binomial probabilities

Random number generator seed mistakes

Long run or broken software?

I got a call one time to take a look at randomization software that wasn’t randomizing. My first thought was that the software was working as designed, and that the users were just seeing a long run. Long sequences of the same assignment are more likely than you think. You might argue, for example, that the chances of flipping five heads in a row would be (1/2)5 = 1/32, but that underestimates the probability because a run could start at any time. The chances that the first five flips are heads would indeed be 1/32. But the probability of seeing five heads in a row any time during a series of flips is higher.

Most of the times that I’ve been asked to look at randomization software that “couldn’t be right,” the software was fine. But in this case, there wasn’t simply a long run of random results that happened to be equal. The results were truly constant. At least for some users. Some users would get different results from time to time, but others would get the same result every single time.

trick die

The problem turned out to be how the software set the seed in its random number generator. When the program started up it asked the user “Enter a number.” No indication of what kind of number or for what purpose. This number, unbeknownst to the user, was being used as the random number generator seed. Some users would enter the same number every time, and get the same randomization result, every time. Others would use more whimsy in selecting numbers and get varied output.

How do you seed a random number generator in a situation like this? A better solution would be to seed the generator with the current time, though that has drawbacks too. I write about that in another post.

Seeding many independent processes

A more subtle problem I’ve seen with random number generator seeding is spawning multiple processes that each generate random numbers. In a well-intentioned attempt to give each process a unique seed, the developers ended up virtually assuring that many of the processes would have exactly the same seed.

If you parallelize a huge simulation by spreading out multiple copies, but two of the processes use the same seed, then their results will be identical. Throwing out the redundant simulation would reduce your number of samples, but not noticing and keeping the redundant output would be worse because it would cause you to underestimate the amount of variation.

To avoid duplicate seeds, the developers used a random number generator to assign the RNG seeds for each process. Sounds reasonable. Randomly assigned RNG seeds give you even more random goodness. Except they don’t.

The developers had run into a variation on the famous birthday problem. In a room of 23 people, there’s a 50% chance that two people share the same birthday. And with 50 people, the chances go up to 97%. It’s not certain that two people will have the same birthday until you have 367 people in the room, but the chances approach 1 faster than you might think.

Applying the analog of the birthday problem to the RNG seeds explains why the project was launching processes with the same seed. Suppose you seed each process with an unsigned 16-bit integer. That means there are 65,536 possible seeds. Now suppose you launch 1,000 processes. With 65 times as many possible seeds as processes, surely every process should get its own seed, right? Not at all. There’s a 99.95% chance that two processes will have the same seed.

In this case it would have been better to seed each process with sequential seeds: give the first process seed 1, the second seed 2, etc. The seeds don’t have to be random; they just have to be unique. If you’re using a good random number generator, the outputs of 1,000 processes seeded with 1, 2, 3, …, 1000 will be independent.

Help with randomization

If you need help with randomization, please contact me. I can help you avoid subtle errors and have randomization procedures that will stand up to scrutiny.

Bounding a graph’s diameter by its spectrum

You can get upper and lower bounds on the diameter of a connected graph G from its spectrum.

If G has r distinct eigenvalues—whether of the adjacency matrix A, Laplacian L, or signless Laplacian Q—then the its diameter d is at most r-1. (Definitions of these matrices here.)

If G has n vertices then we have an lower bound on the diameter: d ≥ 4/nλ2 where λ2 is the algebraic connectivity, the second eigenvalue of the graph Laplacian.

Let Δ be the maximum degree of G. Then we also have the following upper bound:

d \leq 2 \left\lceil \sqrt{\frac{2\Delta}{\lambda_2}} \log_2 n \right\rceil

How well do these bounds work? Let’s look at a random graph with 50 vertices.

The diameter of this graph is 3. The highest vertex degree is 22.

Each of the three matrices A, L, and Q have 50 distinct eigenvalues. So that says the diameter should be less than 49, which it certainly is.

The lower bound based on algebraic connectivity is 0.0129, and the upper bound is 32.

Altogether, we estimated the diameter, whose true value is 3, to be between 0.0129 and 32. Which is correct, but not very tight.

Maybe the bounds weren’t very good because the diameter was small. Let’s look at graph with a much bigger diameter, a circle with 50 nodes.

Now this graph has diameter 25. The maximum node degree is 2, because every node has degree 2.

Each of the three matrices A, L, and Q have 26 distinct eigenvalues, which predicts that the graph diameter will be no more than 25. In this case, the upper bound is exact.

The algebraic connectivity gives a lower bound of 5.0727 for the diameter and an upper bound of 180. So while these bounds are correct, they are not especially tight either.

Source: Handbook of Linear Algebra


Graph regularity and Laplace eigenvalues

You can tell whether a graph is regular, or nearly regular, by looking at its eigenvalues.

Let G be a graph with n vertices and m edges and let L be the Laplacian matrix of G. Then the sum of the eigenvalues of L is 2m. (The diagonal of L contains the degrees of each node, so it’s trace is twice the number of edges. And the trace of a matrix is the sum of its eigenvalues.)

Let λi be the eigenvalues of the L. Then we have

\sum \lambda_i = 2m \leq \sqrt{n \sum \lambda_i(\lambda_i - 1)}

with equality if and only if G is regular. [reference]

We can try this out by starting with a dodecahedron graph. It’s a regular graph, since each vertex has three edges. There are a total of 30 edges and the eigenvalues sum to 60. The right-most expression is 60 as expected.

Next remove one of the edges from the graph. The result bears a remarkable resemblance to Iron Man.

Now there are 29 edges, so 2m = 58. The right-most expression above is now 58.31. It’s bigger than 58 because the new graph is not regular. But as expected, it’s close to 58 since the graph is close to regular.

Consulting for complex networks

Next areas of math to be applied

Not that long ago number theory was considered strictly pure math. Then came applications to cryptography. Now number theory is at the foundation of the online economy.

What are the next areas of pure math to find widespread application? Some people are saying algebraic topology and category theory.

[I saw a cartoon to this effect the other day but I can’t find it. If I remember correctly, someone was standing on a hill labeled “algebraic topology” and looking over at hills in the distance labeled with traditional areas of applied math. Differential equations, Fourier analysis, or things like that. If anybody can find that cartoon, please let me know.]

Algebraic topology

The big idea behind algebraic topology is to turn topological problems, which are hard, into algebraic problems, which are easier. For example, you can associate a group with a space, the fundamental group, by looking at equivalence classes of loops. If two spaces have different fundamental groups, they can’t be topologically equivalent. The converse generally isn’t true: having the same fundamental group does not prove two spaces are equivalent. There’s some loss of information going from topology to algebra, which is a good thing. As long as information you need isn’t lost, you get a simpler problem to work with.

Fundamental groups are easy to visualize, but hard to compute. Fundamental groups are the lowest dimensional case of homotopy groups, and higher dimensional homotopy groups are even harder to compute. Homology groups, on the other hand, are a little harder to visualize but much easier to compute. Applied topology, at least at this point, is applied algebraic topology, and more specifically applied homology because homology is practical to compute.

People like Robert Ghrist are using homology to study, among other things, sensor networks. You start with a point cloud, such as the location of sensors, and thicken the points until they fuse into spaces that have interesting homology. This is the basic idea of persistent homology.  You’re looking for homology that persists over some range of thickening. As the amount of thickening increases, you may go through different ranges with different topology. The homology of these spaces tells you something about the structure of the underlying problem. This information might then be used as features in a machine learning algorithm. Topological invariants might prove to be useful features for classification or clustering, for example.

Most applications of topology that I’ve seen have used persistent homology. But there may be entirely different ways to apply algebraic topology that no one is looking at yet.

Category theory

Category theory has been getting a lot of buzz, especially in computer science. One of the first ideas in category theory is to focus on how objects interact with each other, not on their internal structure. This should sound very familiar to computer scientists: focus on interface, not implementation. That suggests that category theory might be useful in computer science. Sometimes the connection between category theory and computer science is quite explicit, as in functional programming. Haskell, for example, has several ideas from category theory explicit in the language: monads, natural transformations, etc.

Outside of computer science, applications of category theory are less direct. Category theory can guide you to ask the right questions, and to avoid common errors. The mathematical term “category” was borrowed from philosophy for good reason. Mathematicians seek to avoid categorical errors, just as Aristotle and Kant did. I think of category theory as analogous to dimensional analysis in engineering or type checking in software development, a tool for finding and avoiding errors.

I used to be very skeptical of applications of category theory. I’m still skeptical, though not as much. I’ve seen category theory used as a smoke screen, and I’ve seen it put to real use. More about my experience with category theory here.

* * *

Topology illustration from Barcodes: The persistent topology of data by Robert Ghrist.

Category theory diagram from Category theory for scientists by David Spivak

Compressing ten years into six months

The other day I ran across a line from Peter Thiel saying that if you have a plan for where you’d like to be in ten years, ask yourself if you could get there in six months.

I don’t think he’s simply saying see if you can do everything 20 times faster. If you estimate something will take ten days, it probably will take more than half a day. We’re better at estimating things on the scale of days than on the scale of years.

If you expect to finish a project in ten days, you’re probably going to go about it the way you’ve approached similar projects before. There’s not a lot of time for other options. But there are a lot of ways to go about a decade-long project.

Since Thiel is well known for being skeptical of college education, I imagine one of the things he had in mind was starting a company in six months rather than going to college, getting an entry level job, then leaving to start your company.

As I wrote in an earlier post, some things can’t be done slowly.

Some projects can only be done so slowly. If you send up a rocket at half of escape velocity, it’s not going to take twice as long to get where you want it to go. It’s going to take infinitely longer.

Some projects have to be done quickly if they are going to be done at all. Software projects are usually like this. If a software project is expected to take two years, I bet it’ll take five, if it’s not cancelled before then. You have to deliver software faster than the requirements change. Of course this isn’t unique to software. To be successful, you have to deliver any project before your motivation or your opportunity go away.

Understanding a graph by peeling away nodes

One way to get some understanding of a graph is to peel away nodes, starting with the least connected. The k-core of a graph is the maximal subgraph such that every vertex has degree at least k. The k-shell is the set of vertices that are part of the k-core but not part of the (k+1)-core.

For example, the 0-core of a graph is simply the entire graph since every vertex has at least zero edges; a vertex can’t have a negative number of edges. The 1-core of a graph contains the vertices that are connected to other vertices. You form the 1-core by throwing away the 0-shell, the set of isolated vertices. You move from the 1-core to the 2-core by removing nodes of degree 1 until everything that’s left has degree at least 2.

NB: You do not form the k-core simply by discarding nodes of degree less than k. For example, consider the star graph below.

star graph with five peripheral nodes

If you peel off the the nodes of degree 1, eventually you have nothing left. There is no 2-core, even though the original graph had a node of degree 5. Or going back to the definition, there is no subgraph such that every vertex has degree 2, and so no maximal subgraph. The node of degree 5 only has degree five because it is connected to nodes of degree 1.

The k-cores of a complete graph of degree n are simply the entire graph, for k ≤ n. The k-shells are empty for k < n.

Keyword overlap

Let’s see what the k-cores and k-shells look like for some more complex graphs. First, let’s go back to the post looking at the overlap in keywords.  The edges of the graph are the 200 search terms that have lead visitors to the site at least twice. We connect two search terms if they share a word. The 56-core of the graph has 57 vertices, and the 57-core is empty. So the 56-core is a complete graph. Here’s the distribution of the k-shell sizes.

Random graphs

Next let’s look at a random graph. As in the post on spectra of random graphs, we look at an Erdős-Rényi graph Gn,p. First we start with n = 1000 and p = 0.01. So we start with 1000 nodes, and there’s a 0.01 that any particular pair of nodes is connected. Here are the core sizes for one such random graph:

[1000 1000  999  999  990  970  914  793]

So the 7-core has 793 nodes, then the 8-core is empty. I ran this ten times and the 7-core sizes ranged from size 709 to 849. But every time the 8-core was empty. The largest value of k such that the k-core is non-empty is called the degeneracy of the graph, so we could say the degeneracy was 7 every time.

I reran with n = 2000. The degeneracy was 14 every time. The 14-cores were in the range 1723 to 1810.

Finally I set n = 4000 and ran again. The degeneracy was 29 three times and 30 seven times. The final core sizes ranged from 3451 to 3714.

The experiments above suggest that the k-core size abruptly drops to zero, at a predictable value of k, and with a fairly predictable core size. There are papers on k-core size of random graphs, but I have not yet read any. (I have one in my to-read queue that someone let me know about after posting this article.)

* * *

Consulting for complex networks

New prime number record

Last year I wrote a couple posts about what was then the largest known prime, 257885161 – 1. Now there’s a new recordP = 274207281 – 1.

For most of the last 500 years, the largest known prime has been a Mersenne prime, a number of the form 2p – 1 where p is itself prime. Such numbers are not always prime. For example, 211 − 1 = 2047 = 23 × 89.

Euclid proved circa 300 BC that if M is Mersenne prime, then M(M + 1)/2 is a perfect number, i.e. it is equal to the sum of the divisors less than itself. Euler proved two millennia later that all even perfect numbers must have this form. Since no odd perfect numbers have been discovered, the discovery of a new Mersenne prime implies the discovery of a new perfect number.

Using the same methods as in the posts from last year, we can say some things about how P appears in various bases.

In binary, P is written as a string of 74,207,281 ones.

In base four, P is a 1 followed by 37,103,640 threes.

In base eight, P is a 1 followed by 24,735,760 sevens.

In base 16, P is a 1 followed by 18,551,820 F’s.

In base 10, P has 22,338,618 digits, the first of which is 3 and the last of which is 1.

Related post: Five interesting things about Mersenne primes

Bipartite graphs and the signless Laplacian

The vertices of a bipartite graph can be divided into two sets where edges only go from one set to the other; there are no edges within each set. For example, in the graph below edges connect blue to green nodes, but not blue to blue or green to green.

In this example there are two connected bipartite components. There is a variation on the graph Laplacian called the signless Laplacian whose eigenvalues can tell you how many bipartite components a graph has.

The graph Laplacian is LD – A where D is the diagonal matrix whose entries are the degrees of each node and A is the adjacency matrix. The signless Laplacian is QDA. Since D and A have only non-negative entries, so does Q.

The smallest eigenvalue of the Laplacian L is always zero, and the second smallest eigenvalue is positive if and only if the graph is connected. The smallest eigenvalue of the signless Laplacian Q is positive if and only if the graph is connected and bipartite. The multiplicity of zero as an eigenvalue equals the number of connected bipartite components.

The graph above has two connected bipartite components. Zero is a double eigenvalue of the signless Laplacian, and the next eigenvalue is 0.2603.

Next we connect two of the blue nodes. Now the graph is connected and bipartite.

Zero is now a single eigenvalue, and the second eigenvalue is 0.0968. This second eigenvalue is positive because there is only one component now, but it is small because the graph can almost be separated into two bipartite components.

Next we remove the edge connecting the two components, but we draw an edge between two of the green vertices.

Now there are two components, but only one of them is bipartite. The smallest eigenvalue is zero, with multiplicity 1, and the second eigenvalue is 0.2015.

Next we repeat the examples above with more edges. Now each component of the graph is a complete bipartite graph.

The smallest eigenvalue is 0, with multiplicity 2. But now the next eigenvalue is 3, This value is larger than before because the components are better connected.

As before, we now join two of the blue nodes.

Now there is one bipartite component. The smallest eigenvalue is 0 with multiplicity 1. The second eigenvalue is 0.2103.

Finally, we remove the edge connecting the two components and connect two of the green vertices.

Now the smallest eigenvalue is 0 with multiplicity 1. While there are two components, there’s only one bipartite component. The next eigenvalue is 0.4812.

Related posts:

Spectra of random graphs

Create a graph by first making a set of n vertices and then connecting nodes i and ji > j, with probability p. This is known as the Erdős-Rényi graph Gn,p. Here’s such a graph with n = 100 and p = 0.1.

Erdős-Rényi graph

Then the spectrum of the adjacency matrix A of the graph bounces around the spectrum of the expected value of A. This post will illustrate this to give an idea how close the two spectra are.

The adjacency matrix A for our random graph has zeros along the diagonal because the model doesn’t allow loops connecting a vertex with itself. The rest of the entries are 1 with probability p and 0 with probability 1-p. The entries above the diagonal are independent of each other, but the entries below the diagonal are determined by the entries above by symmetry. (An edge between i and j is also an edge between j and i.)

The expected value of A is a matrix with 0’s along the diagonal and p‘s everywhere else. This matrix has one eigenvalue of p(n – 1) and (n – 1) eigenvalues –p. Let’s see how close the spectra of a dozen random graphs come the spectra of the expected adjacency matrix. As above, n = 100 and p = 0.1.

Because the eigenvalues are close together, I jittered each sample horizontally to make the values easier to see. Notice that the largest eigenvalue in each sample is around 10. The expected value is p(n – 1) = 9.9, so the values are close to their expected value.

The other eigenvalues seem to be centered around 0. This is consistent with the expected value of  –p = -0.1.  The most obvious thing about the smaller eigenvalues is not their mean but their range. How far can might the eigenvalues be from their expected value? Fan Chung and Mary Radcliffe give the following estimate in their paper On the Spectra of General Random Graphs.

For  Gn,p, if  p > 8 log(√2 n)/9n, then with probability at least 1 – 1/n we have

|\lambda_i(A) - \lambda_i(p(J-I))| \leq \sqrt{8np \log(\sqrt{2}n)}

Here J is the matrix of all 1’s and I is the identity matrix. In our case the theorem says that with probability at least 0.99, we expect the eigenvalues of our random graph to be within 19.903 of their expected value. This is a pessimistic bound since no value is much more than 5 away from its expected value.

Matrix Pythagorean Triples

A Pythagorean triple is a list of positive integers (abc) such that a2b2c2. Euclid knew how to find all Pythagorean triples: pick two positive integers m and n with m > n and let

a = m2n2, b = 2mnc = m2 + n2.

Now what if we look at matrices with integer entries such that A2B2C2? Are there any? Yes, here’s an example:

\left[ \begin{array}{rr} 11 & -12 \\ 4 & -3 \end{array} \right]^2 + \left[ \begin{array}{rr} 36 & -48 \\ 16 & -20 \end{array} \right]^2 = \left[ \begin{array}{rr} 37 & -48 \\ 16 & -19\end{array} \right]^2

How can you find matrix-valued Pythagorean triples? One way is to create diagonal matrices with Pythagorean triple integers. Take a list of n Pythagorean triples (aibici) and create diagonal matrices A, B, and C with the ai along the diagonal of A, bi along the diagonal of B, and ci along the diagonal of C.

But that seems cheap. We’re really not using any properties of matrix multiplication other than the fact that you can square a diagonal matrix by squaring its elements. One way to create more interesting matrix Pythagorean triples is to start with the diagonal matrices described above and conjugate them by some matrix Q with integer entries such that Q-1 also has integer entries. Here’s why that works:

(QAQ^{-1})^2 + (QBQ^{-1})^2 &=& QA^2Q^{-1} + QB^2Q^{-1} \\ &=& Q(A^2 + B^2)Q^{-1} \\ &=& QC^2Q^{-1}

The example above was created by starting with the Pythagorean triples (3, 4, 5) and (5, 12, 13), creating diagonal matrices, and conjugating by

Q = \left[ \begin{array}{rr} 3 & 2 \\ 2 & 1 \end{array} \right]

This gives a way to create lots of matrix Pythagorean triples, but there may triples that this procedure leaves out.

Related: Approximations with Pythagorean triangles

Visualizing search keyword overlap

The other day someone asked me to look at the search data for Debug Pest Control, a pest management company based in Rhode Island. One of the things I wanted to visualize was how the search terms overlapped with each other.

To do this, I created a graph where the nodes are the keywords and edges join nodes that share a word. (A “keyword” is actually a set of words, whatever someone types into a search.) The result was a solid gray blob be cause the the nodes are so densely connected.

I thinned the graph by first only looking at keywords that have occurred at least twice. Then I made my connection criteria a little stronger. I take the union of the individual words in two keywords and create a connection if at least 1/3 of these words appear in both keywords.

(You can click on the image to see a larger version.)

The area of each red dot is proportional to the number of times someone has come to the site using that keyword. You can see that there’s long-tail behavior: while some dots are much larger than others, there are a lot of little dots.

In case you’re curious, these are the top keywords:

  1. pest control
  2. ri
  3. pest control ri
  4. pest control rhode island,
  5. plants that keep bees away
  6. poisonous snakes in ri
  7. plants that keep bees and wasps away
  8. plants to keep bees away
  9. plants that keep wasps away
  10. pest control companies


The keywords were filtered a little before making the graph. Some of the top keywords contained “debug.” These may have been people using the term to refer to eliminating pests from their property, but more likely they were people who knew the company name. By removing “debug” from each of the keywords, the results focus on people searching for the company’s services rather than the company itself.