Uncategorized

Precise answers to useless questions

I recently ran across a tweet from Allen Downey saying

So much of 20th century statistics was just a waste of time, computing precise answers to useless questions.

He’s right. I taught mathematical statistics at GSBS [1, 2] several times, and each time I taught it I became more aware of how pointless some of the material was.

I do believe mathematical statistics is useful, even some parts whose usefulness isn’t immediately obvious, but there were other parts of the curriculum I couldn’t justify spending time on [3].

Fun and profit

I’ll say this in partial defense of computing precise answers to useless questions: it can be fun and good for your career.

Mathematics problems coming out of statistics can be more interesting, and even more useful, than the statistical applications that motivate them. Several times in my former career a statistical problem of dubious utility lead to an interesting mathematical puzzle.

Solving practical research problems in statistics is hard, and can be hard to publish. If research addresses a practical problem that a reviewer isn’t aware of, a journal may reject it. The solution to a problem in mathematical statistics, regardless of its utility, is easier to publish.

Private sector

Outside of academia there is less demand for precise answers to useless questions. A consultant can be sure that a client finds a specific project useful because they’re willing to directly spend money on it. Maybe the client is mistaken, but they’re betting that they’re not.

Academics get grants for various lines of research, but this isn’t the same kind of feedback because the people who approve grants are not spending their own money. Imagine a grant reviewer saying “I think this research is so important, I’m not only going to approve this proposal, I’m going to write the applicant a $5,000 personal check.”

Consulting clients may be giving away someone else’s money too, but they have a closer connection to the source of the funds than a bureaucrat has when dispensing tax money.

Notes

[1] When I taught there, GSBS was The University of Texas Graduate School of Biomedical Sciences. I visited their website this morning, and apparently GSBS is now part of, or at least co-branded with, MD Anderson Cancer Center.

There was a connection between GSBS and MDACC at the time. Some of the GSBS instructors, like myself, were MDACC employees who volunteered to teach a class.

[2] Incidentally, there was a connection between GSBS and Allen Downey: one of my students was his former student, and he told me what a good teacher Allen was.

[3] I talk about utility a lot in this post, but I’m not a utilitarian. There are good reasons to learn things that are not obviously useful. But the appeal of statistics is precisely its utility, and so statistics that isn’t useful is particularly distasteful.

Pure math is beautiful (and occasionally useful) and applied math is useful (and occasionally beautiful). But there’s no reason to study fake applied math that is neither beautiful or useful.

Pairs in poker

An article by Y. L. Cheung [1] gives reasons why poker is usually played with five cards. The author gives several reasons, but here I’ll just look at one reason: pairs don’t act like you might expect if you have more than five cards.

In five-card poker, the more pairs the better. Better here means less likely. One pair is better than no pair, and two pairs is better than one pair. But in six-card or seven-card poker, a hand with no pair is less likely than a hand with one pair.

For a five-card hand, the probabilities of 0, 1, or 2 pair are 0.5012, 0.4226, and 0.0475 respectively.

For a six-card hand, the same probabilities are 0.3431, 0.4855, and 0.1214.

For a seven-card hand, the probabilities are 0.2091, 0.4728, and 0.2216.

Related posts

[1] Y. L. Cheung. Why Poker Is Played with Five Cards. The Mathematical Gazette, Dec., 1989, Vol. 73, No. 466 (Dec., 1989), pp. 313–315

Solar system means

Yesterday I stumbled on the fact that the size of Jupiter is roughly the geometric mean between the sizes of Earth and the Sun. That’s not too surprising: in some sense (i.e. on a logarithmic scale) Jupiter is the middle sized object in our solar system.

What I find more surprising is that a systematic search finds mean relationships that are far more accurate. The radius of Jupiter is within 5% of the geometric mean of the radii of the Earth and Sun. But all the mean relations below have an error less than 1%.

\begin{eqnarray*} R_\Mercury &=& \mbox{GM}\left(R_\Moon, R_\Mars\right) \\ R_\Mars &=& \mbox{HM}\left(R_\Moon, R_\Jupiter\right) \\ R_\Uranus &=& \mbox{AGM}\left(R_\Earth, R_\Saturn\right) \\ \end{eqnarray*}

The radius of Mercury equals the geometric mean of the radii of the Moon and Mars, within 0.052%.

The radius of Mars equals the harmonic mean of the radii of the Moon and Jupiter, within 0.08%.

The radius of Uranus equals the arithmetic-geometric mean of the radii of Earth and Saturn, within 0.0018%.

See the links below for more on AM, GM, and AGM.

Now let’s look at masses.

\begin{eqnarray*} M_\Earth &=& \mbox{GM}\left(M_\Mercury, M_\Neptune\right) \\ M_\Pluto &=& \mbox{HM}\left(M_\Moon, M_\Mars\right) \\ M_\Uranus &=& \mbox{AGM}\left(M_\Moon, M_\Saturn\right) \\ \end{eqnarray*}

The mass of Earth is the geometric mean of the masses of Mercury and Neptune, within 2.75%. This is the least accurate approximation in this post.

The mass of Pluto is the harmonic mean of the masses of the Moon and Mars, within 0.7%.

The mass of Uranus is the arithmetic-geometric mean of the masses of the Moon and Saturn, within 0.54%.

Related posts

Are guidance documents laws?

Are guidance documents laws? No, but they can have legal significance.

The people who generate regulatory guidance documents are not legislators. Legislators delegate to agencies to make rules, and agencies delegate to other organizations to make guidelines. For example [1],

Even HHS, which has express cybersecurity rulemaking authority under the Health Insurance Portability and Accountability Act (HIPAA), has put a lot of the details of what it considers adequate cybersecurity into non-binding guidelines.

I’m not a lawyer, so nothing I can should be considered legal advice. However, the authors of [1] are lawyers.

The legal status of guidance documents is contested. According to [2], Executive Order 13892 said that agencies

may not treat noncompliance with a standard of conduct announced solely in a guidance document as itself a violation of applicable statutes or regulations.

Makes sense to me, but EO 13992 revoked EO 13892.

Again according to [3],

Under the common law, it used to be that government advisories, guidelines, and other non-binding statements were non-binding hearsay [in private litigation]. However, in 1975, the Fifth Circuit held that advisory materials … are an exception to the hearsay rule … It’s not clear if this is now the majority rule.

In short, it’s fuzzy.

Update: On June 28, 2024 the US Supreme Court overruled the 1984 decision Chevron v. Natural Resources Defense Council. The Chevron doctrine held that courts should defer to regulatory agencies regarding the interpretation of regulations. In a pair of decisions, the Supreme Court held that the Chevron doctrine was a violation of separation of powers, giving executive agencies judicial power.

 

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition, page 245.

[2] Ibid., page 199.

[3] Ibid., page 200.

 

Breach Safe Harbor

In the context of medical data, Safe Harbor typically refers to the Safe Harbor provisions of the HIPAA Privacy Rule explained here. Breach Safe Harbor is a little different. It basically means you’re off the hook if you breach encrypted health data. (But not necessarily. More on that below.)

I’m not a lawyer, so this isn’t legal advice. Even the HHS, who coin the term “Breach Safe Harbor” in their guidance portal, weasels out of saying they’re giving legal guidance by saying “The contents of this database lack the force and effect of law, except as authorized by law …”

Quality of encryption

You can’t just say that data were encrypted before they were breached. Weak encryption won’t cut it. You have to use acceptable algorithms and procedures.

How can you know whether you’ve encrypted data well enough to be covered Breach Safe Harbor? HHS cites four NIST publications for further guidance. (Not that I’m giving legal advice. I’m merely citing the HHS, who also is not giving legal advice.)

Here are the four publications.

Maybe encryption isn’t enough

At one point Tennessee law said a breach of encrypted data was still a breach. According to Dempsey and Carlin [1]

In 2016, Tennessee repealed its encryption safe harbor, requiring notice of breach of even encrypted data, but then in 2017, after criticism, the state restored a safe harbor for “information that has been encrypted in accordance with the current version of the Federal Information Processing Standard (FIPS) 140-2 if the encryption key has not been acquired by an unauthorized person.”

This is interesting for a couple reasons. First, there is a precedent for requiring notification of encrypted data. Second, this underscores the point above that encryption in general is not sufficient to avoid having to give notice of a breach: standard-compliant encryption is sufficient.

Consulting help

If you would like technical or statistical advice on how to prevent or prepare for a data breach, or how to respond after a data breach after the fact, we can help.

 

[1] Jim Dempsey and John P. Carlin. Cybersecurity Law Fundamentals, Second Edition.

MD5 hash collision example

Marc Stevens gave an example of two alphanumeric strings that differ in only one byte that have the same MD5 hash value. It may seem like beating a dead horse to demonstrate weaknesses in MD5, but it’s instructive to study the flaws of broken methods. And despite the fact that MD5 has been broken for years, lawyers still use it.

The example claims that

TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

and

TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak

have the same hash value.

This raises several questions.

Are these two strings really different, and if so, how do they differ? If you stare at the strings long enough you can see that they do indeed differ by one character. But how could you compare long strings like this in a more automated way?

How could you compute the MD5 hash values of the strings to verify that they are the same?

The following Python code addresses the questions above.

from hashlib import md5
from difflib import ndiff

def showdiff(a, b):
    for i,s in enumerate(ndiff(a, b)):
        if s[0]==' ': continue
        elif s[0]=='-':
            print(u'Delete "{}" from position {}'.format(s[-1],i))
        elif s[0]=='+':
            print(u'Add "{}" to position {}'.format(s[-1],i))    

a = "TEXTCOLLBYfGiJUETHQ4hAcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak"
b = "TEXTCOLLBYfGiJUETHQ4hEcKSMd5zYpgqf1YRDhkmxHkhPWptrkoyz28wnI9V0aHeAuaKnak"

showdiff(a, b)

ahash = md5(a.encode('utf-8')).hexdigest()
bhash = md5(b.encode('utf-8')).hexdigest()
assert(ahash == bhash)

The basis of the showdiff function was from an answer to a question on Stack Overflow.

The output of the call to showdiff is as follows.

Delete "A" from position 21
Add "E" to position 22

This means we can form string b from a by changing the ‘A’ in position 21 to an ‘E’. There was only one difference between the two strings in this example, but showdiff could be useful for understanding more complex differences.

The assert statement passes because both strings hash to faad49866e9498fc1719f5289e7a0269.

Related posts

Unique letter patterns in words

The word Mississippi has a unique pattern of letters. If you were solving a cryptogram puzzle and saw ZVFFVFFVCCV you might guess that the word is Mississippi.

Is the pattern of letters in Mississippi literally unique or just uncommon? What is the shortest word with a unique letter pattern? The longest word?

We can answer these questions by looking at normalized cryptograms, a sort of word signature. These are formed by replacing the first letter in a word with ‘A’, the next unique letter with ‘B’, etc. The normalized cryptogram of Mississippi is ABCCBCCBDDB.

The set of English words is fuzzy, but for my purposes I will take “words” to mean the entries in the dictionary file american-english on my Linux box, removing words that contain apostrophes or triple letters. I computed the cryptogram of each word, then looked for those that only appear once.

Relative to the list of words I used, yes, Mississippi is unique.

The shortest word with a unique cryptogram is eerie. [1]

The longest word with a unique cryptogram is ambidextrously. Every letter in this 14-letter word appears only once.

[1] Update: eerie is a five-letter example, but there are more. Jack Kennedy pointed out amass, llama, and mamma in the comments. I noticed eerie because its cryptogram comes first in alphabetical order.

Emails moved to Substack

Substack icon

Until recently I used two email services: one to send out daily blog post announcements and another for monthly blog highlights. I’ve combined these into one Substack account for weekly blog highlights.

Apparently readers really like this move. Daily and monthly email subscriptions flatlined some time ago, but Substack subscriptions are going up steadily.

Substack is a kind of hybrid of RSS and Twitter. Like RSS, you can subscribe to articles. But like Twitter (and Mastodon, and many other platforms) you can also have a timeline of brief messages, which Substack calls notes.

Only articles trigger an email. You have to use the Substack app to see notes. I think this will work out well. My plan for now is to write a Substack article about once a week with blog highlights, and use notes to announce posts as they come out.

 

Music of the spheres

The idea of “music of the spheres” dates back to the Pythagoreans. They saw an analogy between orbital frequency ratios and musical frequency ratios.

HD 110067 is a star 105 light years away that has six known planets in orbital resonance. The orbital frequencies of the planets are related to each other by small integer ratios.

The planets, starting from the star, are labeled b, c, d, e, f, and g. In 9 “years”, from the perspective of g, the planets complete 54, 36, 24, 16, 12, and 9 orbits respectively. So the ratio of orbital frequencies between each pair of consecutive planets are either 3:2 or 4:3. In musical terms, these ratios are fifths and fourths.

In the chord below, the musical frequency ratios are the same as the orbital frequency ratios in the HD 110067 system.

Here’s what the chord sounds like on a piano:

hd11067.wav

Related posts

Substack replacing email subscription

The service that sent out my email to blog subscribers stopped working a couple weeks ago, and I’m trying out Substack as a replacement. You can find my Substack account here.

My plan for now is to use this account to make blog post announcements, maybe once a week, with a little introductory commentary for each link. I expect to adjust course in response to feedback. Maybe I’ll write some posts just on Substack, but for now I intend to use it as a place to post blog round-ups.

The Stubstack is free and I have no intention to ever charge for it. It’s possible I might make some premium add-on in the future, but I doubt it. If I did, the blog post round-up would remain free.