
How to memorize a Bitcoin address

The latest episode of Darknet Diaries interviews someone using the pseudonym Default. He says in the interview that he had nearly a thousand Bitcoins (about $36 M) in a wallet stored on an external hard drive that was seized by federal agents when they raided his home. Default went to prison for five years for some audacious hacking, but now he’s out and the feds won’t give him his hard drive back.

Default isn’t the only person who has lost a fortune because he lost a number. It’s fairly common.

If Default had memorized his account number(s) he’d be rich today. Memorizing a Bitcoin address or a private key would take some work, but I’d do it for a million dollars. I’d do it for a whole lot less than that because it’s not that much work.

I’ll present two ways to memorize addresses and keys. I’ll start with addresses.

How long is a Bitcoin address?

A Bitcoin address is essentially [1] a 200-bit number encoded in base 58. Why base 58? That’s 10 digits, 26 lower case letters, and 26 upper case letters, with four visually similar characters removed. The characters 1, l, and I are similar, as are 0, o, and O. Base 58 keeps the numeral 1 and the lower case letter o and discards the characters that look similar to these. More on base 58 encoding here.

So you can think of a Bitcoin address as either a 200-bit number or a string of 34 alphanumeric characters.

Base 58

The most direct approach would be to memorize the address in its base 58 form. This is a string of 34 alphanumeric characters. As I mention here, I memorize numbers using the Major system, and letters using the NATO alphabet. To distinguish lower case and upper case I have two variants of each NATO name, a large object for capital letters and a small object for lower case letters. For example, golf cart for G and golf ball for g.

Decimal

A 200-bit number has about 60 decimal digits (61 at most). It’s easier to memorize 60 digits than to memorize 34 alphanumeric characters because the digits can easily be chunked into groups whereas the alphanumeric characters cannot.

You can go back and forth between base 58 encoding of an address and hexadecimal using this online calculator and you can go between hexadecimal and decimal easily in any programming language.

For example, I typed the address

    1PMycacnJaSqwwJqjawXBErnLsZ7RkXUAs

into the online calculator and got back

   00F54A5851E9372B87810A8E60CDD2E7CFD80B6E31C7F18FE8

Then I appended 0x to this and typed it into the Python REPL to get a decimal number.

   
    >>> 0x00F54A5851E9372B87810A8E60CDD2E7CFD80B6E31C7F18FE8
    6014503356492732657644518984173176634541310227850984525800
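If you’d rather not paste an address into an online calculator, the conversion can be sketched in a few lines of Python. This is a minimal sketch rather than a vetted library: the function names `b58decode` and `b58encode` are hypothetical, the code does not verify the Base 58 Check checksum, and it assumes the example address is the 34-character string beginning with 1PM.

```python
# Base 58 alphabet: digits and letters with 0, O, I and l removed.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58decode(s: str) -> bytes:
    """Decode a base 58 string to bytes (no checksum verification)."""
    n = 0
    for ch in s:
        n = n * 58 + B58_ALPHABET.index(ch)
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    # Each leading '1' stands for one leading zero byte.
    pad = len(s) - len(s.lstrip("1"))
    return b"\x00" * pad + body

def b58encode(b: bytes) -> str:
    """Encode bytes as a base 58 string (inverse of b58decode)."""
    n = int.from_bytes(b, "big")
    out = ""
    while n > 0:
        n, r = divmod(n, 58)
        out = B58_ALPHABET[r] + out
    pad = len(b) - len(b.lstrip(b"\x00"))
    return "1" * pad + out

address = "1PMycacnJaSqwwJqjawXBErnLsZ7RkXUAs"  # assumed form of the example address
raw = b58decode(address)
print(raw.hex().upper())            # hexadecimal form
print(int.from_bytes(raw, "big"))   # decimal form
```

If the address is transcribed correctly, the hex string should match the calculator output above, and the decimal number should match the one from the REPL.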

There are two ways to group digits for memorization: regularly and irregularly.

Irregular grouping

The Major system for memorizing numbers is typically presented using irregular grouping. You convert each digit into a consonant sound and improvise a way of grouping the consonants into words. Some words may encode one or two digits, some three digits. You might get lucky occasionally and be able to group four digits into a single word. Then you create mental images that string the words together.

In the example above, you might divide the number into 60 1450 33 … which you could encode as “cheese drills mummy …” and maybe you’d imagine a wheel of cheese as a drill sergeant, barking at a mummy soldier, etc.
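As a reminder of which consonant sounds are available for each digit, here is the usual Major system table as a small lookup. The table reflects the common convention as I understand it (some people use slight variants), and the function name `sounds` is hypothetical. Choosing the actual words is still up to you.

```python
# Common Major system convention: digit -> consonant sounds.
MAJOR = {
    "0": "s, z",
    "1": "t, d",
    "2": "n",
    "3": "m",
    "4": "r",
    "5": "l",
    "6": "j, sh, ch, soft g",
    "7": "k, hard g, hard c",
    "8": "f, v",
    "9": "p, b",
}

def sounds(group: str) -> list[str]:
    """List the candidate consonant sounds for each digit in a group."""
    return [MAJOR[d] for d in group]

# The first three irregular groups from the example above.
for group in "60 1450 33".split():
    print(group, "->", " | ".join(sounds(group)))
```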

Regular grouping

It’s easy to get started with irregular grouping, but regular grouping has its advantages. You might group your 60 digits into 30 pairs or 20 triples. If you’re using two-digit groups, you might have a rule that only the first two consonant sounds count, giving yourself the freedom to use words that have more than two consonant sounds. Of course you could have an analogous rule for three digits.

If you break a 60-digit address into 20 three-digit numbers, and convert each three-digit number to a word, you have to memorize a sequence of 20 words. Choose words that are easy to visualize, and you have to remember a sequence of 20 images. That takes some effort, but people go to greater effort to save Bitcoin addresses.
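Regular grouping is easy to mechanize. The sketch below pads the decimal form of the example address with leading zeros to 60 digits and slices it into fixed-width groups. The function name `regular_groups` and the choice to pad on the left are my assumptions; you could just as well let the last group be short.

```python
def regular_groups(number: int, width: int = 3, total_digits: int = 60) -> list[str]:
    """Split a number into fixed-width digit groups, left-padding with zeros."""
    digits = str(number).zfill(total_digits)  # no truncation if the number is longer
    return [digits[i:i + width] for i in range(0, len(digits), width)]

n = 6014503356492732657644518984173176634541310227850984525800

triples = regular_groups(n, width=3)
pairs = regular_groups(n, width=2)
print(len(triples), triples[:4])
print(len(pairs), pairs[:4])
```

Each triple would then be turned into one word and each word into one image.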

Private keys

Private keys are a little longer than addresses, 256 bits rather than 200. The same principles apply. A 256-bit number corresponds to a 77 or 78 digit number. If you divide 78 digits into groups of (average) length 3, that’s 26 words to memorize. Since 26 happens to be the number of letters in the English alphabet, you could associate each word with a letter of the alphabet, say using the NATO alphabet, to make sure you remember the words in the correct order.
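Here is a small sketch of the digit count and of the letter-per-group idea. It uses a throwaway random 256-bit number, not a real key, and the padding to 78 digits is my assumption so that every group has exactly three digits.

```python
import secrets

NATO = ("Alfa Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliett Kilo Lima Mike "
        "November Oscar Papa Quebec Romeo Sierra Tango Uniform Victor Whiskey X-ray "
        "Yankee Zulu").split()

# Digit counts of the smallest and largest 256-bit numbers.
print(len(str(2**255)), len(str(2**256 - 1)))  # 77 78

key = secrets.randbits(256)        # throwaway random number, NOT a real private key
digits = str(key).zfill(78)        # pad so the digits split evenly into 26 triples
groups = [digits[i:i + 3] for i in range(0, 78, 3)]

# Pair each three-digit group with a letter to fix the order.
for letter, group in zip(NATO, groups):
    print(f"{letter:10s}{group}")
```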


[1] You can find details here. You start with the output of a 160-bit hashing algorithm, and add 40 bits of checksum and version information. The encoding is not strictly base 58 encoding but rather Base 58 Check, which at its core is base 58 encoding.

Photo by Lukas Eggers on Unsplash

Logistic / Normal approximation

In a recent post I pointed out that a soliton, a solution to the KdV equation, looks a lot like a normal density for fixed x. As someone pointed out in the comments, one way to look at this is that the soliton is exactly proportional to the density of a logistic distribution, and it’s well known that the logistic distribution is approximately normal.

Why?

Why might this approximation be useful?

You might want to approximate a normal distribution by a logistic distribution because the cumulative distribution function (CDF) of the latter is an elementary function whereas the CDF of the former is not.

You might want to approximate a logistic distribution by a normal distribution because the normal distribution has nice, well-understood theoretical properties.

How?

If you wanted to approximate a logistic distribution by a normal, or vice versa, how would you do so? How large is the error in the approximation?

This post will answer these questions for four matching methods:

  1. Moment matching
  2. 1-norm
  3. 2-norm
  4. Sup-norm

We will find the value of the logistic scale parameter s that minimizes the distance between the logistic PDF

f(x, s) = sech²(x / (2s)) / (4s)

and that of the standard normal

g(x) = (2π)^(−1/2) exp(−x²/2)

by each of the criteria above. We set the scale parameter of the normal to 1 because the ratio of the optimal logistic scale to the normal scale is constant.

Moment matching

Let s be the scale parameter for the logistic distribution and let σ be the scale parameter for the normal distribution. We will assume both distributions have mean 0.

The variance of the logistic is π² s²/3 and the variance of the normal is σ². So moment matching requires

σ = π s / √3

or

s = √3 / π

since we’re setting σ = 1.


How good is this approximation? That depends on how you measure the error, which we will explore below. We will see how it compares to the optimal solution under each criterion.

1-norm

It would be nice to calculate the 1-norm of the difference f(x, s) − g(x) and then minimize it as a function of s. But that norm cannot be computed in closed form, or at least Mathematica can’t compute it in closed form. So I found the minimum numerically.


Moment matching sets s = 0.5513 and leads to a 1-norm error of 0.1087.

The optimal value of s for the 1-norm is s = 0.6137 which yields an error of 0.0665.

2-norm

With moment matching s = 0.5513 and the 2-norm error is 0.05566.

The optimal value for the 2-norm is s = 0.61476 which yields a 2-norm error of 0.02641 (a squared error of 0.0006973).

sup-norm

The sup norm, a.k.a. the ∞ norm, measures the maximum distance between the two functions. Minimizing it is the min-max criterion.


When s = 0.5513 the sup norm is 0.0545.

The optimal value of s for the sup norm is 0.6237 and yields a sup norm error of 0.01845.
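The numbers in the three sections above can be checked approximately with nothing but the Python standard library. The sketch below is not the computation behind the original results: it swaps in a Riemann sum on a fixed grid and a golden-section search, and the grid, the search interval [0.5, 0.7], and the function names are all my assumptions. It reports the 2-norm itself, not its square.

```python
import math

H = 0.01                                    # grid spacing
XS = [i * H for i in range(-1500, 1501)]    # grid on [-15, 15]; both tails are negligible there
G = [math.exp(-x * x / 2) / math.sqrt(2 * math.pi) for x in XS]  # standard normal PDF

def logistic_pdf(x, s):
    """Logistic density with mean 0 and scale s: sech^2(x/(2s)) / (4s)."""
    return 1 / (4 * s * math.cosh(x / (2 * s)) ** 2)

def errors(s):
    """Return (1-norm, 2-norm, sup norm) of f(., s) - g, approximated on the grid."""
    d = [abs(logistic_pdf(x, s) - g) for x, g in zip(XS, G)]
    return H * sum(d), math.sqrt(H * sum(v * v for v in d)), max(d)

def golden_min(f, a, b, tol=1e-6):
    """Golden-section search; assumes f has a single minimum on [a, b]."""
    r = (math.sqrt(5) - 1) / 2
    c, d = b - r * (b - a), a + r * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                 # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - r * (b - a)
            fc = f(c)
        else:                       # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + r * (b - a)
            fd = f(d)
    return (a + b) / 2

s_mm = math.sqrt(3) / math.pi       # moment matching
mm_errors = errors(s_mm)
best = {}
for k, name in enumerate(["1-norm", "2-norm", "sup norm"]):
    s_opt = golden_min(lambda s: errors(s)[k], 0.5, 0.7)
    best[name] = (s_opt, errors(s_opt)[k])
    print(f"{name}: moment matching {mm_errors[k]:.4f}, "
          f"optimal s = {s_opt:.4f}, error {best[name][1]:.4f}")
```

Because each error curve is fairly flat near its minimum, the optimal s found this way may differ from the values quoted above in the third or fourth decimal place.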

Conclusion

We can improve on moment matching, for all three norms simultaneously, by using a larger value of s, such as 0.61.

If you have a normal(μ, σ) distribution and you want to approximate it by a logistic distribution, set the mean of the latter to μ and the scale to 0.61σ. If you care about a particular error measure, use the corresponding multiplier rather than 0.61.

If you want to approximate a logistic with mean μ and scale s by a normal, set the mean of the normal to μ and set σ = s/0.61.

Blog RSS feed

I got an email from someone saying the RSS feed for this site stopped working. Anyone else having this problem?

I subscribe to my RSS feed and it’s working fine for me. It may be that there are variations on the RSS feed, and the version I’m using works while the variation some others use is not working.

I’m subscribed to

http://www.johndcook.com/blog/feed/

and that works as far as I know.

Update: The problem may have something to do with the Firefox Livemarks plugin. If you use a different RSS reader and have been having problems, please let me know.

Update: I removed one non-ASCII character from the site and that fixed at least one person’s problem with the RSS feed.

Update: Someone said changing http to https in the feed URL fixed their problem.

By the way, if you’re a blogger, I highly recommend subscribing to your own feed. Otherwise you may not know of problems. I have emailed multiple bloggers to tell them of problems with their RSS feed that they were unaware of, such as displaying LaTeX source code rather than rendered LaTeX images, for example.

***

While you’re here, let me remind you that you can find me on Mathstodon and on the platform formerly known as Twitter.

You can also subscribe to the blog or to my monthly newsletter via email.

Executive order on differential privacy


This week President Biden signed a long, technically detailed executive order (Executive Order 14110) that among other things requires the Secretary of Commerce to look into differential privacy.

Within 365 days of the date of this order … the Secretary of Commerce … shall create guidelines for agencies to evaluate the efficacy of differential-privacy-guarantee protections, including for AI. The guidelines shall, at a minimum, describe the significant factors that bear on differential-privacy safeguards and common risks to realizing differential privacy in practice.

I doubt many people have read this order. Print preview on my laptop said it would take 64 pages to print. Those brave souls who try to read it will find technical terms like differential privacy that they likely do not understand.

So just what is differential privacy? A technical definition involves bounds on ratios of cumulative probability distribution functions, not the kind of thing you usually see in newspapers, or in executive orders.

I’ll give a layman’s overview here. If you’d like to look at the mathematical details, see this post for a gentle introduction. And if you want even more math, see this post.

What is differential privacy?

The basic idea behind differential privacy is to protect the privacy of individuals represented in a database by limiting the degree to which each person’s presence in the database can impact queries of the database.

A calibrated amount of randomness is added to the result of each query. The amount of randomness is proportional to the sensitivity of the query. For innocuous queries the amount of added randomness may be very small, maybe even less than the amount of uncertainty inherent in the data. But if you ask a query that risks revealing information about an individual, the amount of added randomness increases, possibly increasing so much that the result is meaningless.

Each question (database query) potentially reveals some information about a person, and so a privacy budget keeps track of the queries a person has posed. Once you’ve used up your privacy budget, you’re not allowed to ask any more questions. Otherwise you could ask the same question (or closely related questions) over and over, then average your results to essentially remove the randomness that was added.

Pragmatic matters

Differential privacy is great in theory, and possibly in practice too. But the practicality depends a great deal on context. For example, exactly how much noise is added to query results? That depends on the level of privacy you want to achieve, usually denoted by a parameter ε. Smaller values of ε provide more privacy, and larger values provide less.

How big should ε be? There is no generic answer. The size of ε must depend on context. Set ε too small and the utility of the data vanishes. Set ε too high and there’s effectively no privacy protection.
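To make the role of ε concrete, here is a minimal sketch of one common way to add calibrated noise, the Laplace mechanism, in which the noise scale is the query’s sensitivity divided by ε. Nothing above says which mechanism any particular system uses, so treat this as an illustration under that assumption. The function names and the count of 1000 are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Answer a counting query with noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)     # fixed seed so the sketch is repeatable
true_count = 1000          # hypothetical query result

# Smaller epsilon means more noise and more privacy.
for eps in (0.01, 0.1, 1.0):
    print(eps, round(noisy_count(true_count, eps, rng), 1))

# Why a privacy budget is needed: averaging repeated answers washes out the noise.
answers = [noisy_count(true_count, 0.1, rng) for _ in range(10_000)]
average = sum(answers) / len(answers)
print(round(average, 2))
```

The last two lines show why unlimited queries would defeat the scheme: the average of many noisy answers lands very close to the true count.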

US Census Bureau

Biden’s executive order isn’t the US government’s first foray into differential privacy. The US Census Bureau used differential privacy on results released from the 2020 census. This means that the reported results are deliberately not accurate, though hopefully the results are accurate enough, with only the minimum amount of inaccuracy injected as necessary to preserve privacy. Opinions are divided on whether that was the case. Some researchers have complained that the results were too noisy for the data they care about. The Census Bureau could reply “Sorry, but we gave you the best results we could while adhering to our privacy framework.”

Implementing differential privacy at the scale of the US Census took an enormous amount of work. The census serves as a case study that would allow other government agencies to have an idea of what they’re getting into.

Pros and cons of differential privacy

Differential privacy rests on a solid mathematical foundation. While this means that it provides strong privacy guarantees (if implemented correctly), it also means that it takes some effort to understand. Differential privacy opens up new possibilities but requires new ways of working.

If you’d like help understanding how your company could take advantage of differential privacy, or minimize the disruption of being required to implement differential privacy, let’s talk.


Country and language abbreviations

I recently had to mark a bit of German text as German in an HTML file and I wondered whether the abbreviation might be GER for German, or DEU for Deutsch.

Turns out the answer is both, almost. The language abbreviations used for HTML microdata are given in ISO 639, and they come in three-letter and two-letter varieties. The three-letter abbreviation for German is GER but the two-letter abbreviation is DE.

There are also standard two- and three-letter abbreviations for countries, given in ISO 3166. These are DE and DEU for Germany. I was curious how often a country abbreviation is also a language abbreviation.

I found text files giving the ISO 639 and ISO 3166 abbreviations, and used the comm utility to see what the intersection was.

There are 253 languages and 252 countries in the two standards. There are 110 two-letter abbreviations common to both, and 40 three-letter abbreviations common to both.

However, just because an abbreviation appears in both standards, this doesn’t mean it represents the same thing in both standards. Sometimes they overlap. For example CZE abbreviates both the Czech Republic and the Czech language. But BEL represents the nation of Belgium and the Belarusian language.
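I used comm on text files, but the same intersection can be sketched in Python with sets. The tiny lists below contain only the codes mentioned in this post, standing in for the full ISO 639 and ISO 3166 files, and the function name `common_codes` is hypothetical.

```python
def common_codes(languages, countries):
    """Return the codes that appear in both lists, ignoring case."""
    return sorted({c.upper() for c in languages} & {c.upper() for c in countries})

# Only the codes mentioned above; the real lists are much longer.
language_codes = ["ger", "de", "cze", "bel"]   # German, German, Czech, Belarusian
country_codes = ["DE", "DEU", "CZE", "BEL"]    # Germany, Germany, Czech Republic, Belgium

shared = common_codes(language_codes, country_codes)
print(shared)  # ['BEL', 'CZE', 'DE']
```

As with BEL, a shared code says nothing about whether the language and the country are related.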

Time difference

A simple question sent me down a rabbit hole this morning: what is the time difference between Houston and London?

At the moment the difference is six hours. But how will that change when Daylight Saving Time ends this year? Wait a minute, will Daylight Saving Time end this year?

I wasn’t even sure whether we were on DST until I asked my wife a few days ago. “Spring forward, Fall back. It’s been Fall for a month now. Did we fall back?”

The US Senate passed the “Sunshine Protection Act” last year that would have eliminated changing times (Yay!) by permanently staying on DST (Boo!). Whatever happened to that? Are we about to fall back or not?

Turns out the House of Representatives never passed the bill, so things will stay as they have been. Apparently the Senate didn’t seriously consider the Sunshine Protection Act. According to Wikipedia,

In 2022, the Senate passed the bill by unanimous consent, although several senators stated later that they would have objected if they had known that the bill could pass.

OK, so when does the time change in the US? It’s always a Sunday, but which one? Again according to Wikipedia,

Since 2007, in areas of Canada and the United States in which it is used, daylight saving time begins on the second Sunday of March and ends on the first Sunday of November.

Alright, now what about London? Europe has “Summer Time,” which is the same idea as Daylight Saving Time. Summer Time ends on the last Sunday in October.

At the moment, London is 6 hours ahead of Houston. This Sunday, October 29, London falls back an hour and the difference with Houston will be 5 hours. Then the following week it goes back to 6 hours.
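You can let a time zone database do this bookkeeping. The sketch below uses Python’s standard zoneinfo module and makes a few assumptions: the year is 2023 (when October 29 fell on a Sunday), Houston is represented by the America/Chicago zone, and the function name `hours_ahead` is hypothetical. It also needs time zone data to be available on your system.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def hours_ahead(day, zone_a="Europe/London", zone_b="America/Chicago"):
    """Hours by which zone_a is ahead of zone_b at local noon on the given (y, m, d)."""
    a = datetime(*day, 12, tzinfo=ZoneInfo(zone_a)).utcoffset()
    b = datetime(*day, 12, tzinfo=ZoneInfo(zone_b)).utcoffset()
    return (a - b).total_seconds() / 3600

# Before London falls back, between the two changes, and after Houston falls back.
for day in [(2023, 10, 28), (2023, 10, 30), (2023, 11, 6)]:
    print(day, hours_ahead(day))
```

Under those assumptions the three differences should come out to 6, 5, and 6 hours, matching the description above.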

In the process of looking into this, I found out that between 1941 and 1945, and again in 1947, there was something called British Double Summer Time in which clocks sprang ahead two hours.

Time zones are surprisingly complicated, though mostly for good reasons. I would not recommend abolishing time zones (except for anything done on a computer, in which using UTC internally is the only way to go). But Daylight Saving Time / Summer Time is ridiculous. It made more sense years ago than it does now. Now there is more variety in individual work schedules, and there is more need to coordinate with people outside your time zone.

Lessons from Skylab


I discovered the Space Rocket History Podcast a while back and listened to all the episodes on the Apollo program. I’m now listening to the episodes on Skylab as they come out. I came for Apollo; I stayed for Skylab. I would not have sought out the episodes on Skylab, and that would have been a shame.

Skylab is an underrated program. I wonder how many in my children’s generation even know what Skylab was. Those who do probably think of it as a brief interlude between Apollo and the Space Shuttles, or more cynically, a way to keep aerospace contractors in business after the Apollo program ended. But NASA learned a lot during the Skylab program.

The numbering of the Skylab missions is a little confusing. Skylab 1 was the mission to put the lab itself into orbit. The first manned mission to the lab was Skylab 2, the second manned mission was Skylab 3, and the final manned mission was Skylab 4.

Skylab 2 was the most dramatic Skylab mission, just as Apollo 13 was the most dramatic Apollo mission, because in both cases there were major problems to be solved. The missions were literally dramatic in that they had the story arc of a drama.

Because Skylab 2 was primarily concerned with repairs, Skylab 3 was the first mission entirely focused on planned mission objectives. The productivity of the crew started slow and steadily accelerated over the course of the mission. I recommend the Space Rocket History Podcast for the details.

NASA initially scheduled the crew of Skylab 4 to work at the pace Skylab 3 achieved by the end of that mission. This led to tension, mistakes, and falling behind schedule. Ground control and the astronauts worked through their mutual frustrations and came up with better ways of working together. The crew then exceeded the ground crew’s initial expectations. One change was to distinguish synchronous and asynchronous tasks, letting the crew decide when to do tasks that did not have to be done at a particular point in orbit. Another change was to allow the crew more rest: the crew accomplished more by working less (and making fewer mistakes).

This would make a good business school case study. In fact there was a Harvard Business School case study about Skylab 4, but it was based on false information and presumably drew the wrong lessons from the mission. The case study said the crew went on strike for a day, rebelling against ground control. The truth was that ground control lost radio contact with Skylab for one orbit, 90 minutes, due to an error.


The 19th rule of HIPAA Safe Harbor

The HIPAA Safe Harbor provision says that data can be considered deidentified if 18 kinds of data are removed or reported at low resolution. At the end of the list of 18 items, there is an extra category, sometimes informally called the 19th rule:

The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.

So if you otherwise meet the letter of the Safe Harbor provision, but you know (or should know) that the data can still be used to identify people represented in the data, then Safe Harbor does not apply.

The Department of Health and Human Services guidance document gives four examples of “when a covered entity would fail to meet the ‘actual knowledge’ provision.” The first concerns a medical record that would reveal someone’s identity by revealing their profession.

Revealing that someone is a plumber would probably not be a privacy risk, but in the HHS example someone’s occupation was listed as a former state university president. If you know what state this person is in, that greatly narrows down the list of possibilities. One more detail, such as age, might be enough to uniquely identify this person.

Free text fields, such as physician notes, could easily contain this kind of information. Software that removes obvious names won’t catch this kind of privacy leak.

Not only are intentional free text fields a problem, so are unintentional free text fields. For example, a database field labeled CASENOTES is probably intended to contain free text. But other text fields, particularly if they are wider than necessary to contain the anticipated data, could contain identifiable information.

If you have data that does not fall under the Safe Harbor provision, or if you are not sure the Safe Harbor rules are enough to ensure that the data are actually deidentified, let’s talk.


This post is not legal advice. My clients are often lawyers, but I am not a lawyer.

Bluesky

I saw a comment from Christos Argyropoulos on Twitter implying that there’s a good scientific community on Bluesky, so I went there and looked around a little bit. I have an account, but I haven’t done much with it. I was surprised that a fair number of people had followed me on Bluesky even though I had only posted twice. I posted a couple links this evening, doubling my total activity on Bluesky.

I don’t know what I’ll do with my Bluesky or Mastodon accounts. I certainly will not try to replicate what I built on Twitter. So far I’m more of a reader than a writer on Bluesky and Mastodon. Bluesky is not a science-focused social network, but I may use it for that, only following science-oriented accounts there. We’ll see.

You can always find me here, whether or not you can find me on Bluesky, Mastodon, or Twitter. You can subscribe to this site to get notifications of new posts via RSS or email, and I also have a monthly newsletter where I post blog highlights.