Unique letter patterns in words

Posted on 9 March 2024 by John

The word Mississippi has a unique pattern of letters. If you were solving a cryptogram puzzle and saw ZVFFVFFVCCV you might guess that the word is Mississippi.

Is the pattern of letters in Mississippi literally unique or just uncommon? What is the shortest word with a unique letter pattern? The longest word?

We can answer these questions by looking at normalized cryptograms, a sort of word signature. These are formed by replacing the first letter in a word with ‘A’, the next unique letter with ‘B’, etc. The normalized cryptogram of Mississippi is ABCCBCCBDDB.

The set of English words is fuzzy, but for my purposes I will take “words” to mean the entries in the dictionary file american-english on my Linux box, removing words that contain apostrophes or triple letters. I computed the cryptogram of each word, then looked for those that only appear once.
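The signature computation described above can be sketched in a few lines of Python. This is an illustration only, not the script used for this post: the function names and the toy word list are assumptions made for the example, and a real run would read the dictionary file instead.

```python
from collections import defaultdict
from string import ascii_uppercase

def normalized_cryptogram(word):
    """Replace each new letter, in order of first appearance, with A, B, C, ..."""
    mapping = {}
    for letter in word.lower():
        if letter not in mapping:
            mapping[letter] = ascii_uppercase[len(mapping)]
    return "".join(mapping[letter] for letter in word.lower())

def unique_patterns(words):
    """Return the words whose normalized cryptogram is shared with no other word."""
    groups = defaultdict(list)
    for w in words:
        groups[normalized_cryptogram(w)].append(w)
    return sorted(ws[0] for ws in groups.values() if len(ws) == 1)

# Hypothetical toy word list, standing in for a real dictionary file.
sample = ["mississippi", "eerie", "hello", "jelly", "cat", "dog"]
print(normalized_cryptogram("Mississippi"))
print(unique_patterns(sample))
```

In the toy list, hello and jelly share a pattern, as do cat and dog, so only the other two words are reported as unique. The sketch assumes the list has no repeated entries.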

Relative to the list of words I used, yes, Mississippi is unique.

The shortest word with a unique cryptogram is eerie. [1]

The longest word with a unique cryptogram is ambidextrously. Every letter in this 14-letter word appears only once.

[1] Update: eerie is a five-letter example, but there are more. Jack Kennedy pointed out amass, llama, and mamma in the comments. I noticed eerie because its cryptogram comes first in alphabetical order.

Normal subgroups are subtle

Posted on 5 March 2024 by John

The definition of a subgroup is obvious, but the definition of a normal subgroup is subtle.

Widgets and subwidgets

The general pattern of widgets and subwidgets is that a widget is a set with some kind of structure, and a subwidget is a subset that has the same structure. This applies to vector spaces and subspaces, manifolds and submanifolds, lattices and sublattices, etc. Once you know the definition of a group, you can guess the definition of a subgroup.

But the definition of a normal subgroup is not something anyone would guess immediately after learning the definition of a group. The definition is not difficult, but its motivation isn’t obvious.

Standard definition

A subgroup H of a group G is a normal subgroup if for every g ∈ G,

g⁻¹Hg = H.

That is, if h is an element of H, g⁻¹hg is also an element of H. All subgroups of an Abelian group are normal because not only is g⁻¹hg also an element of H, it’s the same element of H, i.e. g⁻¹hg = h.

Alternative definition

There’s an equivalent definition of normal subgroup that I only ran across recently in a paper by Francis Masat [1]. A subgroup H of a group G is normal if for every pair of elements a and b such that ab is in H, ba is also in H. With this definition it’s obvious that every subgroup of an Abelian group is normal because ab = ba for any a and b.

It’s an easy exercise to show that Masat’s definition is equivalent to the usual definition. Masat’s definition seems a little more motivated. It’s requiring some vestige of commutativity. It says that a subgroup H of a non-Abelian group G has some structure in common with subgroups of Abelian groups if this weak replacement for commutativity holds.
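As a sanity check rather than a proof, here is a brute-force comparison of the two definitions on the symmetric group S3, the smallest non-Abelian group. The representation of permutations as tuples and all function names are choices made for this sketch.

```python
from itertools import combinations, permutations

def compose(p, q):
    """Return the permutation x -> p[q[x]]."""
    return tuple(p[q[x]] for x in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)

G = list(permutations(range(3)))  # the six elements of S3
identity = tuple(range(3))

def is_subgroup(H):
    # For a finite subset containing the identity, closure under the product suffices.
    return identity in H and all(compose(a, b) in H for a in H for b in H)

def normal_standard(H):
    """Standard definition: g^-1 h g is in H for every g in G and h in H."""
    return all(compose(compose(inverse(g), h), g) in H for g in G for h in H)

def normal_masat(H):
    """Masat's definition: whenever ab is in H, ba is in H as well."""
    return all(compose(b, a) in H for a in G for b in G if compose(a, b) in H)

# Enumerate all nonempty subsets of S3 and keep the ones that are subgroups.
subgroups = [set(c) for r in range(1, 7) for c in combinations(G, r)
             if is_subgroup(set(c))]

for H in subgroups:
    print(len(H), normal_standard(H), normal_masat(H))
```

The two tests agree on every subgroup of S3. The three subgroups of order 2 fail both, and the remaining subgroups pass both.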

Emails moved to Substack

Posted on 2 March 2024 by John

Until recently I used two email services: one to send out daily blog post announcements and another for monthly blog highlights. I’ve combined these into one Substack account for weekly blog highlights.

Apparently readers really like this move. Daily and monthly email subscriptions flatlined some time ago, but Substack subscriptions are going up steadily.

Substack is a kind of hybrid of RSS and Twitter. Like RSS, you can subscribe to articles. But like Twitter (and Mastodon, and many other platforms) you can also have a timeline of brief messages, which Substack calls notes.

Only articles trigger an email. You have to use the Substack app to see notes. I think this will work out well. My plan for now is to write a Substack article about once a week with blog highlights, and use notes to announce posts as they come out.

Music of the spheres

Posted on 28 February 2024 by John

The idea of “music of the spheres” dates back to the Pythagoreans. They saw an analogy between orbital frequency ratios and musical frequency ratios.

HD 110067 is a star 105 light years away that has six known planets in orbital resonance. The orbital frequencies of the planets are related to each other by small integer ratios.

The planets, starting from the star, are labeled b, c, d, e, f, and g. In 9 “years”, from the perspective of g, the planets complete 54, 36, 24, 16, 12, and 9 orbits respectively. So the ratios of orbital frequencies between consecutive planets are all either 3:2 or 4:3. In musical terms, these ratios are fifths and fourths.
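The ratios are easy to confirm with exact rational arithmetic. The orbit counts below are the ones quoted above; the variable names are just for this sketch.

```python
from fractions import Fraction

# Orbits completed by planets b through g during 9 orbits of g.
orbits = [54, 36, 24, 16, 12, 9]

# Frequency ratio of each planet to the next planet out.
ratios = [Fraction(inner, outer) for inner, outer in zip(orbits, orbits[1:])]

for name, r in zip("bcdef", ratios):
    print(f"{name}: {r.numerator}:{r.denominator}")
```

The first three ratios reduce to 3:2 and the last two to 4:3.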

In the chord below, the musical frequency ratios are the same as the orbital frequency ratios in the HD 110067 system.

Here’s what the chord sounds like on a piano:

hd11067.wav

Substack replacing email subscription

Posted on 27 February 2024 by John

The service that sent out my email to blog subscribers stopped working a couple weeks ago, and I’m trying out Substack as a replacement. You can find my Substack account here.

My plan for now is to use this account to make blog post announcements, maybe once a week, with a little introductory commentary for each link. I expect to adjust course in response to feedback. Maybe I’ll write some posts just on Substack, but for now I intend to use it as a place to post blog round-ups.

The Substack is free and I have no intention to ever charge for it. It’s possible I might make some premium add-on in the future, but I doubt it. If I did, the blog post round-up would remain free.

A very accurate logarithm approximation

Posted on 24 February 2024 by John

The previous post looked at an efficient way to approximate nth roots of fractions near 1 by hand. This post does the same for logarithms.

As before, we assume x = p/q and define

s = p + q
d = p − q

Because we’re interested in values of x near 1, d is small, and small numbers are convenient to work with by hand.

In [1] Kellogg gives the approximation

log x ≈ 3(x² − 1)/((x+ 1)² + 2x) = 6ds/(3s² − d²)

So, for example, suppose we wanted to take the natural log of 7/8. Then p = 7, q = 8, s = 15, and d = −1.

log x ≈ (6×15×(−1))/(3×225 − 1) = − 90/674 = − 45/337.

This approximation is good to six decimal places.
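The hand calculation can be checked with exact fractions. The function name kellogg_log is made up for this sketch; the formula is the s and d form given above.

```python
from fractions import Fraction
from math import log

def kellogg_log(p, q):
    """Approximate log(p/q) as 6ds/(3s^2 - d^2), with s = p + q and d = p - q."""
    s = p + q
    d = p - q
    return Fraction(6 * d * s, 3 * s * s - d * d)

approx = kellogg_log(7, 8)
print(approx)                     # exact rational value of the approximation
print(float(approx), log(7 / 8))  # compare with the library logarithm
```

The fraction comes out in lowest terms because Fraction reduces automatically.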

Kellogg claims that

This value of E [the natural logarithm], if q [what I’ve called x] be between .9 and 1.1, is true to the seventh decimal.

He then goes on to explain how to create an even more accurate approximation, and how to deal with larger values of x.

Here’s a plot verifying Kellogg’s claim.

Note that the scale of the plot is 10⁻⁸. As the flat spot in the middle suggests, you get even more decimal places for x closer to 1.
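The error can also be checked numerically. This sketch uses the form of the approximation written in terms of x and samples the interval from 0.9 to 1.1 on an evenly spaced grid. The grid size is arbitrary, and the function name is an assumption of the example.

```python
from math import log

def kellogg(x):
    """Kellogg's approximation to log(x), written directly in terms of x."""
    return 3 * (x * x - 1) / ((x + 1) ** 2 + 2 * x)

# Evenly spaced sample points covering [0.9, 1.1].
n = 2000
xs = [0.9 + 0.2 * k / n for k in range(n + 1)]

max_error = max(abs(kellogg(x) - log(x)) for x in xs)
print(max_error)
```

On this grid the largest error stays below 10⁻⁷ and occurs at the left end of the interval.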

[1] Ansel N. Kellogg. Empirical formulæ for Approximate Computation. The American Mathematical Monthly. February 1897, Vol. 4 No. 2, pp. 39–49.

Almost ASCII

Posted on 23 February 2024 by John

I was working recently with a gigabyte file that had a dozen non-ASCII characters. This is very common. The ASCII character set is not quite big enough for a lot of tasks. Of course it’s completely inadequate if you’re writing Japanese, but it’s almost enough for documents written in English and a few other languages.

Efficient encoding

The world has standardized on Unicode as the way to represent characters across languages. Unicode currently has around 150,000 characters, far more than ASCII’s 128 characters.

But there’s a problem. Since 150,000 > 2¹⁶, it takes more than two bytes (eight bits to a byte) to represent each of 150,000 things. If you use three bytes to represent each character, every file that is almost all ASCII will get three times bigger. If you limit yourself to the most frequently used Unicode characters, those that can be represented with two bytes (the “basic multilingual plane”), then you still double the size of files.

Enter UTF-8, a brilliant solution to this problem. The UTF-8 encoding of an ASCII file is an ASCII file. Pure ASCII files don’t get any larger when interpreted as UTF-8 encoded Unicode. Because 128 = 2⁷, a byte representing an ASCII character has one unused bit. UTF-8 uses this unused bit to signal that what follows is not ASCII. I wrote about the full details here.

Unicode characters outside the ASCII range take 2, 3, or 4 bytes to represent. Inserting a small number of non-ASCII characters into a UTF-8 encoded Unicode file hardly changes the file’s size.
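Python’s built-in codecs make the byte counts easy to see. The sample characters below are arbitrary choices, one for each possible encoded length.

```python
# One sample character for each UTF-8 encoded length:
# Latin capital A, e with acute accent, euro sign, grinning face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} takes {len(encoded)} byte(s): {encoded.hex()}")

# A pure ASCII string has exactly the same bytes under both encodings.
text = "plain ASCII text"
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)
```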

Troubleshooting

I mentioned at the top that I had a gigabyte file with a dozen non-ASCII characters. The command file -I reported the file encoding to be ASCII, because the vast majority of the file was ASCII. But the non-ASCII characters were not valid Unicode characters either.

These invalid Unicode characters would display as �, which is not actually in the file. The � is a valid Unicode character for representing an invalid Unicode character.

Some of the non-ASCII characters were extended ASCII (Windows 1252) characters, but if I remember correctly even that didn’t account for everything. Some of the odd characters were simply file corruption.

It’s kinda interesting how some tools are robust to these kinds of glitches and some are not. My first clue that something funny was going on was when sort refused to sort. I ran a Python script that helps me fix wonky text files and it threw an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 222662: invalid start byte

This may seem like gibberish, but it actually says exactly what’s going on. There was an error interpreting the file as Unicode, because 0x92 is not a valid way to start a non-ASCII character in UTF-8.

The first bit of an ASCII character is 0. The first two bits of a non-ASCII character in UTF-8 are 11. But 9 is 1001 in binary, i.e. it starts with 10, and so the byte 0x92 is neither an ASCII character nor the beginning of a UTF-8 non-ASCII sequence of bytes. More details here.
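The leading-bit rules can be written as a small classifier. This is a simplified sketch that looks only at the leading bits of a single byte. It does not implement the other checks a real UTF-8 decoder performs, such as rejecting overlong encodings, and the function name is made up for the example.

```python
def classify_byte(b):
    """Classify a byte by its leading bits, following the UTF-8 layout."""
    if b >> 7 == 0b0:
        return "ASCII"
    if b >> 6 == 0b10:
        return "continuation byte"
    if b >> 5 == 0b110:
        return "start of 2-byte sequence"
    if b >> 4 == 0b1110:
        return "start of 3-byte sequence"
    if b >> 3 == 0b11110:
        return "start of 4-byte sequence"
    return "invalid in UTF-8"

# Show the bits of 0x92 alongside its classification.
print(f"{0x92:08b}", classify_byte(0x92))
```

A byte beginning with 10 can only appear after a start byte, which is why a decoder that finds 0x92 where a new character should begin reports an invalid start byte.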

Removing non-ASCII characters

For my application I could just remove the invalid characters using iconv with the -c option.

iconv -c -f CP1252 -t UTF-8 inputfile > outputfile

If you need to salvage troublesome characters then things are a little more complicated. The iconv utility will work if you know what the intended encoding was. If you don’t know the intended encoding, you may need to do some detective work.
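One simple form of detective work is to try a few candidate encodings and see which ones decode the bytes without error. The sketch below is one possible approach, not the script mentioned earlier. The function name, the candidate list, and the sample bytes are all assumptions made for the example.

```python
def try_encodings(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return {encoding: decoded text} for each candidate that decodes cleanly."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            pass  # this candidate cannot explain the bytes
    return results

# Hypothetical sample containing the byte 0x92 from the error message above.
raw = b"it\x92s"
decoded = try_encodings(raw)
for enc, text in decoded.items():
    print(enc, repr(text))
```

Here UTF-8 is ruled out, and Windows 1252 turns 0x92 into a right single quotation mark, which is a plausible reading. Note that latin-1 accepts every byte sequence, so a clean decode under latin-1 is not evidence of anything; a human still has to judge which result looks like the intended text.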

Additive functions

Posted on 20 February 2024 by John

A function f from positive integers to real numbers is defined to be additive if for relatively prime numbers m and n,

f(mn) = f(m) + f(n).

The function f is called completely additive if the above holds for all positive integers m and n, i.e. we drop the requirement that m and n are relatively prime.

Example: total prime factors

One example of an additive function is the function Ω(n) defined to be the number of prime factors of n, counted with multiplicity. For example, Ω(12) = 3 because 12 = 2 × 2 × 3. The numbers 10 and 63 are relatively prime, and

Ω(630) = 5 = Ω(10) + Ω(63).

Example: distinct prime factors

Another example of an additive function is ω(n) defined to be the number of distinct prime factors of n, i.e. not counting with multiplicity. So, for example, ω(12) = 2.

This function is additive but not completely additive because, for example,

ω(20) = 2 ≠ ω(2) + ω(10) = 3
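Both functions are easy to compute for small numbers by trial division. The following sketch uses made-up function names and is meant only to reproduce the examples above, not to be efficient for large n.

```python
def prime_factors(n):
    """Return the prime factors of n with multiplicity, by trial division."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)  # whatever remains is a prime factor
    return factors

def big_omega(n):
    """Number of prime factors counted with multiplicity."""
    return len(prime_factors(n))

def little_omega(n):
    """Number of distinct prime factors."""
    return len(set(prime_factors(n)))

print(big_omega(630), big_omega(10) + big_omega(63))
print(little_omega(20), little_omega(2) + little_omega(10))
```

The first line prints two equal numbers, since 10 and 63 are relatively prime. The second prints two different numbers, since 2 and 10 share a factor.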

A theorem of Erdős

Here is a remarkable theorem due to Paul Erdős [1]. Suppose f is an additive function such that

f(n + 1) − f(n)

converges to zero as n goes to infinity. Then

f(n) = c log(n)

for some constant c. And since a multiple of a logarithm is a logarithm to a different base, we can restate the conclusion by simply saying f is a logarithm.

Logarithms are completely additive functions, so even though we only assumed f was additive, this combined with the limit condition proves that in fact f is completely additive.

[1] Paul Erdős, “On the distribution function of additive functions,” Ann. of Math., Vol. 47 (1946), pp. 1–20.

Security by obscurity

Posted on 15 February 2024 by John

Security-by-obscurity is a bad idea in general. It’s better, for example, to have a login page than to give your site an obscure URL. It’s better to encrypt a file than to hide it in some odd directory. It’s better to use a well-vetted encryption algorithm than to roll your own.

There are people whose knee-jerk reaction to any form of obscurity is to shout “That’s security-by-obscurity!” but obscurity can be subtle.

All else being equal, adding a layer of obscurity doesn’t hurt. For example, you can literally make a public encryption key public, as I’ve done here. But for extra security, why distribute your encryption key more widely than necessary? And if your message is adequately encrypted, you could in principle publish it for the world to see. But why not just give it to the intended recipient?

The public key on my site is there for strangers to contact me, but if I were really concerned about secure communication between colleagues, I’d just circulate the key among those colleagues. That may not be much more secure, but surely it’s no less secure. And I’d share messages privately, even though they are encrypted.

It’s good to look closely at any argument that begins “all else being equal” to see if all else is indeed equal. A more nuanced objection to security-by-obscurity is that it can create a false sense of security.

One could argue, for example, that making your public key available to the world forces you to be more careful about your encryption. Maybe you’ve been using an RSA key for years, and you really should use a longer key, but you don’t because you can argue that not many people have your public key anyway. But if your key’s too short, obscuring your public key doesn’t help.

And while it’s better to deliver encrypted messages privately, it helps to not count on this, to assume that the encrypted message might be made public. That’s the basic premise behind encryption.

The principle behind no-security-by-obscurity is that you want to concentrate your security where it can be quantified. You can, for example, quantify how much more effort it would take to break a 64-bit key (like Blowfish) than a 56-bit key (like DES). Or even better, a 128-bit key (like AES). But you can’t quantify the level of protection that comes from obscurity.
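As a rough illustration of what “quantify” means here, suppose the only attack is exhaustive search over all possible keys. That is an assumption of this sketch, not a statement about any particular cipher. Under that assumption each extra key bit doubles the work.

```python
def brute_force_ratio(bits_a, bits_b):
    """How many times larger the keyspace of a bits_a-bit key is than that of a bits_b-bit key."""
    return 2 ** (bits_a - bits_b)

# Key sizes taken from the comparison in the text.
print(brute_force_ratio(64, 56))
print(brute_force_ratio(128, 56))
```

Going from 56 to 64 bits multiplies the search space by 256, and going from 56 to 128 bits multiplies it by 2⁷².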

Is it more secure to give someone a 56-bit DES key on a flash drive in a dark alley than to send them a 64-bit Blowfish key over SMS? You can’t calculate an answer to that question.

In some sense all security is by obscurity. Cryptography literally means hidden writing. But all else being equal—there’s that phrase again—you want to minimize the surface area of what you have to obscure, e.g. limiting your secret to your key and not your algorithm, and it’s better to have quantified risks than unquantified risks. But all else is often not equal, and there are difficult trade-offs.

How much metadata is in a photo?

Posted on 13 February 2024 by John

A few days ago I wrote about the privacy implications of metadata in a PDF. This post will do the same for photos.

Dalek on a Seattle train

You can see the metadata in a photo using exiftool. By default cameras include time and location data. I ran this tool on a photo I took in Seattle a few years ago when I was doing some work for Amazon. The tool reported 114 fields, some of which are redundant. Here is some of the information contained in the metadata.

GPS Altitude  : 72.5 m Above Sea Level
GPS Date/Time : 2017:05:05 17:47:33.31Z
GPS Position  : 47 deg 36' 39.71" N, 122 deg 19' 59.40" W
Lens ID       : iPhone SE back camera 4.15mm f/2.2

How finely does this specify the location? The coordinates are given to 1/100 of a second, so 1/360000 of a degree. A degree of latitude is 111 km, so the implied accuracy is on the order of 30 cm or one foot, whether that’s correct or not.
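The arithmetic can be spelled out as follows. The coordinates are the ones reported above, and the figure of 111 km per degree is the approximation used in the text. The helper function name is made up for this sketch.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value

lat = dms_to_decimal(47, 36, 39.71, "N")
lon = dms_to_decimal(122, 19, 59.40, "W")

KM_PER_DEGREE_LATITUDE = 111       # approximate figure used in the text
resolution_deg = 1 / (3600 * 100)  # one hundredth of an arc second, in degrees
resolution_m = KM_PER_DEGREE_LATITUDE * 1000 * resolution_deg

print(round(lat, 6), round(lon, 6))
print(round(resolution_m, 2), "meters")
```

This uses the length of a degree of latitude. A degree of longitude is shorter away from the equator, so the east-west resolution implied by the same number of digits is somewhat finer.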

You can look up that ground level at that location is 46 meters above sea level, which would imply the photo was taken on the 8th floor of a building. (It clearly wasn’t. Either the elevation of ground level or the elevation recorded in the phone isn’t correct.)

When I cropped the image, the metadata of the edited image recorded the software and operating system used to edit it.

Platform    : Linux
Software    : GIMP 2.10.30
Modify Date : 2024:02:13 08:39:49

This shows that I edited the image this morning using GIMP installed on a Linux box.

You can change your phone’s settings to not include location data in photos. If you do, the photos may still include the time zone, which is a weak form of location data. You can remove some or all the metadata later using image editing software, but by default a photo reveals more than you may intend.
