Code to convert words to Major system numbers

A few days ago I wrote about using the CMU Pronouncing Dictionary to search for words that decode to certain numbers in the Major mnemonic system. You can find a brief description of the Major system in that post.

As large as the CMU dictionary is, it did not contain words mapping to some three-digit numbers, so it would be good to explore a larger, or at least different, dictionary. But the CMU dictionary is apparently the largest dictionary with pronunciation openly available.

To get more pronunciation data, you’ll need to generate it. This is what linguists call the grapheme-to-phoneme problem. There are software packages that create phonetic spellings using large neural network models, including models trained on the CMU data.

Why quick-and-dirty is OK

However, it’s possible to do a good enough job with much simpler software. There are several reasons why we don’t need the sophistication of research software. First and foremost, we can tolerate errors. If we get a few false positives, we can skim through those and ignore them. And if we get a few false negatives, that’s OK as long as we find a few of the words we’re looking for.

Another thing in our favor is that we’re not looking for pronunciation per se, only the numbers generated from the pronunciation. The hardest part of the grapheme-to-phoneme problem is vowel sounds, and we don’t care about vowel sounds at all. Nor do we care about distinguishing, for example, between the voiced and unvoiced variations on the th sound, because they both map to 1.
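To make the point concrete, here is a minimal sketch of how pronunciations collapse to numbers. It assumes the ARPAbet phoneme symbols used by the CMU dictionary. The names `PHONEME_DIGITS` and `phonemes_to_number` are made up for illustration, and the treatment of NG as 2 and ER as 4 is my assumption rather than a fixed part of the Major system.

```python
# Major-system digits for ARPAbet consonant phonemes.
# Assumptions: NG is treated like n (2) and the r-colored vowel ER like r (4).
PHONEME_DIGITS = {
    "S": "0", "Z": "0",
    "T": "1", "D": "1", "TH": "1", "DH": "1",  # unvoiced and voiced th both give 1
    "N": "2", "NG": "2",
    "M": "3",
    "R": "4", "ER": "4",
    "L": "5",
    "CH": "6", "JH": "6", "SH": "6", "ZH": "6",
    "K": "7", "G": "7",
    "F": "8", "V": "8",
    "P": "9", "B": "9",
}

def phonemes_to_number(phonemes):
    """Map a list of ARPAbet phonemes to Major-system digits, skipping vowels."""
    digits = []
    for p in phonemes:
        p = p.rstrip("012")  # drop the stress digit attached to vowels
        digits.append(PHONEME_DIGITS.get(p, ""))  # anything not in the table contributes nothing
    return "".join(digits)

print(phonemes_to_number(["TH", "IH1", "N"]))  # thin -> 12
print(phonemes_to_number(["DH", "EH1", "N"]))  # then -> 12
```

The vowels could be completely wrong and the output would not change, which is why a crude guess at pronunciation is good enough here.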

Code

The Major mnemonic system is based on pronunciation, not spelling. Nevertheless, you can do a rough-and-ready conversion, adequate for our purposes, based on spelling. I take into account a minimal amount of context, such as noting that c is soft before i, e, and y, but hard before a, o, and u. The handling of ch is probably the biggest source of errors because the sound of ch depends on etymology.

I wrote this as a Python script initially because I wanted to share it with someone who knows Python. But I’ll present it here in Perl because the Perl code is much more compact.

sub word2num {
    local $_ = shift;
    
    tr/A-Z/a-z/; # lower case
    
    s/ng/n/g;
    s/sch/j/g;
    s/che/k/g;
    s/[cs]h/j/g;
    s/g[iey]/j/g; # soft g -> j
    s/c[eiy]/s/g; # soft c -> s
    s/c[aou]/k/g; # hard c -> k
    s/ph/f/g;
    s/([bflmprv])\1+/$1/g; # condense double letters
    s/qu/k/g;
    s/x/ks/g;

    tr/szdtnmrljgkfvpb/001123456778899/; # consonants -> digits
    tr/a-z//d; # remove remaining letters
    
    return $_
}

Perl has implicit variables, for better and for worse, and here it’s for the better. All the translation (tr//) and substitution (s//) operators act in place on the implicit variable $_, which holds the word passed to the function.

The corresponding Python code is more verbose:

def word2num(w):

    w = w.lower()

    w = w.replace('ng', 'n')
    w = w.replace('sch', 'j')
    w = w.replace('che', 'k')

    for x in ['gi', 'ge', 'gy', 'ch', 'sh']:
        w = w.replace(x, 'j')

    ...

The order of the replacement statements matters. For example, you want to decide whether c and g are hard before you discard the vowels.
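For readers who want something runnable, here is a sketch of a complete Python version. It is a line-by-line transliteration of the Perl rules, not the elided original script. The `RULES` and `DIGITS` names are my own, and it maps t to 1 alongside d, as the Major system requires.

```python
import re

# Spelling rules, applied in this order; each entry is (regex, replacement).
RULES = [
    (r"ng", "n"),
    (r"sch", "j"),
    (r"che", "k"),
    (r"[cs]h", "j"),
    (r"g[iey]", "j"),            # soft g -> j
    (r"c[eiy]", "s"),            # soft c -> s
    (r"c[aou]", "k"),            # hard c -> k
    (r"ph", "f"),
    (r"([bflmprv])\1+", r"\1"),  # condense double letters
    (r"qu", "k"),
    (r"x", "ks"),
]

# Consonant letters to Major-system digits.
DIGITS = str.maketrans("szdtnmrljgkfvpb", "001123456778899")

def word2num(word):
    w = word.lower()
    for pattern, replacement in RULES:
        w = re.sub(pattern, replacement, w)
    w = w.translate(DIGITS)
    return re.sub(r"[a-z]", "", w)  # remove remaining letters (vowels, h, w, y, ...)

print(word2num("cat"))      # 71
print(word2num("giraffe"))  # 648
print(word2num("hammer"))   # 34
```

Keeping the rules in a list makes the ordering explicit: the soft and hard c and g rules need to see the following vowel, so they have to run before the final step strips the letters.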

This script works better than I expected for such a dirty hack. I ran it on some large word lists, looking for words for the three-digit numbers that had no match when the script processed the CMU dictionary. I list a few of the words I found here. The most amusing find was phobophobia, the fear of phobias, for 8989.

Aside from filling in gaps in three-digit numbers, you could also use a script like this to search for mnemonic words in specialized lists of words, such as baseball players, or animal species, or brand names.
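A search like that amounts to grouping a word list by the number each word encodes. The following is a rough sketch under the same spelling rules as above. The `index_by_number` helper and the sample animal list are hypothetical, and the converter is repeated in compact form only so the block runs on its own.

```python
import re
from collections import defaultdict

# Compact copy of the spelling rules so this block is self-contained.
RULES = [("ng", "n"), ("sch", "j"), ("che", "k"), ("[cs]h", "j"),
         ("g[iey]", "j"), ("c[eiy]", "s"), ("c[aou]", "k"), ("ph", "f"),
         (r"([bflmprv])\1+", r"\1"), ("qu", "k"), ("x", "ks")]
DIGITS = str.maketrans("szdtnmrljgkfvpb", "001123456778899")

def word2num(word):
    w = word.lower()
    for pattern, replacement in RULES:
        w = re.sub(pattern, replacement, w)
    # Keep only the digits produced by the translation step.
    return re.sub(r"[^0-9]", "", w.translate(DIGITS))

def index_by_number(words):
    """Group words by the number they encode."""
    index = defaultdict(list)
    for word in words:
        index[word2num(word)].append(word)
    return index

# Hypothetical sample list; in practice, read one word per line from a file.
animals = ["lion", "llama", "mole", "mule", "raven", "heron"]
index = index_by_number(animals)

print(index["35"])   # ['mole', 'mule']
print(index["482"])  # ['raven']
```

Because the conversion is spelling-based, the candidates for any number still need a quick skim to weed out words whose pronunciation doesn't match.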

One thought on “Code to convert words to Major system numbers”

  1. `local $_ = shift;` might be better. Without `local`, that line overwrites the contents of the global `$_` which might be in use by whatever calls the subroutine. A `local` causes the previous contents to be restored after the `sub` `return`s.

    A good habit even if it’s not a problem in this short code.
