Generating and inspecting an RSA private key

In principle you generate an RSA key by finding two large prime numbers, p and q, and computing n = pq. You could, for example, generate random numbers by rolling dice, then type the numbers into Mathematica to test each for primality until you find a couple of prime numbers of the right size.

In practice you’d use a specialized program to find the primes and to wrap everything up in a format that software using the keys can understand. There are a lot of layers between the numbers p and q and the file that key-generating software produces, and this post aims to peel back these layers a bit.

Here’s an example of generating a private key taken from The OpenSSL Cookbook.

    openssl genpkey -out fd.key -algorithm RSA \
      -pkeyopt rsa_keygen_bits:2048 -aes-128-cbc

The genpkey command can generate keys for several kinds of public key algorithms. The option -algorithm RSA tells it that we want an RSA key, but we could have asked for an elliptic curve key. As noted in the previous post, in practice public key encryption is used to transfer symmetric encryption keys, not messages per se. The flag -aes-128-cbc tells the software that we’d like to encrypt the private key using AES with a 128-bit key in CBC (cipher block chaining) mode.

When you press enter you’ll see a flurry of dots and plus signs that show the progress of the software in generating and testing candidates for the primes p and q. Then you’ll be prompted for a password to encrypt the private key you’ve just created.

If you open the fd.key file you won’t see much:

    % cat fd.key
    -----BEGIN ENCRYPTED PRIVATE KEY-----
    MIIFLTBXBgkqhkiG9w0BBQ0wSjApBgkqhkiG9w0BBQwwHAQIdCZSKfkqh6kCAggA
    MAwGCCqGSIb3DQIJBQAwHQYJYIZIAWUDBAECBBAqbtHXkZ+uqa3rvj6qKqbRBIIE
    ...
    U6QCPcWukFyUAghHdTfjKgoAEXfOEunALoaTF6LMPsd6
    -----END ENCRYPTED PRIVATE KEY-----

This is just base64-encoded data.

The data is encoded in two senses. It is encoded in a non-secret way, expressed in a standardized data structure, then encoded in the sense of being encrypted. The openssl command pkey will undo both levels of encoding to let us see the contents of the file.
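
The command looks something like this; it will prompt for the password used to encrypt the key.

    openssl pkey -in fd.key -text -noout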

Here’s what this produces.

    Private-Key: (2048 bit, 2 primes)
    modulus:
        00:a7:b8:39:80:0b:18:d9:db:c1:a3:c1:3a:92:89:
        ...
        7a:c5
    publicExponent: 65537 (0x10001)
    ...
    prime1:
        00:dc:8c:27:e6:7f:1c:11:d4:9c:8c:33:bf:07:57:
        ...
        97:5f:8c:4c:44:23:d2:85:f9
    prime2:
        00:c2:ae:20:80:87:da:d0:a1:66:8f:2e:90:7c:ae:
        ...
        9c:e9:8a:8b:bc:c7:71:de:2d
    ...

The public exponent is the default value 65537. (More on that here.)

The large numbers are displayed in hexadecimal with colons separating pairs of hex digits. If you remove the colons and concatenate everything together, you can verify that the number called modulus is indeed the product of the numbers called prime1 and prime2. I verified this for the output above using a little Python code:

    modulus = 0xa7b839...c5
    prime1  = 0xdc8c27...f9
    prime2  = 0xc2ae20...2d
    assert prime1 * prime2 == modulus

The file also contains four numbers that require more explanation: privateExponent, exponent1, exponent2, and coefficient. The privateExponent is described here. The remaining numbers are not strictly necessary for RSA but are used in Garner’s algorithm for more efficient decryption: exponent1 is the private exponent reduced mod p − 1, exponent2 is the private exponent reduced mod q − 1, and coefficient is the inverse of q mod p.
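
Here’s a toy-sized sketch of Garner’s algorithm with made-up numbers (nothing from the key above) showing how these four numbers are used in decryption:

    # Toy example; a real 2048-bit key uses 1024-bit primes.
    p, q = 61, 53                      # prime1, prime2
    n, e, d = p*q, 17, 2753            # modulus, public and private exponents
    dP, dQ = d % (p - 1), d % (q - 1)  # exponent1, exponent2
    qInv = pow(q, -1, p)               # coefficient (requires Python 3.8+)

    c = pow(42, e, n)                  # encrypt the message 42
    m1 = pow(c, dP, p)
    m2 = pow(c, dQ, q)
    h = (qInv * (m1 - m2)) % p
    assert m2 + h * q == 42            # CRT decryption recovers the message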


Date sequence from the command line

I was looking back at Jeroen Janssens’s book Data Science at the Command Line and his dseq utility caught my eye. This utility prints out a sequence of dates relative to the current date. I’ve needed this and didn’t know it.

Suppose you have a CSV file and you need to add a column of dates as the first column. I’d probably open a spreadsheet, create a column of the dates I needed, then open the CSV file and paste in the column of dates.

With Jeroen’s utility I could run

    dseq 5 | paste -d, - foo.csv

to create the same sequence of dates and add them as the first column of the file foo.csv. The option -d, tells paste to use a comma as the field separator rather than the default tab. The dash tells paste to use the piped output from dseq as its first input file.

You can run dseq three ways. With one argument, such as the 5 above, it returns the next five days from today (starting with tomorrow). With two arguments, the first is the beginning and the second is the end. With three arguments, the middle argument is an increment. As the source file summarizes it:

    # Usage: dseq LAST
    #    or: dseq FIRST LAST
    #    or: dseq FIRST INCREMENT LAST
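
For example, if today were 2023-07-01, the three forms would behave like this (the dates shown are hypothetical):

    % dseq 3
    2023-07-02
    2023-07-03
    2023-07-04

    % dseq 0 2
    2023-07-01
    2023-07-02
    2023-07-03

    % dseq 1 7 15
    2023-07-02
    2023-07-09
    2023-07-16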

If you just want to use dseq, grab it here. If you’d like to understand how dseq is implemented, maybe in order to modify it, keep reading.

How it works

The code is a clever one-liner:

    seq -f "%g day" "$@" | date --file - +%F

The source file has 17 lines: a shebang, several lines of documentation, and one line of code.

The one-liner starts with seq, a standard utility that produces a sequence of integers. Like many command line utilities, seq is trivial by itself, but it composes nicely with other utilities. And so it can be used in a pipeline to create useful scripts, as it does above.

The argument "$@" simply passes on the arguments of the script calling seq as arguments to seq. So the arguments of dseq become the arguments to seq.

The rest of the call to seq is formatting. It tells seq to append “ day” after each number. The command

    seq -f "%g day" 5

produces

    1 day
    2 day
    3 day
    4 day
    5 day

This creates strings which the date utility will interpret.

The command

    date -d "1 day"

returns the date one day from now. It includes the time, so it’s the date and time 24 hours from now.

The command

    date -d "1 day" +%F

uses the format string +%F to format the date like YYYY-MM-DD, chopping off the time.

The date option --file says to take a file as input and process each line as if it were passed into date with the -d option. The dash option says to use standard input as the file, just as the example with the paste command above used dash to signify standard input, i.e. the output of the command to the left of the pipe symbol.
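
You can watch date do this translation by feeding it a single line on standard input:

    % echo "1 day" | date --file - +%F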

Note that this script works with the GNU coreutils implementation of date. It does not work, for example, with the version of date that ships with macOS.

Named entity recognition

Named entity recognition (NER) is a task of natural language processing: pull out named things from text. It sounds trivial at first. Just create a giant list of named things and compare against it.

But suppose, for example, University of Texas is on your list. If Texas is also on your list, do you report that you have a named entity inside a named entity? And how do you handle The University of Texas? Do you put it on your list as well? What about UT? Can you tell from context whether UT stands for University of Texas, University of Toronto, or the state of Utah?

Searching for Rice University would be even more fun. The original name of the school was The William Marsh Rice Institute for the Advancement of Letters, Science, and Art. I don’t know whether the name was ever officially changed. A friend who went to Rice told me they had a ridiculous cheer that spelled out every letter in the full name. And of course rice could refer to a grain.

Let’s see what happens when we run the following sentence through spaCy looking for named entities.

Researchers from the University of Texas at Austin organized a pickleball game with their colleagues from Rice University on Tuesday.
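
Here’s a minimal sketch of the code behind this experiment, assuming the en_core_web_sm model has been downloaded:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    text = ("Researchers from the University of Texas at Austin "
            "organized a pickleball game with their colleagues "
            "from Rice University on Tuesday.")
    # Print each named entity and its tag.
    for ent in nlp(text).ents:
        print(ent.text, ent.label_)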

I deliberately did not capitalize the definite article in front of University of Texas because I suspected spaCy might include the article if it were capitalized but not otherwise. It included the article in either case.

The results depend on the language model used. When I used en_core_web_trf it included at Austin as part of the university name.

When I used the smaller en_core_web_sm model it pulled out Austin as a separate entity.

The tag ORG stands for organization and DATE obviously stands for date. GPE is a little less obvious, standing for geopolitical entity.

When I changed Rice University to simply Rice, spaCy still recognized Rice as an organization. When I changed it to rice with no capitalization, it did not recognize it as an organization.

The other day I stress tested spaCy by giving it some text from Chaucer’s Canterbury Tales. Even though spaCy is trained on Modern English, it did better than I would have expected on Middle English.

Using the en_core_web_trf model it recognizes Engelond and Caunterbury as cities.

When I switched to en_core_web_sm it still recognized Caunterbury as a city, but tagged Engelond as a person.


Jaccard index and jazz albums


The Jaccard index is a way of measuring the similarity of sets. The Jaccard index, or Jaccard similarity coefficient, of two sets A and B is the number of elements in their intersection, A ∩ B, divided by the number of elements in their union, A ∪ B.

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

Jaccard similarity is a robust way to compare things in machine learning, say in clustering algorithms, because it is less sensitive to outliers than other similarity measures such as cosine similarity.

Miles Davis Albums

Here we’ll illustrate Jaccard similarity by looking at the personnel on albums by Miles Davis. Specifically, which pair of albums had more similar personnel: Kind of Blue and Round About Midnight, or Bitches Brew and In a Silent Way?

There were four musicians who played on both Kind of Blue and Round About Midnight: Miles Davis, Cannonball Adderley, John Coltrane, and Paul Chambers.

There were six musicians who played on both Bitches Brew and In a Silent Way: Miles Davis, Wayne Shorter, Chick Corea, Dave Holland, John McLaughlin, and Joe Zawinul.

The latter pair of albums had more personnel in common, but they also had more personnel in total.

There were 9 musicians who performed on either Kind of Blue or Round About Midnight. Since 4 played on both albums, the Jaccard index comparing the personnel on the two albums is 4/9.

In a Silent Way and especially Bitches Brew used more musicians. A total of 17 musicians performed on at least one of these albums, including 6 who performed on both. So the Jaccard index is 6/17.

Jaccard distance

Jaccard distance is the complement of Jaccard similarity, i.e.

d_J(A, B) = 1 - J(A,B)

In our example, the Jaccard distance between Kind of Blue and Round About Midnight is 1 − 4/9 ≈ 0.556. The Jaccard distance between Bitches Brew and In a Silent Way is 1 − 6/17 ≈ 0.647.
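
Both quantities are one-liners on Python sets. Here’s a minimal sketch, using stand-in sets with the same counts as the first pair of albums:

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def jaccard_distance(a, b):
        return 1 - jaccard(a, b)

    # Stand-in sets with 4 elements in common and 9 in their union,
    # the counts for Kind of Blue and Round About Midnight.
    a, b = set(range(7)), set(range(3, 9))
    assert jaccard(a, b) == 4/9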

Jaccard distance really is a distance. It is clearly a symmetric function of its arguments, unlike Kullback-Leibler divergence, which is not.

The difficulty in establishing that Jaccard distance is a distance function, i.e. a metric, is the triangle inequality. The triangle inequality does hold, though this is not simple to prove.

Trying NLP on Middle English

It’s not fair to evaluate NLP software on a language it wasn’t designed to process, but I wanted to try it anyway.

The models in the spaCy software library were trained on modern English text and not on Middle English. Nevertheless, spaCy does a pretty good job of parsing Chaucer’s Canterbury Tales, written over 600 years ago. I used the model en_core_web_lg in my little experiment below.

The text I used comes from the prologue:

From every shires ende
of Engelond to Caunterbury they wende
the hooly blisful martir for to seke
that hem hath holpen
whan that they were seeke.
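
Here’s a minimal sketch of the tagging code, assuming the en_core_web_lg model is installed:

    import spacy

    nlp = spacy.load("en_core_web_lg")
    text = ("From every shires ende of Engelond to Caunterbury they wende "
            "the hooly blisful martir for to seke that hem hath holpen "
            "whan that they were seeke.")
    # Print each token with its inferred part of speech.
    for token in nlp(text):
        print(token.text, token.pos_)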

The software correctly identifies, for example, wende (went) and seke (seek) as verbs, and seeke (sick) as an adjective. Overall it does a pretty good job. I imagine it would do worse on Middle English text that differed more from Modern English usage, but so would a contemporary human who doesn’t know Middle English.


Natural language processing and unnatural text

I recently evaluated two software applications designed to find PII (personally identifiable information) in free text using natural language processing. Both failed badly, passing over obvious examples of PII. By contrast, I also tried natural language processing software on a nonsensical poem, and the software did quite well.

Doctor’s notes

It occurred to me later that the software packages to search for PII probably assume “natural language” has the form of fluent prose, not choppy notes by physicians. The notes that I tested did not consist of complete sentences marked up with grammatically correct punctuation. The text may have been transcribed from audio.

Some software packages deidentify medical notes better than others. I’ve seen some work well and some work poorly. I suspect the former were written specifically for their purpose and the latter were more generic.

Jabberwocky

I also tried NLP software on Lewis Carroll’s poem Jabberwocky. It too is unnatural language, but in a different sense.

Jabberwocky uses nonsense words that Carroll invented for the poem, but otherwise it is grammatically correct. The poem is standard English at the level of structure, though not at the level of words. It is the opposite of medical notes that are standard English at the word level (albeit with a high density of technical terms), but not at a structural level.

I used the spaCy natural language processing library on a couple of stanzas from Carroll’s poem.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

He took his vorpal sword in hand;
Long time the manxome foe he sought—
So rested he by the Tumtum tree
And stood awhile in thought.

I fed the lines into spaCy and asked it to diagram the lines, indicating parts of speech and dependencies. The software did a good job of inferring the use of even the nonsense words. I gave the software one line at a time rather than a stanza at a time because the latter results in diagrams that are awkwardly wide, too wide to display here. (The spaCy visualization software has a “compact” option, but this option does not make the visualizations much more compact.)

Here are the visualizations of the lines, one dependency diagram per line.
And here is the Python code I used to create the diagrams above.

    import spacy
    from spacy import displacy
    from pathlib import Path

    nlp = spacy.load("en_core_web_sm")

    lines = [
        "Beware the Jabberwock, my son!",
        "The jaws that bite, the claws that catch!",
        "Beware the Jubjub bird",
        "Shun the frumious Bandersnatch!",
        "He took his vorpal sword in hand.",
        "Long time the manxome foe he sought",
        "So rested he by the Tumtum tree",
        "And stood awhile in thought."
    ]

    for line in lines:
        doc = nlp(line)
        # Render the dependency diagram for the line as SVG.
        svg = displacy.render(doc, style="dep", jupyter=False)
        # Name the file after the words in the line, minus punctuation.
        file_name = "-".join(w.text for w in doc if not w.is_punct) + ".svg"
        Path(file_name).write_text(svg, encoding="utf-8")


How an LLM might leak medical data

Machine learning models occasionally memorize training data. Under the right prompt, a model could return portions of the training data verbatim.

If a large language model is trained on deidentified medical data, along with data that overlaps with the medical data, it could potentially leak details of a person’s medical history. I’m not saying that any particular LLM does leak data, but it’s something to watch out for if you are training an LLM.

One way to try to create a data leak would be to prompt the LLM with the data you have and see whether it will fill in more. For example, I asked Bing AI

Tell me about a man from Arkansas with ICD-10 code C50.921.

It came back with the following.

… Once upon a time, there was a man named John who lived in Arkansas. John was a hardworking man who loved spending time with his family and friends. One day, John noticed a lump in his right breast. He went to the doctor and was diagnosed with malignant neoplasm of unspecified site of the right male breast, classified under the ICD-10 code C50.921. …

I chose this diagnosis code because male breast cancer is extremely rare, and so the risk of reidentification is higher than for a more common diagnosis.

The model filled in the detail that the man’s name is John. This is not at all suspicious since John is one of the most common male names in history. If the man’s name had been Vivek, there would be more cause to suspect the model was telling us about a real man named Vivek, though of course that could be a randomly chosen male name.

If a neural network were trained on deidentified medical data, it could link fields together. If the story above had said “John, aged 42, …” the age might have been pulled from an actual patient record.

If the data the network was trained on was deidentified well, even leaking data verbatim should not create more than a very small risk of identification. However, if the data contained tokens linking the records to publicly available information, such as real estate records (this happens), then our hypothetical LLM might reveal more personal details that could be used to narrow down whose data is being leaked.


Experiments with Bing chat

My two previous posts looked at experiments with ChatGPT and Google Bard. This post will look at redoing the same experiments with Microsoft’s Bing Chat: looking for mnemonic encodings and simplifying Boolean expressions.

When you open up Bing chat you can select a conversational style:

  • More creative
  • More balanced
  • More precise

I chose “more precise” for my experiments.

Mnemonic encodings

Bing seems to “know” more about the Major mnemonic system. In response to my first question about the system it gave a link to the Wikipedia article on the Major system and a couple of other relevant links. It gave a more complete table of encodings than ChatGPT did:

  • 0: s, z
  • 1: t, d, th
  • 2: n
  • 3: m
  • 4: r
  • 5: l
  • 6: j, ch, sh
  • 7: c, k, g, q, ck
  • 8: v, f, ph
  • 9: p, b

Directly asking for encodings

It seems to do pretty well when I ask for an encoding of a two-digit number, and when I ask for an encoding of a three-digit number it encodes the first two digits correctly but ignores the third digit. However, when I asked it to encode 345 it did very well:

Using the Major System, the number 345 can be encoded as “molar” or “mailer” or “miller” or “mural” or “marl”, among others.

ChatGPT and Google Bard said that no English word can encode 345, though Google Bard came up with Merlin anyway. (Strictly speaking “Merlin” encodes 3452, but it’s common practice to just pay attention to the first three consonants in words with more than three consonants.)

Asking for sounds

When I asked “Find a word with the sounds g, r, and f” Bing came back with

One word that contains the sounds g, r, and f is “giraffe” 🦒

When I asked for the sounds l, f, and b, Bing came up with “belief.” When I added “in that order” to my prompt it came back with “bluff” and “flyby,” ignoring the order requirement.

In general Bing AI seems to do better than ChatGPT and Bard at word game questions.

Boolean minimization

Next I tried the Boolean minimization problem that I’d tried with ChatGPT and Bard using the prompt

Minimize the boolean expression ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD

ChatGPT tried a Karnaugh map approach, but botched it. Bard tried an algebraic approach, but also botched it.

Bing tried something different:

The given boolean expression can be simplified using the consensus theorem. The consensus theorem states that for any three variables A, B, and C: AB + ¬AC + BC = AB + ¬AC. Applying this theorem to the given expression, we get

¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD
= ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC
= ¬AB¬C¬D + A( ¬B¬C¬D + ¬B¬CD + ¬BC¬D + ¬BCD) + AB( ¬C¬D) + ABC
= ¬AB( ¬C) ( ¬D) + A( ¬B) ( ¬C( ¬D) + ¬CD) + AB( ¬C( ¬D)) + ABC
= ¬AB( ¬C) ( ¬D) + A( ¬B) ( ¬CD) + AB( ¬C( ¬D)) + ABC
= (A+B)(A+C)(A+D)(B+C)(B+D)(C+D)

This derivation starts out correctly but goes astray about halfway through.
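
A brute-force check makes the failure concrete. Here’s a sketch comparing the original expression with Bing’s final product-of-sums form over all sixteen inputs:

    from itertools import product

    def original(a, b, c, d):
        # The eight terms of the original sum of products.
        return ((not a and b and not c and not d) or
                (a and not b and not c and not d) or
                (a and not b and not c and d) or
                (a and not b and c and not d) or
                (a and not b and c and d) or
                (a and b and not c and not d) or
                (a and b and c and not d) or
                (a and b and c and d))

    def bing(a, b, c, d):
        # Bing's final answer: (A+B)(A+C)(A+D)(B+C)(B+D)(C+D)
        return ((a or b) and (a or c) and (a or d) and
                (b or c) and (b or d) and (c or d))

    disagreements = [v for v in product([False, True], repeat=4)
                     if original(*v) != bing(*v)]
    print(len(disagreements), "of 16 inputs disagree")

Running this shows the two expressions disagree on seven of the sixteen inputs.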


Boolean function minimization with AI

Screenshot of Karnaugh map produced by ChatGPT

I was curious how well LLMs would do at minimizing a Boolean expression, that is, taking a Boolean expression and producing a smaller equivalent expression.

I didn’t expect good performance because this problem is more about logic than recall, but sometimes LLMs surprise you, so I wanted to give it a chance. I thought it might accomplish some reduction in size though not as much as possible. Let’s see what it did.

Prompt

My prompt for the three platforms I tried—GPT4All, ChatGPT, and Google Bard—was

Minimize the boolean expression ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD

I knew the expression could be minimized to

B¬C¬D + A¬B + AC.
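
A quick sketch confirms that this form agrees with the original expression on all sixteen inputs, representing the original by its eight minterms:

    from itertools import product

    # The eight minterms of the original expression as (A, B, C, D) bits.
    minterms = {(0,1,0,0), (1,0,0,0), (1,0,0,1), (1,0,1,0),
                (1,0,1,1), (1,1,0,0), (1,1,1,0), (1,1,1,1)}

    def minimized(a, b, c, d):
        # B¬C¬D + A¬B + AC
        return (b and not c and not d) or (a and not b) or (a and c)

    for v in product([0, 1], repeat=4):
        assert bool(minimized(*map(bool, v))) == (v in minterms)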

GPT4All

I first tried GPT4All using the Groovy data set. It crashed immediately. I tried a few variations on the same prompt and it crashed right away every time.

This was on a MacBook Pro. I didn’t try it on Windows or Linux.

ChatGPT

I’ll go through ChatGPT’s response a little at a time.

To minimize the given boolean expression ¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD, we can use Karnaugh maps or Boolean algebra simplification techniques. I’ll demonstrate both methods.

This is a sensible start. Karnaugh maps are a common classroom technique for solving small Boolean minimization problems, so there were lots of examples of Karnaugh maps in the data scraped to train ChatGPT.

Next ChatGPT draws a Karnaugh map. The screenshot at the top of the post came from a different experiment, a version of the experiment described here but containing an error on my part. For this problem ChatGPT produced a black image containing only “CD”, but it also produced a Karnaugh map in text. Cleaning up the formatting we get the following.

       | 00 | 01 | 11 | 10 |
    00 |  1 |  0 |  1 |  0 |
    01 |  1 |  1 |  1 |  1 |
    11 |  1 |  1 |  1 |  1 |
    10 |  0 |  1 |  0 |  0 | 

This is nonsense. Nine out of sixteen entries are wrong.
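For comparison, here is the correct map, assuming the usual convention of rows indexed by AB and columns by CD, both in Gray code order. It differs from ChatGPT’s map in nine of the sixteen cells.

       | 00 | 01 | 11 | 10 |
    00 |  0 |  0 |  0 |  0 |
    01 |  1 |  0 |  0 |  0 |
    11 |  1 |  0 |  1 |  1 |
    10 |  1 |  1 |  1 |  1 |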

ChatGPT concludes

Grouping the adjacent 1’s in the Karnaugh map, we can simplify the expression to: ¬A¬C + ¬A¬D + A¬B + ACD

This is also nonsense. It does not correctly solve the original problem, nor does it solve the problem corresponding to the incorrect Karnaugh map above.

Next let’s see how it does with Boolean algebra. It starts out OK.

Boolean Algebra Method: We can simplify the expression using Boolean algebra identities and rules.

¬AB¬C¬D + A¬B¬C¬D + A¬B¬CD + A¬BC¬D + A¬BCD + AB¬C¬D + ABC¬D + ABCD

Rearranging the terms: A¬B¬C¬D + A¬B¬CD + AB¬C¬D + A¬BC¬D + ABC¬D + A¬BCD + ABCD + A¬B¬C¬D

Grouping the terms: A¬B¬C¬D + A¬B¬CD + AB¬C¬D + A¬BC¬D + ABC¬D + A¬BCD + ABCD

Combining like terms: (A + A)¬B¬C¬D + (A + A)¬B¬CD + AB¬C¬D + (A + A)¬BC¬D + ABC¬D + A¬BCD + ABCD

These steps haven’t accomplished anything, but they’re not wrong. Then it goes off the rails.

Using the complement law A + ¬A = 1: ¬B¬C¬D + ¬B¬CD + AB¬C¬D + ¬BC¬D + ABC¬D + A¬BCD + ABCD

The explanation is wrong and the solution is wrong.

The solution is completely wrong, but it looks plausible. At first glance I thought it might be right, or more right than it was. If you were making a movie and you needed to generate electrical engineering homework, this would be adequate for a realistic-looking homework solution.

Google Bard

Google Bard gives a more succinct solution with five steps:

  1. Identify the common terms in the expression. The common terms are ¬AB¬C¬D, A¬B¬CD, and ABC¬D.
  2. Combine the common terms. This gives us the expression (¬AB¬C¬D + A¬B¬CD + ABC¬D) + (A¬B¬C¬D + AB¬C¬D).
  3. Apply the distribution property. This gives us the expression ¬AB¬C¬D + A¬B¬CD + ABC¬D + A¬B¬C¬D + AB¬C¬D.
  4. Identify the redundant terms. The redundant terms are ¬AB¬C¬D and A¬B¬C¬D.
  5. Remove the redundant terms. This gives us the expression A¬B¬CD + ABC¬D.

This too is nonsense.

Large language models and mnemonics

The Major mnemonic system encodes numbers as words in order to make them easier to remember. Digits correspond to consonant sounds (not spellings) as explained here. You can use the system ad hoc, improvising an encoding of a word as needed, or you can memorize canonical encodings of numbers, also known as pegs.

Pegs have a couple of advantages. For one, they are eventually faster. Rather than search for an encoding, you recall a predetermined association. For another, pegs are better for memorizing numbered lists. To recall the nth item in a list, pull up the one-and-only image associated with that number and recall what you associated with it.

Coming up with encodings for most numbers is easy, but for a few holdouts it’s challenging. I created a personal list of pegs for numbers up to 1000. A few of these are idiosyncratic, such as names of people who are not well known, and some bend the rules of the Major system.

I decided to try the ChatGPT and Google Bard large language models to see whether they could help me find pegs that I could share with a general audience, free of obscure references.

Asking directly about the Major system

ChatGPT and Google Bard both seem to have some idea of the Major mnemonic system. As part of its reply to one of my prompts, Google Bard returned this:

    0: S, Z, soft C
    1: D, T, TH
    2: N, L, M
    3: M
    4: R
    5: L
    6: J, SH, soft CH, DG
    7: K
    8: P
    9: H

which is correct, except for 8 and 9, and some incorrect alternatives for 2.

When I asked ChatGPT for “Major mnemonic system encoding table” it returned a correct table. But when I asked it for encodings of numbers, it used different rules. For example, it seemed to systematically use M for 2.

I asked both systems

Find a word that encodes the number 792 in the Major memory system

ChatGPT came back with KNP and GNP. These aren’t words, and it seems ChatGPT might have dyslexia.

Bard said

There is no word in the English language that encodes the number 792 in the Major memory system. The closest word is “Sink”, but it only encodes the digits 7 and 9.

This was meant to be an easy question since there are common English words that encode 792, such as cabin. Neither system came up with good responses to the numbers I found challenging, and it may be that there simply are no English words with the desired sequences of sounds.
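
As a rough illustration, here’s a letter-based sketch of the encoding. The real system maps sounds, not spellings, so a lookup table like this mishandles digraphs such as “ch” and “ph,” but it works for simple words like cabin:

    # Approximate Major-system encoding by consonant letters.
    letter_to_digit = {
        "s": 0, "z": 0,
        "t": 1, "d": 1,
        "n": 2,
        "m": 3,
        "r": 4,
        "l": 5,
        "j": 6,
        "c": 7, "k": 7, "g": 7, "q": 7,
        "f": 8, "v": 8,
        "p": 9, "b": 9,
    }

    def encode(word):
        return "".join(str(letter_to_digit[ch]) for ch in word.lower()
                       if ch in letter_to_digit)

    print(encode("cabin"))   # prints 792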

Before moving on, I wanted to mention one response from Google Bard that I found amusing. When I asked

Find a word that encodes 345 using the Major mnemonic system

Bard responded with

… The encoded number 345 is therefore MERLE.

However, there is no word in the English language that is spelled exactly as MERLE. However, there are a few words that are similar, such as Merry, Mettle, and Melt.

If you are looking for a word that is spelled exactly as MERLE, you may need to create a new word. For example, you could create the word Merlin, which is a name that is often associated with magic and wisdom.

Asking for sounds

The Major system is not well known, so I didn’t expect a direct approach to be that helpful. Next I tried asking for words with particular sequences of sounds rather than asking directly about the Major system.

When I asked both systems

Find a word containing the consonant sounds k, b, and n, in that order

ChatGPT came back with cabin and Bard said

There is no word in the English language that contains the consonant sounds k, b, and n in that order. The closest word is “knob”, but the n sound is pronounced as “ng” in this word.

So the point goes to GPT on that one.

Both systems did poorly on the numbers I had a hard time finding pegs for, possibly because there aren’t any (or many) English words with the specified sounds. With easier questions, the results were mixed.

When I asked

Find a word with the sounds g, r, and f

I was thinking giraffe. ChatGPT returned graf and Bard returned graft.

When I asked

Find a word with sounds l, f, and b

I was thinking, for example, lifeboat. But ChatGPT came back with

A word that contains the consonant sounds “l,” “f,” and “b” is “elfishly.”

and Bard replied

The word “blurb” contains the sounds l, f, and b in that order.

Phonetics

Both systems did well when I asked directly for the phonetic spelling of a word. So I tried to get both systems to come up with lifeboat by asking

Words whose IPA phonetic spellings contain l, f, and b

Bard found words that had some of these letters but not all. When I was more specific, asking

Words whose IPA phonetic spellings contain each of the letters l, f, and b

the results weren’t better.

ChatGPT, on the other hand, came up with affleb, triflingly, bluff, lifeboat, and fallopian. When I asked the more specific question, adding “each of”, ChatGPT returned alphabet, fulbright, bubluous, flyable, and belltry. When I added “in that order” to the prompt, ChatGPT said there are no such words, despite having come up with several such words a moment earlier.
