Searching for proper nouns

Suppose you want to find all the proper nouns in a document. You could grep for every word that starts with a capital letter with something like

    grep -E '\b[A-Z]\w+'

but this would return the first word of each sentence in addition to the words you’re after.

You could grep for capitalized words that are not preceded by a period or question mark followed by a space.

    grep -P '(?<![.?] )\b[A-Z]\w+'

That’s possibly better, but it misses proper nouns at the beginning of a sentence.
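To see both failure modes side by side, here is a minimal sketch that runs the same two patterns through Python's `re` module. The sample text is made up for illustration: it puts a proper noun at the start of a sentence so the weakness of the second pattern shows up.

```python
import re

# Made-up sample text: "Ishmael" appears mid-sentence and at the start of a sentence.
text = "Call me Ishmael. Some years ago I went to sea. Ishmael is my name."

# Every capitalized word of two or more characters.
naive = re.findall(r"\b[A-Z]\w+", text)

# Same, but skip words preceded by a period or question mark and a space.
filtered = re.findall(r"(?<![.?] )\b[A-Z]\w+", text)

print(naive)     # includes the sentence openers "Call" and "Some"
print(filtered)  # drops "Some", but also drops the second "Ishmael"
```

Note that the lookbehind cannot rescue the very first word of the text, since nothing precedes it, so "Call" survives both patterns.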

You might be able to accomplish what you’re after by tinkering with regular expressions, but it would be better to use a library that has some idea of what a proper noun is.

NLP with spaCy

The Python natural language processing library spaCy tags each word with its part of speech, so in particular it can be used to search for proper nouns.

Here’s an example using the opening lines of Moby Dick.

    import spacy
    nlp = spacy.load("en_core_web_lg")

    doc = nlp("Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul ... I account it high time to get to sea as soon as I can.")

    for tok in doc:
        if tok.pos_ == "PROPN":
            print(tok)

This will print Ishmael and November only. It does not print words at the beginning of a sentence such as Call or Some even though they are capitalized. When spaCy got to the line

Queequeg was George Washington cannibalistically developed.

it detected that Queequeg is a proper noun. Presumably the model can tell this from context, because the word precedes the verb was, and not because it knows Queequeg is a proper name.

When I changed November to november spaCy was still able to detect that november was a proper noun. When I downcased Ishmael it did not detect that ishmael was a proper noun, presumably because Ishmael is an uncommon name. When I changed the text to “Call me tim” the library did recognize tim as a proper noun.

When I fed spaCy the sentence

I never go as a passenger; nor, though I am something of a salt, do I ever go to sea as a Commodore, or a Captain, or a Cook.

the library thought that Commodore, Captain, and Cook were proper nouns. If I downcase these words, spaCy does not flag them as proper nouns.

When processing the line

For as in this world, head winds are far more prevalent than winds from astern (that is, if you never violate the Pythagorean maxim), so for the most part the Commodore on the quarter-deck gets his atmosphere at second hand from the sailors on the forecastle

spaCy correctly flagged Commodore as a proper noun in this instance. Also, it did not classify Pythagorean as a proper noun; the word is proper but not a noun, i.e. it’s a proper adjective.

TANSTAAFL

My script above has only six lines of code. But it depends on a library that uses a 588 MB language model. [1]

[1] “TANSTAAFL” stands for “There ain’t no such thing as a free lunch.” It comes from The Moon is a Harsh Mistress by Heinlein.

Incidentally, when I fed “The term TANSTAAFL comes from The Moon is a Harsh Mistress by Heinlein.” to spaCy, it flagged Harsh and Mistress as proper nouns.

When I fed it “The term TANSTAAFL comes from ‘The moon is a harsh mistress’ by Heinlein.” the library correctly tagged harsh as an adjective and mistress as a (non-proper) noun.

Large language models and mnemonics

The Major mnemonic system encodes numbers as words in order to make them easier to remember. Digits correspond to consonant sounds (not spellings) as explained here. You can use the system ad hoc, improvising an encoding of a word as needed, or you can memorize canonical encodings of numbers, also known as pegs.

Pegs have a couple of advantages. For one, they are eventually faster. Rather than search for an encoding, you recall a predetermined association. For another, pegs are better for memorizing numbered lists. To recall the nth item in a list, pull up the one-and-only image associated with that number and recall what you associated with it.

Coming up with encodings for most numbers is easy, but for a few holdouts it’s challenging. I created a personal list of pegs for numbers up to 1000, and a few of these are idiosyncratic: some are names of people who are not well known, and some bend the rules of the Major system.

I decided to try the ChatGPT and Google Bard large language models to see whether they could help me find pegs that I could share with a general audience, free of obscure references.

Asking directly about the Major system

ChatGPT and Google Bard both seem to have some idea of the Major mnemonic system. As part of its reply to one of my prompts, Google Bard returned this:

    0: S, Z, soft C
    1: D, T, TH
    2: N, L, M
    3: M
    4: R
    5: L
    6: J, SH, soft CH, DG
    7: K
    8: P
    9: H

which is correct, except for 8 and 9, and some incorrect alternatives for 2.
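For comparison, here is a sketch of the usual Major system table as a Python dictionary, along with a small encoder. The sound labels ("sh", "ch", and so on) are an ad hoc notation chosen for this example, not a standard phonetic alphabet, and "g" here means hard g. Variants of the system differ on details such as how to treat "th" and "ng".

```python
# Consonant sound -> digit, following the usual Major system conventions.
# The keys are informal labels for sounds, not spellings.
MAJOR = {
    "s": 0, "z": 0,
    "t": 1, "d": 1, "th": 1,
    "n": 2,
    "m": 3,
    "r": 4,
    "l": 5,
    "j": 6, "sh": 6, "ch": 6, "zh": 6,
    "k": 7, "g": 7,   # hard g only
    "f": 8, "v": 8,
    "p": 9, "b": 9,
}

def encode(sounds):
    """Turn a sequence of consonant sounds into a string of digits."""
    return "".join(str(MAJOR[s]) for s in sounds)

# The consonant sounds in "cabin" are k, b, n.
print(encode(["k", "b", "n"]))  # 792
```

The encoder takes sounds rather than spellings, so you have to supply the consonant sounds yourself or get them from a pronunciation dictionary.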

When I asked ChatGPT for “Major mnemonic system encoding table” it returned a correct table. But when I asked it for encodings of numbers, it used different rules. For example, it seemed to systematically use M for 2.

I asked both systems

Find a word that encodes the number 792 in the Major memory system

ChatGPT came back with KNP and GNP. These aren’t words, and it seems ChatGPT might have dyslexia.

Bard said

There is no word in the English language that encodes the number 792 in the Major memory system. The closest word is “Sink”, but it only encodes the digits 7 and 9.

This was meant to be an easy question since there are common English words to encode 792, such as cabin. Neither system came up with good responses to the numbers I found challenging, and it may be that there simply are no English words with the desired sequence of sounds.

Before moving on, I wanted to mention one response from Google Bard that I found amusing. When I asked

Find a word that encodes 345 using the Major mnemonic system

Bard responded with

… The encoded number 345 is therefore MERLE.

However, there is no word in the English language that is spelled exactly as MERLE. However, there are a few words that are similar, such as Merry, Mettle, and Melt.

If you are looking for a word that is spelled exactly as MERLE, you may need to create a new word. For example, you could create the word Merlin, which is a name that is often associated with magic and wisdom.

Asking for sounds

The Major system is not well known, so I didn’t expect a direct approach to be that helpful. Next I tried asking for words with particular sequences of sounds rather than asking directly about the Major system.

When I asked both systems

Find a word containing the consonant sounds k, b, and n, in that order

ChatGPT came back with cabin and Bard said

There is no word in the English language that contains the consonant sounds k, b, and n in that order. The closest word is “knob”, but the n sound is pronounced as “ng” in this word.

So the point goes to GPT on that one.

Both systems did poorly on the numbers I had a hard time finding pegs for, possibly because there aren’t any (or many) English words with the specified sounds. With easier questions, the results were mixed.

When I asked

Find a word with the sounds g, r, and f

I was thinking giraffe. ChatGPT returned graf and Bard returned graft.

When I asked

Find a word with sounds l, f, and b

I was thinking, for example, lifeboat. But ChatGPT came back with

A word that contains the consonant sounds “l,” “f,” and “b” is “elfishly.”

and Bard replied

The word “blurb” contains the sounds l, f, and b in that order.

Phonetics

Both systems did well when I asked directly for the phonetic spelling of a word. So I tried to get both systems to come up with lifeboat by asking

Words whose IPA phonetic spellings contain l, f, and b

Bard found words that had some of these letters but not all. When I was more specific, asking

Words whose IPA phonetic spellings contain each of the letters l, f, and b

the results weren’t better.

ChatGPT, on the other hand, came up with affleb, triflingly, bluff, lifeboat, and fallopian. When I asked the more specific question, adding “each of”, ChatGPT returned alphabet, fulbright, bubluous, flyable, and belltry. When I added “in that order” to the prompt, ChatGPT said there are no such words, despite having come up with several such words a moment earlier.
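Checking the “in that order” condition mechanically is simple once you have phonetic spellings. The following is a minimal sketch; the IPA strings were typed in by hand for illustration and should be treated as approximate rather than authoritative.

```python
def contains_in_order(text, symbols):
    """True if the symbols appear in text in the given order, not necessarily adjacent."""
    remaining = iter(text)
    # `s in remaining` consumes the iterator up to the first match,
    # so each later symbol must be found after the previous one.
    return all(s in remaining for s in symbols)

# Hand-entered IPA transcriptions (an assumption of this example).
ipa = {
    "lifeboat": "ˈlaɪfboʊt",
    "alphabet": "ˈælfəbɛt",
    "bluff": "blʌf",
}

in_order = [word for word, spelling in ipa.items()
            if contains_in_order(spelling, "lfb")]
print(in_order)  # ['lifeboat', 'alphabet']
```

The word bluff contains all three sounds but fails the order test because its b comes first.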

Levenshtein distance from Finnegans Wake to Return of the Jedi

I ran into a delightfully strange blog post today called Finnegans Ewok that edits the first few paragraphs of Finnegans Wake to make it into something like Return of the Jedi.

(Unfortunately the page has gone away since I first wrote this. Some of the text is preserved in this Python script.)

The author, Adam Roberts, said via Twitter “What I found interesting here was how little I had to change Joyce’s original text. Tweak a couple of names and basically leave it otherwise as was.”

So what I wanted to do is quantify just how much had to change using the Levenshtein distance, which is essentially the number of one-character changes necessary to transform one string into another.
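For readers who want to try this themselves, here is a minimal sketch of the standard dynamic programming algorithm for Levenshtein distance. It is not necessarily the code used to produce the numbers in this post; a library implementation would work just as well.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] is the distance between the part of a processed so far
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))    # 3
print(levenshtein("riverrun", "movierun")) # 4
```

This version keeps only two rows of the table in memory, which is plenty for comparing paragraphs.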

Here’s the first paragraph from James Joyce:

riverrun, past Eve and Adam’s, from swerve of shore to bend of bay, brings us by a commodius vicus of recirculation back to Howth Castle and Environs.

And here’s the first paragraph from Adam Roberts:

movierun, past new and hopes, from strike of back to bend of jeday, brings us by a commodius lucas of recirculation back to forestmoon and endor.

The original paragraph is 150 characters, the parody is 145 characters, and the Levenshtein distance is 44.

Here’s a summary of the results for the first four paragraphs.

    |-------+---------+----------|
    | Joyce | Roberts | Distance |
    |-------+---------+----------|
    |   150 |     145 |       44 |
    |   700 |     727 |      119 |
    |   594 |     615 |      145 |
    |  1053 |     986 |      333 |
    |-------+---------+----------|

The fifth paragraph seems to diverge more from Joyce. I may have gotten something misaligned, and reading enough of Finnegans Wake to debug the problem made my head hurt, so I stopped.

Update: See the next post for sequence alignment applied to the two sources. This lets you see not just the number of edits but what the edits are. This shows why I was having difficulty aligning the fifth paragraphs.

Unnatural language processing

Larry Wall, creator of the Perl programming language, created a custom degree plan in college, an interdisciplinary course of study in natural and artificial languages, i.e. linguistics and programming languages. Many of the features of Perl were designed as an attempt to apply natural language principles to the design of an artificial language.

I’ve been thinking of a different connection between natural and artificial languages, namely using natural language processing (NLP) to reverse engineer source code.

The source code of a computer program is text, but not a text. That is, it consists of plain text files, but it’s not a text in the sense that Paradise Lost or an email is a text. The most efficient way to parse a programming language is as a programming language. Treating it as an English text will lose vital structure, and wrongly try to impose a foreign structure.

But what if you have two computer programs? That’s the problem I’ve been thinking about. I have code in two very different programming languages, and I’d like to know how functions in one code base relate to those in the other. The connections are not ones that a compiler could find. The connections are more psychological than algorithmic. I’d like to reverse engineer, for example, which function in language A a developer had in mind when he wrote a function in language B.

Both code bases are in programming languages, but the function names are approximately natural language. If a pair of functions have the same name in both languages, and that name is not generic, then there’s a good chance they’re related. And if the names are similar, maybe they’re related.

I’ve done this sort of thing informally forever. I imagine most programmers do something like this from time to time. But only recently have I needed to do this on such a large scale that proceeding informally was not an option. I wrote a script to automate some of the work by looking for fuzzy matches between function names in both languages. This was far from perfect, but it reduced the amount of sleuthing necessary to line up the two sets of source code.
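As an illustration of the kind of fuzzy matching described above, here is a minimal sketch using `difflib` from the Python standard library. It is not the script mentioned in this post. The function names, the helper `fuzzy_pairs`, and the 0.6 cutoff are all hypothetical choices made for the example.

```python
from difflib import get_close_matches

# Hypothetical function names from two code bases.
fortran_names = ["CALC_FLUX", "INIT_GRID", "SOLVE_TRIDIAG", "WRITE_OUTPUT"]
python_names = ["calculate_flux", "initialize_grid", "solve_tridiagonal", "plot_results"]

def fuzzy_pairs(names_a, names_b, cutoff=0.6):
    """Map each name in names_a to its closest matches in names_b, ignoring case."""
    lowered = {name.lower(): name for name in names_b}
    pairs = {}
    for name in names_a:
        # get_close_matches returns candidates sorted from best to worst.
        hits = get_close_matches(name.lower(), list(lowered), n=3, cutoff=cutoff)
        pairs[name] = [lowered[h] for h in hits]
    return pairs

for name, candidates in fuzzy_pairs(fortran_names, python_names).items():
    print(name, "->", candidates)
```

Names with no candidate above the cutoff, like WRITE_OUTPUT here, come back with an empty list. Those are the ones that still need manual sleuthing.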

Around a year ago I had to infer which parts of an old Fortran program corresponded to different functions in a Python program. I also had to infer how some poorly written articles mapped to either set of source code. I did all this informally, but I wonder now whether NLP might have sped up my detective work.

Another situation where natural language processing could be helpful in software engineering is determining code authorship. Again this is something most programmers have probably done informally, saying things like “I bet Bill wrote this part of the code because it looks like his style” or “Looks like Pat left her fingerprints here.” This could be formalized using NLP techniques, and I imagine it has been. Just as Frederick Mosteller and colleagues did a statistical analysis of The Federalist Papers to determine who wrote which paper, I’m sure there have been similar analyses to try to find out who wrote what code, say for legal reasons.

Maybe this already has a name, but I like “unnatural language processing” for the application of natural language processing to unnatural (i.e. programming) languages. I’ve done a lot of ad hoc unnatural language processing, and I’m curious how much of it I could automate in the future.
