Named entity recognition

Named entity recognition (NER) is a task of natural language processing: pull out named things text. It sounds like trivial at first. Just create a giant list of named things and compare against that.

But suppose, for example, University of Texas is on your list. If Texas is also on your list, do you report that you have a named entity inside a named entity? And how do you handle The University of Texas? Do you put it on your list as well? What about UT? Can you tell from context whether UT stands for University of Texas, University of Toronto, or the state of Utah?

Searching for Rice University would be even more fun. The original name of the school was The William Marsh Rice Institute for the Advancement of Letters, Science, and Art. I don’t know whether the name was ever officially changed. A friend who went to Rice told me they had a ridiculous cheer that spelled out every letter in the full name. And of course rice could refer to a grain.

Let’s see what happens when we run the following sentence through spaCy looking for named entities.

Researchers from the University of Texas at Austin organized a picleball game with their colleagues from Rice University on Tuesday.

I deliberately did not capitalize the definite article in front of University of Texas because I suspected spaCy might include the article if it were capitalized but not otherwise. It included the article in either case.

The results depend on the language model used. When I used en_core_web_trf it included at Austin as part of the university name.

When I used the smaller en_core_web_sm model it pulled out Austin as a separate entity.

The tag ORG stands for organization and DATE obviously stands for date. GPE is a little less obvious, standing for geopolitical entity.

When I changed Rice University to simply Rice, spaCy still recognized Rice as an organization. When I changed it to rice with no capitalization, it did not recognize it as an organization.

The other day I stress tested spaCy by giving it some text from Chaucer’s Canterbury Tales. Even though spaCy is trained on Modern English, it did better than I would have expected on Middle English.

Using the en_core_web_trf model it recognizes Engelond and Caunterbury as cities.

When I switched to  en_core_web_sm it still recognized Caunterbury as city, but tagged Engelond as a person.

 

Trying NLP on Middle English

It’s not fair to evaluate NLP software on a language it wasn’t designed to process, but I wanted to try it anyway.

The models in the spaCy software library were trained on modern English text and not on Middle English. Nevertheless, spaCy does a pretty good job of parsing Chaucer’s Canterbury Tales, written over 600 years ago. I used the model en_core_web_lg in my little experiment below.

The text I used comes from the prologue:

From every shires ende
of Engelond to Caunterbury they wende
the hooly blisful martir for to seke
that hem hath holpen
whan that they were seeke.

The software correctly identifies, for example, wende (went) and seke (seak) as verbs, and seeke (sick) as an adjective. Overall it does a pretty good job. I imagine it would do worse on Middle English text that differed more from Modern English usage, but so would a contemporary human who doesn’t know Middle English.

Related posts

Natural language processing and unnatural text

I recently evaluated two software applications designed to find PII (personally identifiable information) in free text using natural language processing. Both failed badly, passing over obvious examples of PII. By contrast, I also tried natural language processing software on a nonsensical poem, it the software did quite well.

Doctor’s notes

It occurred to me later that the software packages to search for PII probably assume “natural language” has the form of fluent prose, not choppy notes by physicians. The notes that I tested did not consist of complete sentences marked up with grammatically correct punctuation. The text may have been transcribed from audio.

Some software packages deidentify medical notes better than others. I’ve seen some work well and some work poorly. I suspect the former were written specifically for their purpose and the latter were more generic.

Jabberwocky

I also tried NLP software on Lewis Carroll’s poem Jabberwocky. It too is unnatural language, but in a different sense.

Jabberwocky uses nonsense words that Carroll invented for the poem, but otherwise it is grammatically correct. The poem is standard English at the level of structure, though not at the level of words. It is the opposite of medical notes that are standard English at the word level (albeit with a high density of technical terms), but not at a structural level.

I used the spaCy natural language processing library on a couple stanzas from Lewis’ poem.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

He took his vorpal sword in hand;
Long time the manxome foe he sought—
So rested he by the Tumtum tree
And stood awhile in thought.

I fed the lines into spaCy and asked it to diagram the lines, indicating parts of speech and dependencies. The software did a good job of inferring the use of even the nonsense words. I gave the software one line at a time rather than a stanza at a time because the latter results in diagrams that are awkwardly wide, too wide to display here. (The spaCy visualization software has a “compact” option, but this option does not make the visualizations much more compact.)

Here are the visualizations of the lines.








And here is the Python code I used to create the diagrams above.

    import spacy
    from spacy import displacy
    from pathlib import Path
    
    nlp = spacy.load("en_core_web_sm")
        
    lines = [
        "Beware the Jabberwock, my son!",
        "The jaws that bite, the claws that catch!",
        "Beware the Jubjub bird",
        "Shun the frumious Bandersnatch!",
        "He took his vorpal sword in hand.",
        "Long time the manxome foe he sought",
        "So rested he by the Tumtum tree",
        "And stood awhile in thought."
    ]
    
    for line in lines:
        doc = nlp(line)
        svg = displacy.render(doc, style="dep", jupyter=False)    
        file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
        output_path = Path(file_name)
        output_path.open("w", encoding="utf-8").write(svg)

Related posts

Filtering on how words are being used

Yesterday I wrote about how you could use the spaCy Python library to find proper nouns in a document. Now suppose you want to refine this and find proper nouns that are the subjects of sentences or proper nouns that are direct objects.

This post was motivated by a project in which I needed to pull out company names from a large amount of text, and it was important to know how the company name was being used.

Dependency labels

Tokens in spaCy have a dependency label attribute dep (or dep_ for its string representation). Dependency labels tell you how a word is being used. For example, dobj tells you the word is being used as a direct object, and nsubj tells you its being used as a nominal subject.

In yesterday’s post the line

    if tok.pos_ == "PROPN":
        print(tok)

filtered tokens to look for proper nouns. We could modify the script to also tell us how the proper nouns are being used by printing tok.dep_.

There are three proper nouns in the opening paragraph of Moby Dick: Ishmael, November, and Cato.

Call me Ishmael. … whenever it is a damp, drizzly November in my soul … With a philosophical flourish Cato throws himself upon his sword …

If we run

    if tok.pos_ == "PROPN":
        print(tok, tok.dep_)

on the first paragraph we get

    Ishmael oprd
    November attr
    Cato nsubj

but it’s not obvious what the output means. If we wrap tok.dep_ with spacy.explain we get a more verbose explanation.

    Ishmael object predicate
    November attribute
    Cato nominal subject

Pulling out subjects

Now suppose we wanted to pull out words that are subjects. We could filter on tok.dep_ == "nsubj" but there are more kinds of subjects than just nominal subjects. There are six kinds of subjects:

  1. nsubj: nominal subject
  2. nsubjpass: nominal passive subject
  3. csubj: clausal subject
  4. csubjpass: clausal passive subject
  5. agent: agent
  6. expl: expletive

Finding the range of possible values for dependency labels takes some digging. I don’t believe it’s in the spaCy documentation per se, but if you’re persistent you’ll find a link this list or the paper it came from.

Searching for proper nouns

Suppose you want to find all the proper nouns in a document. You could grep for every word that starts with a capital letter with something like

    grep '\b[A-Z]\w+'

but this would return the first word of each sentence in addition to the words you’re after.

You could grep for capitalized words that are not preceded by a period or question mark followed by a space.

    grep -P '(?<![.?] )\b[A-Z]\w+'

That’s possibly better, but it misses proper nouns at the beginning of a sentence.

You might be able to accomplish what you’re after by tinkering with regular expressions, but it would be better to use a library that has some idea of what a proper noun is.

NLP with spaCy

The Python natural language processing library spaCy classifies words by part of speech, and so could in particular search for proper nouns.

Here’s an example using the opening lines of Moby Dick.

    import spacy
    nlp = spacy.load("en_core_web_lg")

    doc = nlp("Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul ... I account it high time to get to sea as soon as I can.")

    for tok in doc:
        if tok.pos_ == "PROPN":
            print(tok)

This will print Ishmael and November only. It does not print words at the beginning of a sentence such as Call or Some even though they are capitalized. When spaCy got to the line

Queequeg was George Washington cannibalistically developed.

it detected that Queequeg is a proper noun. Presumably the model can tell this from context, because the word precedes the verb was and not because it knows Queequeg is proper name.

When I changed November to november spaCy was still able to detect that november was a proper noun. When I downcased Ishmael it did not detect that ishmael was a proper noun, presumably because Ishmael is an uncommon name. When I changed the text to “Call me tim” the library did recognize tim as a proper noun.

When I fed spaCy the sentence

I never go as a passenger; nor, though I am something of a salt, do I ever go to sea as a Commodore, or a Captain, or a Cook.

the library thought that Commodore, Captain, and Cook were proper nouns. If I downcase these words, spaCy does not flag them as proper nouns.

When processing the line

For as in this world,head winds are far more prevalent than winds from astern (that is, if you never violate the Pythagorean maxim), so for the most part the Commodore on the quarter-deck gets his atmosphere at second hand from the sailors on the forecastle

spaCy correctly flagged Commodore as a proper noun in this instance. Also, it did not classify Pythagorean as a proper noun; the word is proper but not a noun, i.e. it’s a proper adjective.

TANSTAAFL

My script above has only six lines of code. But it depends on a library that uses a 588 MB language model. [1]

Related posts

[1] “TANSTAALF” stands for “There ain’t no such thing as a free lunch.” It comes from The Moon is a Harsh Mistress by Heinlein.

Incidentally, when I fed “The term TANSTAAFL comes from The Moon is a Harsh Mistress by Heinlein.” to spaCy, it flagged Harsh and Mistress as proper nouns.

When I fed it “The term TANSTAAFL comes from ‘The moon is a harsh mistress’ by Heinlein.” the library correctly tagged harsh as an adjective and mistress as a (non-proper) noun.

Large language models and mnemonics

The Major mnemonic system encodes numbers as words in order to make them easier to remember. Digits correspond to consonant sounds (not spellings) as explained here. You can use the system ad hoc, improvising an encoding of a word as needed, or you can memorize canonical encodings of numbers, also known as pegs.

Pegs have couple advantages. For one, they are eventually faster. Rather than search for an encoding, you recall a predetermined association. For another, pegs are better for memorizing numbered lists. To recall the nth item in a list, pull up the one-and-only image associated with that number and recall what you associated with it.

Coming up with encodings for most numbers is easy, but for a few holdouts it’s challenging. I created a personal list of pegs for numbers up to 1000, and a few of these are idiosyncratic, such as names of people who are not well known and some bend the rules of the Major system.

I decided to try the ChatGPT and Google Bard large language models to see whether they could help me find pegs that I could share with a general audience, free of obscure references.

Asking directly about the Major system

ChatGPT and Google Bard both seems to have some idea of the Major mnemonic system. As part of its reply to one of my prompts, Google Bard returned this:

    0: S, Z, soft C
    1: D, T, TH
    2: N, L, M
    3: M
    4: R
    5: L
    6: J, SH, soft CH, DG
    7: K
    8: P
    9: H

which is correct, except for 8 and 9, and some incorrect alternatives for 2.

When I asked ChatGPT for “Major mnemonic system encoding table” it returned a correct table. But when I asked it for encodings of numbers, it used different rules. For example, it seemed to systematically use M for 2.

I asked both systems

Find a word that encodes the number 792 in the Major memory system

ChatGPT came back with KNP and GNP. These aren’t words, and it seems ChatGPT might have dyslexia.

Bard said

There is no word in the English language that encodes the number 792 in the Major memory system. The closest word is “Sink”, but it only encodes the digits 7 and 9.

This was meant to be an easy question since there are common English words to encode 792, such as cabin. Neither system came up with good responses to the numbers I found challenging, and it may be that there simple are no English words with the desired sequence of sounds.

Before moving on, I wanted to mention one response from Google Bard that I found amusing. When I asked

Find a word that encodes 345 using the Major mnemonic system

Bard responded with

… The encoded number 345 is therefore MERLE.

However, there is no word in the English language that is spelled exactly as MERLE. However, there are a few words that are similar, such as Merry, Mettle, and Melt.

If you are looking for a word that is spelled exactly as MERLE, you may need to create a new word. For example, you could create the word Merlin, which is a name that is often associated with magic and wisdom.

Asking for sounds

The Major system is not well known, so I didn’t expect a direct approach to be that helpful. Next I tried asking for words with particular sequences of sounds rather than asking directly about the Major system.

When I asked both systems

Find a word containing the consonant sounds k, b, and n, in that order

ChatGPT came back with cabin and Bard said

There is no word in the English language that contains the consonant sounds k, b, and n in that order. The closest word is “knob”, but the n sound is pronounced as “ng” in this word.

So the point goes to GPT on that one.

Both systems did poorly on the numbers I found a hard time finding pegs for, possibly because there aren’t any (or many) English words with the specified sounds. With easier questions, the results were mixed.

When I asked

Find a word with the sounds g, r, and f

I was thinking giraffe. ChatGPT returned graf and Bard returned graft.

When I asked

Find a word with sounds l, f, and b

I was thinking, for example, lifeboat. But ChatGPT came back with

A word that contains the consonant sounds “l,” “f,” and “b” is “elfishly.”

and Bard replied

The word “blurb” contains the sounds l, f, and b in that order.

Phonetics

Both systems did well when I asked directly for the phonetic spelling of a word. So I tried to get both systems to come up with lifeboat by asking

Words whose IPA phonetic spellings contain l, f, and b

Bard found words that had some of these letters but not all. When I was more specific, asking

Words whose IPA phonetic spellings contain each of the letters l, f, and b

the results weren’t better.

ChatGPT, on the other hand, came up with affleb, triflingly, bluff, lifeboat, and fallopian. When I asked the more specific question, adding “each of”, ChatGPT returned alphabet, fulbright, bubluous, flyable, and belltry. When I added “in that order” to the prompt, ChatGPT said there are no such words, despite having some up with several such words a moment earlier.

Related posts

Levenshtein distance from Finnegans Wake to Return of the Jedi

Ewok

I ran into a delightfully strange blog post today called Finnegans Ewok that edits the first few paragraphs of Finnegans Wake to make it into something like Return of the Jedi.

(Unfortunately the page has gone away since I first wrote this. Some of the text is preserved in this Python script.)

The author, Adam Roberts, said via Twitter “What I found interesting here was how little I had to change Joyce’s original text. Tweak a couple of names and basically leave it otherwise as was.”

So what I wanted to do is quantify just how much had to change using the Levenshtein distance, which is essentially the number of one-character changes necessary to transform one string into another.

Here’s the first paragraph from James Joyce:

riverrun, past Eve and Adam’s, from swerve of shore to bend of bay, brings us by a commodius vicus of recirculation back to Howth Castle and Environs.

And here’s the first paragraph from Adam Roberts:

movierun, past new and hopes, from strike of back to bend of jeday, brings us by a commodius lucas of recirculation back to forestmoon and endor.

The original paragraph is 150 characters, the parody is 145 characters, and the Levenshtein distance is 44.

Here’s a summary of the results for the first four paragraphs.

    |-------+---------+----------|
    | Joyce | Roberts | Distance |
    |-------+---------+----------|
    |   150 |     145 |       44 |
    |   700 |     727 |      119 |
    |   594 |     615 |      145 |
    |  1053 |     986 |      333 |
    |-------+---------+----------|

The fifth paragraph seems to diverge more from Joyce. I maybe have gotten something misaligned, and reading enough of Finnegans Wake to debug the problem made my head hurt, so I stopped.

Update: See the next post for sequence alignment applied to the two sources. This lets you see not just the number of edits but what the edits are. This show why I was having difficulty aligning the fifth paragraphs.

Related posts

Unnatural language processing

Japanese Russian dictionary

Larry Wall, creator of the Perl programming language, created a custom degree plan in college, an interdisciplinary course of study in natural and artificial languages, i.e. linguistics and programming languages. Many of the features of Perl were designed as an attempt to apply natural language principles to the design of an artificial language.

I’ve been thinking of a different connection between natural and artificial languages, namely using natural language processing (NLP) to reverse engineer source code.

The source code of computer program is text, but not a text. That is, it consists of plain text files, but it’s not a text in the sense that Paradise Lost or an email is a text. The most efficient way to parse a programming language is as a programming language. Treating it as an English text will loose vital structure, and wrongly try to impose a foreign structure.

But what if you have two computer programs? That’s the problem I’ve been thinking about. I have code in two very different programming languages, and I’d like to know how functions in one code base relate to those in the other. The connections are not ones that a compiler could find. The connections are more psychological than algorithmic. I’d like to reverse engineer, for example, which function in language A a developer had in mind when he wrote a function in language B.

Both code bases are in programming language, but the function names are approximately natural language. If a pair of functions have the same name in both languages, and that name is not generic, then there’s a good chance they’re related. And if the names are similar, maybe they’re related.

I’ve done this sort of thing informally forever. I imagine most programmers do something like this from time to time. But only recently have I needed to do this on such a large scale that proceeding informally was not an option. I wrote a script to automate some of the work by looking for fuzzy matches between function names in both languages. This was far from perfect, but it reduced the amount of sleuthing necessary to line up the two sets of source code.

Around a year ago I had to infer which parts of an old Fortran program corresponded to different functions in a Python program. I also had to infer how some poorly written articles mapped to either set of source code. I did all this informally, but I wonder now whether NLP might have sped up my detective work.

Another situation where natural language processing could be helpful in software engineering is determining code authorship. Again this is something most programmers have probably done informally, saying things like “I bet Bill wrote this part of the code because it looks like his style” or “Looks like Pat left her fingerprints here.” This could be formalized using NLP techniques, and I imagine it has been. Just as Frederick Mosteller and colleagues did a statistical analysis of The Federalist Papers to determine who wrote which paper, I’m sure there have been similar analyses to try to find out who wrote what code, say for legal reasons.

Maybe this already has a name, but I like “unnatural language processing” for the application of natural language processing to unnatural (i.e. programming) languages. I’ve done a lot of ad hoc unnatural language processing, and I’m curious how much of it I could automate in the future.

Related NLP posts