Natural language processing and unnatural text

I recently evaluated two software applications designed to find PII (personally identifiable information) in free text using natural language processing. Both failed badly, passing over obvious examples of PII. By contrast, when I tried natural language processing software on a nonsensical poem, the software did quite well.

Doctor’s notes

It occurred to me later that the software packages that search for PII probably assume “natural language” has the form of fluent prose, not choppy notes by physicians. The notes that I tested did not consist of complete sentences marked up with grammatically correct punctuation. The text may have been transcribed from audio.

Some software packages deidentify medical notes better than others. I’ve seen some work well and some work poorly. I suspect the former were written specifically for their purpose and the latter were more generic.

Jabberwocky

I also tried NLP software on Lewis Carroll’s poem Jabberwocky. It too is unnatural language, but in a different sense.

Jabberwocky uses nonsense words that Carroll invented for the poem, but otherwise it is grammatically correct. The poem is standard English at the level of structure, though not at the level of words. It is the opposite of medical notes that are standard English at the word level (albeit with a high density of technical terms), but not at a structural level.

I used the spaCy natural language processing library on a couple of stanzas from Carroll’s poem.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

He took his vorpal sword in hand;
Long time the manxome foe he sought—
So rested he by the Tumtum tree
And stood awhile in thought.

I fed the lines into spaCy and asked it to diagram them, indicating parts of speech and dependencies. The software did a good job of inferring the use of even the nonsense words. I gave the software one line at a time rather than a stanza at a time because the latter results in diagrams that are awkwardly wide, too wide to display here. (The spaCy visualization software has a “compact” option, but this option does not make the visualizations much more compact.)

Here are the visualizations of the lines.

And here is the Python code I used to create the diagrams above.

    import spacy
    from spacy import displacy
    from pathlib import Path
    
    nlp = spacy.load("en_core_web_sm")
        
    lines = [
        "Beware the Jabberwock, my son!",
        "The jaws that bite, the claws that catch!",
        "Beware the Jubjub bird",
        "Shun the frumious Bandersnatch!",
        "He took his vorpal sword in hand.",
        "Long time the manxome foe he sought",
        "So rested he by the Tumtum tree",
        "And stood awhile in thought."
    ]
    
    for line in lines:
        doc = nlp(line)
        svg = displacy.render(doc, style="dep", jupyter=False)    
        file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
        output_path = Path(file_name)
        output_path.open("w", encoding="utf-8").write(svg)

Related posts

How rare is it to encounter a rare word?

biáng

I recently ran across a paper on typesetting rare Chinese characters. From the abstract:

Written Chinese has tens of thousands of characters. But most available fonts contain only around 6 to 12 thousand common characters that can meet the needs of everyday users. However, in publications and information exchange in many professional fields, a number of rare characters that are not in common fonts are needed in each document.

There’s sort of a paradox here: the author is saying it’s common to need rare words. Aren’t rare words, you know, rare? Of course they are, but the chances of needing some rare word, not just a particular rare word, can be large, particularly in lengthy documents.

This post gives a sort of back-of-the-envelope calculation to justify the preceding paragraph.

Word frequencies often approximately follow Zipf’s law, where the frequency of the nth most common word is proportional to n raised to the power −s for some positive exponent s. I’ve seen estimates that there are around N = 50,000 characters in Chinese, but that 1,000 characters make up about 90% of usage. This would correspond to a value of s around 1.25.

In practice, Zipf’s law, like all power laws, fits better over some parts of its range than others. We’re making a simplifying assumption by applying Zipf’s law to the entire vocabulary of Chinese, but this post isn’t trying to precisely model Chinese character frequency, only to show that the statement quoted above is plausible.

With our Zipf’s law model, the 10,000th most common character in Chinese would appear about 2 times in a million characters. But the combined frequency of all the characters from the 10,000th most common to the 50,000th most common would be about 0.03.

So if we list all characters in order of frequency and call everything after the 10,000th position on the list rare, the combined frequency of all rare characters is quite high, about 3%. To put it another way, a document of 1,000 characters would likely contain around 30 rare characters, according to the simplified model presented here.
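
Here is a quick numerical check of those figures. This is only a check of the simplified Zipf model above, not of actual character frequencies; the values N = 50,000 and s = 1.25 are the rough estimates quoted earlier.

    import numpy as np

    N, s = 50_000, 1.25
    ranks = np.arange(1, N + 1)
    freqs = ranks**(-s)
    freqs /= freqs.sum()            # normalize so the frequencies sum to 1

    print(freqs[:1000].sum())       # top 1,000 characters: about 0.9
    print(freqs[9_999])             # 10,000th most common character: about 2 per million
    print(freqs[10_000:].sum())     # everything past the 10,000th: about 0.03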

Related posts

[1] The Chinese character at the top of the post comes from here. According to the source, “The Chinese character ‘biáng’ used to represent Biang Biang noodles, is one of the most complex and rare Chinese characters. It has 56 strokes and cannot be found in modern dictionaries or entered into computers.”

How an LLM might leak medical data

Machine learning models occasionally memorize training data. Under the right prompt, a model could return portions of the training data verbatim.

If a large language model is trained on deidentified medical data, along with data that overlaps with the medical data, it could potentially leak details of a person’s medical history. I’m not saying that any particular LLM does leak data, but it’s something to watch out for if you are training an LLM.

One way to try to create a data leak would be to prompt the LLM with the data you have and see whether it will fill in more. For example, I asked Bing AI

Tell me about a man from Arkansas with ICD-10 code C50.921.

It came back with the following.

… Once upon a time, there was a man named John who lived in Arkansas. John was a hardworking man who loved spending time with his family and friends. One day, John noticed a lump in his right breast. He went to the doctor and was diagnosed with malignant neoplasm of unspecified site of the right male breast, classified under the ICD-10 code C50.921. …

I chose this diagnosis code because male breast cancer is extremely rare, and so the risk of reidentification is higher than for a more common diagnosis.

The model filled in the detail that the man’s name is John. This is not at all suspicious since John is one of the most common male names in history. If the man’s name had been Vivek, there would be more cause to suspect the model is telling us about a real man named Vivek, though of course that could be a randomly chosen male name.

If a neural network were trained on deidentified medical data, it could link fields together. If the story above had said “John, aged 42, …” the age might have been pulled from an actual patient record.

If the data the network was trained on was deidentified well, even leaking data verbatim should not create more than a very small risk of identification. However, if the data contained tokens linking the records to publicly available information, such as real estate records—this happens—then our hypothetical LLM might reveal more personal details that could be used to narrow down whose data is being leaked.

Related posts

V-statistics

A few days ago I wrote about U-statistics, statistics which can be expressed as the average of a symmetric function over all combinations of elements of a set. V-statistics can be written as the average of a symmetric function over the Cartesian product of a set with itself.

Let S be a statistical sample of size n and let h be a symmetric function of r elements. The average of h over all subsets of S with r elements is a U-statistic. The average of h over the Cartesian product of S with itself r times

\underbrace{S \times S \times \cdots \times S}_{r \text{ times}}

is a V-statistic.

As in the previous post, let h(x, y) = (x − y)²/2. We can illustrate the V-statistic associated with h with Python code as before.

    import numpy as np
    from itertools import product

    def var(xs):
        n = len(xs)
        h = lambda x, y: (x - y)**2/2
        # average h over all n**2 ordered pairs, i.e. samples drawn with replacement
        return sum(h(*c) for c in product(xs, repeat=2)) / n**2

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs))
    print(var(xs))

This time, however, we iterate over product rather than over combinations. Note also that at the bottom of the code we print

   np.var(xs)

rather than

   np.var(xs, ddof=1)

This means our code here is computing the population variance, not the sample variance. We could make this more explicit by supplying the default value of ddof.

   np.var(xs, ddof=0)

The point of V-statistics is not to calculate them as above, but that they could be calculated as above. Knowing that a statistic is an average of a symmetric function is theoretically advantageous, but computing a statistic this way would be inefficient.

U-statistics are averages of a function h over all subsamples of S of size r without replacement. V-statistics are averages of h over all subsamples of size r with replacement. The difference between sampling with or without replacement goes away as n increases, and so V-statistics have the same asymptotic properties as U-statistics.

Related posts

Filtering on how words are being used

Yesterday I wrote about how you could use the spaCy Python library to find proper nouns in a document. Now suppose you want to refine this and find proper nouns that are the subjects of sentences or proper nouns that are direct objects.

This post was motivated by a project in which I needed to pull out company names from a large amount of text, and it was important to know how the company name was being used.

Dependency labels

Tokens in spaCy have a dependency label attribute dep (or dep_ for its string representation). Dependency labels tell you how a word is being used. For example, dobj tells you the word is being used as a direct object, and nsubj tells you it’s being used as a nominal subject.

In yesterday’s post the line

    if tok.pos_ == "PROPN":
        print(tok)

filtered tokens to look for proper nouns. We could modify the script to also tell us how the proper nouns are being used by printing tok.dep_.

There are three proper nouns in the opening paragraph of Moby Dick: Ishmael, November, and Cato.

Call me Ishmael. … whenever it is a damp, drizzly November in my soul … With a philosophical flourish Cato throws himself upon his sword …

If we run

    if tok.pos_ == "PROPN":
        print(tok, tok.dep_)

on the first paragraph we get

    Ishmael oprd
    November attr
    Cato nsubj

but it’s not obvious what the output means. If we wrap tok.dep_ with spacy.explain we get a more verbose explanation.
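
Concretely, the modified filter would look something like this (spacy.explain maps a label string such as “nsubj” to a longer description):

    if tok.pos_ == "PROPN":
        print(tok, spacy.explain(tok.dep_))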

    Ishmael object predicate
    November attribute
    Cato nominal subject

Pulling out subjects

Now suppose we wanted to pull out words that are subjects. We could filter on tok.dep_ == "nsubj" but there are more kinds of subjects than just nominal subjects. There are six kinds of subjects:

  1. nsubj: nominal subject
  2. nsubjpass: nominal passive subject
  3. csubj: clausal subject
  4. csubjpass: clausal passive subject
  5. agent: agent
  6. expl: expletive
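
To catch all of these we could filter on membership in a set of subject labels rather than on nsubj alone. Here is a sketch along those lines, using the labels in the list above.

    subject_labels = {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"}
    for tok in doc:
        if tok.pos_ == "PROPN" and tok.dep_ in subject_labels:
            print(tok, tok.dep_)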

Finding the range of possible values for dependency labels takes some digging. I don’t believe it’s in the spaCy documentation per se, but if you’re persistent you’ll find a link to this list or the paper it came from.

Forever chemicals and blood donation

I saw a headline saying that donating blood lowers the level of forever chemicals in your body. This post will give a back-of-the-envelope calculation to show that this idea is plausible.

Suppose there are chemicals in your bloodstream that do not break down and that your body will not filter out. Suppose you have about 5 liters of blood and you donate 500ml of blood at a time, 10% of your total blood volume.

Presumably the blood you donate has the same proportion of forever chemicals as the blood you keep, so you lose 10% of your forever chemicals in a blood donation.

Assume you don’t absorb more forever chemicals after you start donating blood, and your body replaces the donated blood with new blood free of forever chemicals.

The quantity of forever chemicals in your blood after n donations is 0.9^n times the original amount. How many donations would it take to reduce your level of forever chemicals by half?

We need to solve

0.9^n = 0.5.

Taking logs we find

n = log(0.5)/log(0.9) = 6.58.

So after 7 donations, you should have reduced the level of forever chemicals in your blood by about a half. Assuming you donate every 8 weeks, this would take a little over a year.
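
Here is the same calculation in a few lines of Python, just to make the arithmetic explicit.

    from math import log

    p = 0.10                       # fraction of blood, and so of forever chemicals, removed per donation
    print(log(0.5) / log(1 - p))   # about 6.58, so 7 donations
    print((1 - p)**7)              # about 0.48 of the original level remains after 7 donations
    print(7 * 8 / 52)              # about 1.08 years at one donation every 8 weeks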

This is just a simplistic calculation. The result could be inaccurate or even entirely incorrect for any number of reasons. But it does show that the idea that blood donation lowers forever chemical levels is plausible.

Searching for proper nouns

Suppose you want to find all the proper nouns in a document. You could grep for every word that starts with a capital letter with something like

    grep -E '\b[A-Z]\w+'

but this would return the first word of each sentence in addition to the words you’re after.

You could grep for capitalized words that are not preceded by a period or question mark followed by a space.

    grep -P '(?<![.?] )\b[A-Z]\w+'

That’s possibly better, but it misses proper nouns at the beginning of a sentence.

You might be able to accomplish what you’re after by tinkering with regular expressions, but it would be better to use a library that has some idea of what a proper noun is.

NLP with spaCy

The Python natural language processing library spaCy classifies words by part of speech, and so in particular it can be used to search for proper nouns.

Here’s an example using the opening lines of Moby Dick.

    import spacy
    nlp = spacy.load("en_core_web_lg")

    doc = nlp("Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul ... I account it high time to get to sea as soon as I can.")

    for tok in doc:
        if tok.pos_ == "PROPN":
            print(tok)

This will print Ishmael and November only. It does not print words at the beginning of a sentence such as Call or Some even though they are capitalized. When spaCy got to the line

Queequeg was George Washington cannibalistically developed.

it detected that Queequeg is a proper noun. Presumably the model can tell this from context, because the word precedes the verb was, and not because it knows Queequeg is a proper name.

When I changed November to november spaCy was still able to detect that november was a proper noun. When I downcased Ishmael it did not detect that ishmael was a proper noun, presumably because Ishmael is an uncommon name. When I changed the text to “Call me tim” the library did recognize tim as a proper noun.

When I fed spaCy the sentence

I never go as a passenger; nor, though I am something of a salt, do I ever go to sea as a Commodore, or a Captain, or a Cook.

the library thought that Commodore, Captain, and Cook were proper nouns. If I downcase these words, spaCy does not flag them as proper nouns.

When processing the line

For as in this world, head winds are far more prevalent than winds from astern (that is, if you never violate the Pythagorean maxim), so for the most part the Commodore on the quarter-deck gets his atmosphere at second hand from the sailors on the forecastle

spaCy correctly flagged Commodore as a proper noun in this instance. Also, it did not classify Pythagorean as a proper noun; the word is proper but not a noun, i.e. it’s a proper adjective.

TANSTAAFL

My script above has only six lines of code. But it depends on a library that uses a 588 MB language model. [1]

Related posts

[1] “TANSTAAFL” stands for “There ain’t no such thing as a free lunch.” It comes from The Moon is a Harsh Mistress by Heinlein.

Incidentally, when I fed “The term TANSTAAFL comes from The Moon is a Harsh Mistress by Heinlein.” to spaCy, it flagged Harsh and Mistress as proper nouns.

When I fed it “The term TANSTAAFL comes from ‘The moon is a harsh mistress’ by Heinlein.” the library correctly tagged harsh as an adjective and mistress as a (non-proper) noun.

Moments of Tukey’s g-and-h distribution

John Tukey developed his so-called g-and-h distribution to be very flexible, having a wide variety of possible values of skewness and kurtosis. Although the reason for the distribution’s existence is its range of possible skewness and kurtosis values, calculating the skewness and kurtosis of the distribution is not simple.

Definition

Let φ be the function of one variable and four parameters defined by

\varphi(x; a, b, g, h) = a + b\left( \frac{\exp(gx) - 1}{g} \right) \exp(hx^2/2)

A random variable Y has a g-and-h distribution if it has the same distribution as φ(Z; a, b, g, h) where Z is a standard normal random variable. Said another way, if Y has a g-and-h distribution then the transformation φ⁻¹ makes the data normal.

The a and b parameters are for location and scale. The name of the distribution comes from the parameters g and h that control skewness and kurtosis respectively.

The transformation φ is invertible, but φ⁻¹ does not have a closed form; it must be computed numerically. It follows that the density function for Y does not have a closed form either.
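
Here is a minimal sketch of how one might compute φ⁻¹ numerically, assuming b > 0 and h ≥ 0 (so that φ is strictly increasing in x) and that SciPy is available. The function names and the bracket [−20, 20] are arbitrary choices for illustration.

    from math import exp
    from scipy.optimize import brentq

    def phi(x, a, b, g, h):
        return a + b * (exp(g*x) - 1) / g * exp(h * x**2 / 2)

    def phi_inverse(y, a, b, g, h, lo=-20.0, hi=20.0):
        # phi is strictly increasing, so a root finder on a wide bracket recovers x from y = phi(x)
        return brentq(lambda x: phi(x, a, b, g, h) - y, lo, hi)

    print(phi_inverse(phi(1.3, 0, 1, 0.5, 0.1), 0, 1, 0.5, 0.1))   # about 1.3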

Special cases

The g distribution is the g-and-h distribution with h = 0. It generalizes the log normal distribution.

The limit of the g-and-h distribution as g goes to 0 is the h distribution.

If g and h are both zero we get the normal distribution.

Calculating skewness and kurtosis

The following method of computing the moments of Y comes from [1].

Define f by

f(g, h, i) = \frac{1}{g^i\sqrt{1 - ih}} \sum_{r=0}^i \binom{i}{r} (-1)^r \exp\left(\frac{((i-r)g)^2}{2(1-ih)}\right)

Then the raw moments of Y are given by

\text{E} \, Y^m = \sum_{i=0}^m \binom{m}{i} a^{m-i}b^i f(g,h,i)

Skewness is the standardized 3rd central moment and kurtosis is the standardized 4th central moment. Equations for finding central moments from raw moments are given here.
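
Here is a Python sketch of the calculation above; the code is mine, not taken from [1]. It assumes g ≠ 0 and mh < 1 so that the moments in question exist, and it checks the first raw moment against a simple Monte Carlo simulation.

    import numpy as np
    from math import comb, exp, sqrt

    def f(g, h, i):
        # E[T(Z)^i] where T(z) = ((exp(g z) - 1)/g) exp(h z^2/2); requires i*h < 1
        return sum(comb(i, r) * (-1)**r * exp(((i - r)*g)**2 / (2*(1 - i*h)))
                   for r in range(i + 1)) / (g**i * sqrt(1 - i*h))

    def raw_moment(m, a, b, g, h):
        # E[Y^m] where Y = a + b*T(Z)
        return sum(comb(m, i) * a**(m - i) * b**i * f(g, h, i) for i in range(m + 1))

    a, b, g, h = 0, 1, 0.2, 0.1

    # sanity check of the first raw moment against simulation
    z = np.random.default_rng(1).standard_normal(10**6)
    y = a + b * (np.exp(g*z) - 1) / g * np.exp(h * z**2 / 2)
    print(raw_moment(1, a, b, g, h), y.mean())

    # skewness and kurtosis from the raw moments
    m1, m2, m3, m4 = (raw_moment(k, a, b, g, h) for k in (1, 2, 3, 4))
    mu2 = m2 - m1**2
    print((m3 - 3*m1*m2 + 2*m1**3) / mu2**1.5)                # skewness
    print((m4 - 4*m1*m3 + 6*m1**2*m2 - 3*m1**4) / mu2**2)     # kurtosis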

Related posts

[1] James B. McDonald and Patrick Turley. Distributional Characteristics: Just a Few More Moments. The American Statistician, Vol. 65, No. 2 (May 2011), pp. 96–103

Symmetric functions and U-statistics

A symmetric function is a function whose value is unchanged under every permutation of its arguments. The previous post showed how three symmetric functions of the sides of a triangle

  • a + b + c
  • ab + bc + ac
  • abc

are related to the perimeter, inner radius, and outer radius. It also mentioned that the coefficients of a cubic equation are symmetric functions of its roots.

This post looks briefly at symmetric functions in the context of statistics.

Let h be a symmetric function of r variables and suppose we have a set S of n numbers where n ≥ r. If we average h over all subsets of size r drawn from S then the result is another symmetric function, called a U-statistic. The “U” stands for unbiased.

If h(x) = x then the corresponding U-statistic is the sample mean.

If h(x, y) = (x − y)²/2 then the corresponding U-statistic is the sample variance. Note that this is the sample variance, not the population variance. You could see this as a justification for why the sample variance has an n−1 in the denominator while the corresponding term for the population variance has an n.

Here is some Python code that demonstrates that the average of (xy)²/2 over all pairs in a sample is indeed the sample variance.

    import numpy as np
    from itertools import combinations

    def var(xs):
        n = len(xs)
        bin = n*(n-1)/2    # number of pairs, C(n, 2)
        h = lambda x, y: (x - y)**2/2
        return sum(h(*c) for c in combinations(xs, 2)) / bin

    xs = np.array([2, 3, 5, 7, 11])
    print(np.var(xs, ddof=1))
    print(var(xs))

Note the ddof term that causes NumPy to compute the sample variance rather than the population variance.

Many statistics can be formulated as U-statistics, and so numerous properties of such statistics are corollaries of general results about U-statistics. For example, U-statistics are asymptotically normal, and so the sample variance is asymptotically normal.

Relating perimeter, inner radius, outer radius, and sides of a triangle

Suppose a triangle T has sides a, b, and c.

Let s be the semi-perimeter, i.e. half the perimeter.

Let r be the inner radius, the radius of the largest circle that can fit inside T.

Let R be the outer radius, the radius of the smallest circle that can enclose T.

Then three simple equations relate a, b, c, s, r, and R.

\begin{align*} a + b + c &= 2s \\ ab + bc + ac &= s^2 + r^2 +4rR \\ abc &= 4Rrs \end{align*}

Given a, b, and c, use the first equation to solve for s, then the third equation for Rr, then the second for r, then go back to the last equation to find R.
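
Here is a short Python sketch of that procedure; the code and the function name are mine. It checks the results against the classical formulas r = area/s and R = abc/(4 × area), with the area computed by Heron’s formula.

    from math import sqrt

    def s_r_R(a, b, c):
        s = (a + b + c) / 2                        # first equation
        Rr = a * b * c / (4 * s)                   # third equation gives the product R*r
        r = sqrt(a*b + b*c + a*c - s**2 - 4*Rr)    # second equation, solved for r
        return s, r, Rr / r

    a, b, c = 5, 6, 7
    s, r, R = s_r_R(a, b, c)
    area = sqrt(s * (s - a) * (s - b) * (s - c))   # Heron's formula
    print(r, area / s)             # both about 1.633
    print(R, a*b*c / (4*area))     # both about 3.572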

Given s, r, and R, you can calculate the right hand sides of the three equations above, which are the coefficients in a cubic equation for the sides a, b, and c.

x^3 - (2s)x^2 + (s^2 + r^2 + 4Rr)x - (4Rrs) = 0

Note that this last statement is not about triangles per se. It’s a consequence of

(x-a)(x-b)(x-c) = x^3 - (a+b+c)x^2 + (ab + bc + ac)x - abc

which would be true even if a, b, and c were not the sides of a triangle. But since they are sides of a triangle here, the coefficients can be interpreted in terms of geometry, namely in terms of perimeter, inner radius, and outer radius.
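
As a concrete check, the following snippet builds the cubic for the 5-6-7 triangle and recovers its sides with NumPy’s root finder. It recomputes s, r, and R from the sides, via Heron’s formula, so that it stands on its own.

    import numpy as np
    from math import sqrt

    # s, r, R for the 5-6-7 triangle
    a, b, c = 5, 6, 7
    s = (a + b + c) / 2
    area = sqrt(s * (s - a) * (s - b) * (s - c))
    r, R = area / s, a*b*c / (4*area)

    coefficients = [1, -2*s, s**2 + r**2 + 4*R*r, -4*R*r*s]
    print(np.roots(coefficients))   # approximately 7, 6, 5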

Related posts