How fast were dead languages spoken?

A new paper in Science suggests that all human languages carry about the same amount of information per unit time. In languages with fewer possible syllables, people speak faster. In languages with more syllables, people speak slower.

Researchers quantified the information content per syllable in 17 different languages by calculating Shannon entropy. When you multiply the information per syllable by the number of syllables per second, you get around 39 bits per second across a wide variety of languages.

If a language has N possible syllables, and the probability of the ith syllable occurring in speech is pi, then the average information content of a syllable, as measured by Shannon entropy, is

-\sum_{i=1}^N p_i \log_2 p_i

For example, if a language had only eight possible syllables, all equally likely, then each syllable would carry 3 bits of information. And in general, if there were 2^n syllables, all equally likely, then the information content per syllable would be n bits, just like n zeros and ones, hence the term bits.

Of course not all syllables are equally likely to occur, and so it’s not enough to know the number of syllables; you also need to know their relative frequency. For a fixed number of syllables, the more evenly the frequencies are distributed, the more information is carried per syllable.
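
As a concrete illustration, here is a minimal Python sketch (mine, not the researchers') that computes the Shannon entropy of a syllable distribution from raw frequency counts. The eight equally likely syllables mentioned above come out to exactly 3 bits, while a skewed distribution over the same eight syllables carries less.

    from math import log2

    def syllable_entropy(freqs):
        """Shannon entropy, in bits, of a distribution given by frequency counts."""
        total = sum(freqs)
        probs = [f / total for f in freqs]
        return -sum(p * log2(p) for p in probs if p > 0)

    print(syllable_entropy([1] * 8))                        # 3.0 bits
    print(syllable_entropy([40, 20, 10, 10, 8, 6, 4, 2]))   # about 2.5 bits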

If ancient languages conveyed information at 39 bits per second, as a variety of modern languages do, one could calculate the entropy of the language’s syllables and divide 39 by the entropy to estimate how many syllables the speakers spoke per second.

According to this overview of the research,

Japanese, which has only 643 syllables, had an information density of about 5 bits per syllable, whereas English, with its 6949 syllables, had a density of just over 7 bits per syllable. Vietnamese, with its complex system of six tones (each of which can further differentiate a syllable), topped the charts at 8 bits per syllable.

One could do the same calculations for Latin, ancient Greek, or Anglo-Saxon that the researchers did for Japanese, English, and Vietnamese.

If all 643 syllables of Japanese were equally likely, the language would convey -log2(1/643) = 9.3 bits of information per syllable. The overview says Japanese carries 5 bits per syllable, and so the efficiency of the language is 5/9.3 or about 54%.

If all 6949 syllables of English were equally likely, a syllable would carry 12.7 bits of information. Since English carries around 7 bits of information per syllable, the efficiency is 7/12.7 or about 55%.

Taking a wild guess by extrapolating from only two data points, maybe around 55% efficiency is common. If so, you could estimate the entropy per syllable of a language just from counting syllables.
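
Here is what that estimate looks like as a Python sketch. The 39 bits per second comes from the paper, but the 55% efficiency figure is only the guess above, and the 2000-syllable inventory is a made-up placeholder for whatever ancient language you had in mind.

    from math import log2

    BITS_PER_SECOND = 39     # cross-language information rate reported in the paper
    EFFICIENCY_GUESS = 0.55  # rough guess extrapolated from Japanese and English

    def estimated_syllables_per_second(num_syllables):
        """Guess the speech rate of a language from the size of its syllable inventory."""
        max_entropy = log2(num_syllables)         # entropy if all syllables were equally likely
        entropy = EFFICIENCY_GUESS * max_entropy  # guessed actual bits per syllable
        return BITS_PER_SECOND / entropy

    print(estimated_syllables_per_second(2000))   # roughly 6.5 syllables per second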

Estimating vocabulary size with Heaps’ law

Heaps’ law says that the number of unique words in a text of n words is approximated by

V(n) = K n^β

where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 and 0.6.
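
For a quick sense of scale, take the round illustrative values K = 40 and β = 0.5, both inside those typical ranges. A text of 10,000 words would then be predicted to contain about

V(10000) = 40 × 10000^0.5 = 4000

distinct words.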

(Note that it’s Heaps’ law, not Heap’s law. The law is named after Harold Stanley Heaps. However, true to Stigler’s law of eponymy, the law was first observed by someone else, Gustav Herdan.)

I’ll demonstrate Heaps’ law by looking at books of the Bible and then at novels of Jane Austen. I’ll also look at words that appear only once, what linguists call “hapax legomena.”

Demonstrating Heaps’ law

For a collection of related texts, you can estimate the parameters K and β from data. I decided to see how well Heaps’ law worked in predicting the number of unique words in each book of the Bible. I used the King James Version because it is easy to download from Project Gutenberg.

I converted each line to lower case, replaced all non-alphabetic characters with spaces, and split the text on spaces to obtain a list of words. This gave the following statistics:

    |------------+-------+------|
    | Book       |     n |    V |
    |------------+-------+------|
    | Genesis    | 38520 | 2448 |
    | Exodus     | 32767 | 2024 |
    | Leviticus  | 24621 | 1412 |
                    ...
    | III John   |   295 |  155 |
    | Jude       |   609 |  295 |
    | Revelation | 12003 | 1283 |
    |------------+-------+------|
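
Here is a minimal Python sketch of the preprocessing described above (lower-casing, stripping non-alphabetic characters, splitting on spaces) applied to a single block of text. The regular expression and the Genesis 1:1 example are my own illustration, not necessarily the exact code behind these tables.

    import re

    def word_stats(text):
        """Total word count n and vocabulary size V for a block of text."""
        words = re.sub(r"[^a-z]", " ", text.lower()).split()
        return len(words), len(set(words))

    # Example with a single verse; in the post this was run per book of the KJV.
    n, V = word_stats("In the beginning God created the heaven and the earth.")
    print(n, V)   # 10 words, 8 distinct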

The parameter values that best fit the data were K = 10.64 and β = 0.518, in keeping with the typical ranges of these parameters.
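
Since log V(n) = log K + β log n, one simple way to estimate the parameters is an ordinary least-squares fit on the log-log scale, as in the sketch below. This illustrates the idea; it is not necessarily the exact fitting procedure used for the numbers quoted here.

    import numpy as np

    def fit_heaps(ns, Vs):
        """Fit V = K * n**beta by least squares on the log-log scale."""
        beta, logK = np.polyfit(np.log(ns), np.log(Vs), 1)
        return np.exp(logK), beta

    # The (n, V) pairs shown in the table above.
    ns = [38520, 32767, 24621, 295, 609, 12003]
    Vs = [2448, 2024, 1412, 155, 295, 1283]
    print(fit_heaps(ns, Vs))  # roughly K = 9, beta = 0.52 from these six books alone;
                              # fitting all 66 books gave K = 10.64, beta = 0.518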

Here’s a sample of how the actual vocabulary size and predicted vocabulary size compare.

    |------------+------+-------|
    | Book       | True | Model |
    |------------+------+-------|
    | Genesis    | 2448 |  2538 |
    | Exodus     | 2024 |  2335 |
    | Leviticus  | 1412 |  2013 |
                    ...
    | III John   |  155 |   203 |
    | Jude       |  295 |   296 |
    | Revelation | 1283 |  1387 |
    |------------+------+-------|

Here’s a visual representation of the results.

[Figure: KJV Bible, vocabulary size vs. total words, linear scale]

It looks like the predictions are more accurate for small books, and that’s true on an absolute scale. But the relative error is actually smaller for large books as we can see by plotting again on a log-log scale.

[Figure: KJV Bible, vocabulary size vs. total words, log-log scale]

Jane Austen novels

It’s a little surprising that Heaps’ law applies well to books of the Bible since the books were composed over centuries and in two different languages. On the other hand, the same committee translated all the books at the same time. Maybe Heaps’ law applies to translations better than it applies to the original texts.

I expect Heaps’ law would fit more closely if you looked at, say, all the novels by a particular author, especially if the author wrote all the books in his or her prime. (I believe I read that someone did a vocabulary analysis of Agatha Christie’s novels and detected a decrease in her vocabulary in her latter years.)

To test this out I looked at Jane Austen’s novels on Project Gutenberg. Here’s the data:

    |-----------------------+--------+------|
    | Novel                 |      n |    V |
    |-----------------------+--------+------|
    | Northanger Abbey      |  78147 | 5995 |
    | Persuasion            |  84117 | 5738 |
    | Sense and Sensibility | 120716 | 6271 |
    | Pride and Prejudice   | 122811 | 6258 |
    | Mansfield Park        | 161454 | 7758 |
    | Emma                  | 161967 | 7092 |
    |-----------------------+--------+------|

The parameters in Heaps’ law work out to K = 121.3 and β = 0.341, a much larger K than before, and a smaller β.
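
As a sanity check, plugging the word count of Northanger Abbey into the fitted formula reproduces, up to rounding of the displayed parameters, the model value in the comparison table below:

V(78147) = 121.3 × 78147^0.341 ≈ 5656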

Here’s a comparison of the actual and predicted vocabulary sizes in the novels.

    |-----------------------+------+-------|
    | Novel                 | True | Model |
    |-----------------------+------+-------|
    | Northanger Abbey      | 5995 |  5656 |
    | Persuasion            | 5738 |  5799 |
    | Sense and Sensibility | 6271 |  6560 |
    | Pride and Prejudice   | 6258 |  6598 |
    | Mansfield Park        | 7758 |  7243 |
    | Emma                  | 7092 |  7251 |
    |-----------------------+------+-------|

If a suspected posthumous manuscript of Jane Austen were to appear, a possible test of authenticity would be to look at its vocabulary size to see if it is consistent with her other works. One could also look at the number of words used only once, as we discuss next.

Hapax legomenon

In linguistics, a hapax legomenon is a word that only appears once in a given context. The term comes from a Greek phrase meaning something said only once. The term is often shortened to just hapax.

I thought it would be interesting to look at the number of hapax legomena in each book since I could do it with a minor tweak of the code I wrote for the first part of this post.
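
The tweak is essentially to count word frequencies and keep only the words that occur exactly once. Here is a minimal sketch in the same spirit as the earlier word-count code; again the details are mine, not necessarily the original code.

    import re
    from collections import Counter

    def hapax_count(text):
        """Number of words that appear exactly once in the text."""
        words = re.sub(r"[^a-z]", " ", text.lower()).split()
        counts = Counter(words)
        return sum(1 for c in counts.values() if c == 1)

    print(hapax_count("In the beginning God created the heaven and the earth."))  # 7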

Normally if someone were speaking of hapax legomena in the context of the Bible, they’d be looking at unique words in the original languages, i.e. Hebrew and Greek, not in English translation. But I’m going with what I have at hand.

Here’s a plot of the number of hapax legomena in each book of the KJV compared to the total number of words in the book.

[Figure: hapax legomena vs. total words for each book of the KJV, linear scale]

This looks a lot like the plot of vocabulary size and total words, suggesting that the number of hapax legomena also follows a power law like Heaps’ law. This is evident when we plot the data again on a log-log scale and see a linear relation.

[Figure: hapax legomena vs. total words, log-log scale]

Just to be clear on the difference between the two analyses in this post, in the first we looked at vocabulary size, the number of distinct words in each book. In the second we looked at words that only appear once. In both cases we’re counting unique words, but unique in different senses. In the first analysis, unique means that each word only counts once, no matter how many times it’s used. In the second, unique means that a word appears only once.

Writing down an unwritten language

In this post I interview Greg Greenlaw, a friend of mine who served as a missionary to the Nakui tribe in Papua New Guinea and developed their writing system. (Nakui is pronounced like “knock we.”)

JC: When you went to PNG to learn Nakui was there any writing system?

GG: No, they had no way of writing words or numbers. They had names for only seven numbers — that was the extent of their counting system — but they could coordinate meetings more than a few days in the future by tying an equal number of knots in two vines. Each party would take a vine with them and loosen a knot each morning until they counted down to the appointed time — like an advent calendar, but without numbers!

JC: I believe you said that Nakui has a similar grammar to other languages in PNG but completely different vocabulary. Were you able to benefit from any other translation work in the area? Will your work serve as a starting point for translating another language?

GG: Yes. Missionaries had moved into our general area back in the ’80s and the languages they studied had mostly the same grammar features that we saw in the Nakui language. This sped the process, but in truth, the slowest part of learning to speak was not analyzing the grammar; it was training yourself to use it. Each Nakui verb could conjugate 300 different ways depending on person, recipient, tense, aspect, direction, and number.

JC: Do any of the Nakui speak a pidgin or any other language that served as a bridge for you to learn their language?

GG: We were blessed with one very good pidgin speaker when we moved into the village and several others who spoke the Melanesian Tok Pisin at a moderate level. That was a help in eliciting phrases and getting correction, but there were limits to how much they understood about their own speech. They had never thought about their language analytically and could not state a reason for the different suffixes used in the language; they just knew it was the right way to say something. Most peculiar was that they couldn’t even tell where certain word breaks were. Their language had never been “visible” to them. Nouns were easy to separate out, but it was hard to know if certain modifiers were affixes or separate function words.

JC: What are a few things that English speakers would find surprising about the language?

GG: Nakui has very few adjectives but innumerable verbs. They do the majority of their describing by using a very specific verb. There is a different verb for cutting something in the middle as opposed to cutting off a small section or cutting it long ways or cutting something to fall downward or cutting something to bits. The correct verb for the situation will also depend on the object being cut: is it log-like, leaf-like, or flesh-like?

JC: Do you try to learn a new language aurally first before starting on a writing system, or do you start developing a writing system early on so you can take notes?

GG: It’s language, so you always want your ears and tongue to lead your eyes. We use the International Phonetic Alphabet to write down notes, but we encourage language learners to hold off on writing anything for almost a month. After they are memorizing and pronouncing correctly a host of nouns and some noun modifiers, we encourage them to start writing down their data with the phonetic alphabet. About halfway to fluency they usually have enough perspective on the variety of sounds and where those sounds occur that they can narrow down a useful alphabet.

JC: How long did it take to learn the language? To create the writing system? How many people were working on the project?

GG: Primarily Tim Askew and myself, but there were three other missionaries that worked among the Nakui in the early days and each had some contribution to our understanding. The biggest credit really belongs to God, by whose grace we were allowed to live there and by whose strength we were able to chip away for three years until the job was done.

JC: Can you describe the process you went through to create the writing system?

GG: Well, after collecting thousands of words written phonetically, we could start a process of regularizing all that data. Looking at the language as a whole, we had to decide the sleekest, most uniform way to consistently spell those sounds phonetically. After cleaning up our data we could take a hard look at which of the sounds were unique and which were just modified by surrounding sounds. Phonetically the English language has three different ‘t’ sounds: an aspirated ‘t’, an unaspirated ‘t’, and an unreleased ‘t’. We only think of one ‘t’ sound, though, and it would be confusing (and irritating) if we had three ‘t’ symbols in our alphabet. Our ‘t’s just become unreleased automatically when they are at the end of a word and they become unaspirated when they blend with an ‘s’. You need to narrow down the number of sounds to just the ones that are significant and meaningful. It is only those unique sounds that are given letters in the alphabet.

JC: What is the alphabet you settled on? Do the letters represent sounds similar to what an English speaker would expect?

GG: We found 10 vowel sounds and 10 consonant sounds were needful in the Nakui language. The 10 consonants are symbolized the same as English, but the vowels had to be augmented. We borrowed the letter ‘v’ to become the short ‘u’ sound and used accent marks over other vowels to show their lowered and more nasalized quality.

JC: Could you give an example of something written in Nakui?

GG: This is John 3:16. Músvilv Sisasvyv me iyemvi, “Kotvlv asini niyanú nonú mvli mvlei ufau. Múmvimvilv, tuwani aluwalv siyv tuwayv itii, asinosv. Mvimeni notalv me svmvleliyelv, ‘Sisasvni yáfu kokuimvisv,’ múmeni nonv ba itinvini, i tínonvminvini.

JC: Is it unusual for a language to have as many vowel symbols as consonant symbols?

GG: Good question. I’m not sure what normal would be worldwide, but I know we have a lot more than five vowel sounds in English! All our English vowel symbols have double or triple allophones (long sounds, short sounds, alternate sounds), making it very challenging to instruct my 6-year-old in the skill of reading.

JC: Have you published anything on your work, say in a linguistics journal?

GG: No, there is nothing unique in what we did. Truth is, we organized our findings and reported our work in an informal, non-academic, pragmatic way because our goal was not a PhD in anthropology, but a redeeming ministry in the lives of tribal men and women and ultimately a commendation from the Maker of men.

JC: How many of the Nakui can read their language?

GG: All the young men in our village are readers now, with the exception of a dyslexic man who attempted our literacy course three times without result. About 1/3 of the young women can read. They have less available time for study, and in the early days the village leaders did not allow the girls to be in class with the boys. As a result, literacy for ladies had a slower start and fewer participants. There were older men who tried to learn to read, but none that succeeded. The gray matter seemed to have stiffened.

JC: If English didn’t have a writing system and you were designing one from scratch, what would you do differently?

GG: Wow, we have six kids we’ve taught to read – we would do a lot differently! The ideal is to have one symbol for each meaningful sound in the language. English would need more vowels and fewer consonants. Life would actually be more pleasant without ‘c’, ‘j’, ‘q’ and ‘x’. If you removed those and added about six vowels we could come up with a spelling system that was predictable and easy to learn.

JC: Anything else you’d like to say?

GG: One of the most remarkable things about the Nakui language is how sophisticated and complex it is. Unblended with other languages, it is extremely consistent in its grammar. Although it is in a constant state of change (rapid change relative to written languages of the world), it remains organized. This order is not to the credit of the society that speaks it; it happens utterly in spite of them. They are completely unaware of the structure of their own grammar and are perhaps the most disorderly people on the planet. The aspects of life over which they hold sway are corrupt and miserable, but in this small, unmanaged area of their lives, a glimpse of the divine still peeks out. The God of order, who created them in His own image, gave them an innate ability to communicate complex ideas in an orderly way. Like Greek, they specify singular, dual and plural. They have six tenses (English has only three), and 11 personal pronouns (English has only six). Every verb is conjugated to carry this specificity with extreme consistency. The verb ‘to go’ is the only irregular verb in the whole Nakui language. English, although vast, is a disheveled mess compared to what is spoken by these stone age people in the swamps of interior New Guinea.

All languages equally complex

This post compares complexity in spoken languages and programming languages.

There is a theory in linguistics that all human languages are equally complex. Languages may distribute their complexity in different ways, but the total complexity is roughly the same across all spoken languages. One language may be simpler in some aspect than another but more complicated in some other respect. For example, Chinese has simple grammar but a complex tonal system.

Even if all languages are equally complex, that doesn’t mean all languages are equally difficult to learn. An English speaker might find French easier to learn than Russian, not because French is simpler than Russian in some objective sense, but because French is more similar to English.

All spoken languages are supposed to be equally complex because languages reach an equilibrium between at least two forces. Skilled adult speakers tend to complicate languages by looking for ways to be more expressive. But children must be able to learn their language relatively quickly, and less skilled speakers need to be able to use the language as well.

I wonder what this says about programming languages. There are analogous dynamics. Programming languages can be relatively simpler in some way while being relatively complex in another way. And programming languages become more complex over time due to the demands of skilled users.

But there are several important differences. Programming languages are part of a complex system of language, standard libraries, idioms, tools, etc. It may make more sense to speak of a programming “system” to make better comparisons, taking into account the language and its environment.

I do not think that all programming systems are equally complex. Some are better designed than others. Some are more appropriate for a given task than others. Some programming systems achieve simplicity by sacrificing efficiency. Some abstractions leak less than others.

On the other hand, I imagine the levels of complexity are more similar when comparing programming systems than when comparing programming languages alone. Larry Wall said something to the effect that Perl is ugly so you can write beautiful programs in it. I think there’s some truth to that. A language can always be small and elegant by simply not providing much functionality, forcing the user to implement that functionality in application code.

See Larry Wall’s article Natural Language Principles in Perl for more comparisons of spoken languages and programming languages.
