How fast were dead languages spoken?

A new paper in Science suggests that all human languages carry about the same amount of information per unit time. In languages with fewer possible syllables, people speak faster. In languages with more syllables, people speak slower.

Researchers quantified the information content per syllable in 17 different languages by calculating Shannon entropy. When you multiply the information per syllable by the number of syllables per second, you get around 39 bits per second across a wide variety of languages.

If a language has N possible syllables, and the probability of the ith syllable occurring in speech is pi, then the average information content of a syllable, as measured by Shannon entropy, is

-\sum_{i=1}^N p_i \log_2 p_i

For example, if a language had only eight possible syllables, all equally likely, then each syllable would carry 3 bits of information. And in general, if there were 2^n syllables, all equally likely, then the information content per syllable would be n bits, just like n zeros and ones, hence the term bits.

Of course not all syllables are equally likely to occur, and so it’s not enough to know the number of syllables; you also need to know their relative frequency. For a fixed number of syllables, the more evenly the frequencies are distributed, the more information is carried per syllable.
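
As a concrete illustration, here is a minimal Python sketch (mine, not the researchers') that computes the Shannon entropy of a syllable distribution from raw frequency counts. The eight equally likely syllables mentioned above come out to exactly 3 bits, while a skewed distribution over the same eight syllables carries less.

    from math import log2

    def syllable_entropy(freqs):
        """Shannon entropy, in bits, of a distribution given by frequency counts."""
        total = sum(freqs)
        probs = [f / total for f in freqs]
        return -sum(p * log2(p) for p in probs if p > 0)

    print(syllable_entropy([1] * 8))                        # 3.0 bits
    print(syllable_entropy([40, 20, 10, 10, 8, 6, 4, 2]))   # about 2.5 bits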

If ancient languages conveyed information at 39 bits per second, as a variety of modern languages do, one could calculate the entropy of the language’s syllables and divide 39 by the entropy to estimate how many syllables the speakers spoke per second.

According to this overview of the research,

Japanese, which has only 643 syllables, had an information density of about 5 bits per syllable, whereas English, with its 6949 syllables, had a density of just over 7 bits per syllable. Vietnamese, with its complex system of six tones (each of which can further differentiate a syllable), topped the charts at 8 bits per syllable.

One could do the same calculations for Latin, ancient Greek, or Anglo-Saxon that the researchers did for Japanese, English, and Vietnamese.

If all 643 syllables of Japanese were equally likely, the language would convey -log2(1/643) = 9.3 bits of information per syllable. The overview says Japanese carries 5 bits per syllable, and so the efficiency of the language is 5/9.3 or about 54%.

If all 6949 syllables of English were equally likely, a syllable would carry 12.7 bits of information. Since English carries around 7 bits of information per syllable, the efficiency is 7/12.7 or about 55%.

Taking a wild guess by extrapolating from only two data points, maybe around 55% efficiency is common. If so, you could estimate the entropy per syllable of a language just from counting syllables.
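
Here is what that estimate looks like as a Python sketch. The 39 bits per second comes from the paper, but the 55% efficiency figure is only the guess above, and the 2000-syllable inventory is a made-up placeholder for whatever ancient language you had in mind.

    from math import log2

    BITS_PER_SECOND = 39     # cross-language information rate reported in the paper
    EFFICIENCY_GUESS = 0.55  # rough guess extrapolated from Japanese and English

    def estimated_syllables_per_second(num_syllables):
        """Guess the speech rate of a language from the size of its syllable inventory."""
        max_entropy = log2(num_syllables)         # entropy if all syllables were equally likely
        entropy = EFFICIENCY_GUESS * max_entropy  # guessed actual bits per syllable
        return BITS_PER_SECOND / entropy

    print(estimated_syllables_per_second(2000))   # roughly 6.5 syllables per second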

Estimating vocabulary size with Heaps’ law

Heaps’ law says that the number of unique words in a text of n words is approximated by

V(n) = K n^β

where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 and 0.6.
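
For a quick sense of scale, take the round illustrative values K = 40 and β = 0.5, both inside those typical ranges. A text of 10,000 words would then be predicted to contain about

V(10000) = 40 × 10000^0.5 = 4000

distinct words.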

(Note that it’s Heaps’ law, not Heap’s law. The law is named after Harold Stanley Heaps. However, true to Stigler’s law of eponymy, the law was first observed by someone else, Gustav Herdan.)

I’ll demonstrate Heaps’ law by looking at books of the Bible and then at novels of Jane Austen. I’ll also look at words that appear only once, what linguists call “hapax legomena.”

Demonstrating Heaps’ law

For a collection of related texts, you can estimate the parameters K and β from data. I decided to see how well Heaps’ law worked in predicting the number of unique words in each book of the Bible. I used the King James Version because it is easy to download from Project Gutenberg.

I converted each line to lower case, replaced all non-alphabetic characters with spaces, and split the text on spaces to obtain a list of words. This gave the following statistics:

    |------------+-------+------|
    | Book       |     n |    V |
    |------------+-------+------|
    | Genesis    | 38520 | 2448 |
    | Exodus     | 32767 | 2024 |
    | Leviticus  | 24621 | 1412 |
                    ...
    | III John   |   295 |  155 |
    | Jude       |   609 |  295 |
    | Revelation | 12003 | 1283 |
    |------------+-------+------|
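
Here is a minimal Python sketch of the preprocessing described above (lower-casing, stripping non-alphabetic characters, splitting on spaces) applied to a single block of text. The regular expression and the Genesis 1:1 example are my own illustration, not necessarily the exact code behind these tables.

    import re

    def word_stats(text):
        """Total word count n and vocabulary size V for a block of text."""
        words = re.sub(r"[^a-z]", " ", text.lower()).split()
        return len(words), len(set(words))

    # Example with a single verse; in the post this was run per book of the KJV.
    n, V = word_stats("In the beginning God created the heaven and the earth.")
    print(n, V)   # 10 words, 8 distinct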

The parameter values that best fit the data were K = 10.64 and β = 0.518, in keeping with the typical ranges of these parameters.
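
Since log V(n) = log K + β log n, one simple way to estimate the parameters is an ordinary least-squares fit on the log-log scale, as in the sketch below. This illustrates the idea; it is not necessarily the exact fitting procedure used for the numbers quoted here.

    import numpy as np

    def fit_heaps(ns, Vs):
        """Fit V = K * n**beta by least squares on the log-log scale."""
        beta, logK = np.polyfit(np.log(ns), np.log(Vs), 1)
        return np.exp(logK), beta

    # The (n, V) pairs shown in the table above.
    ns = [38520, 32767, 24621, 295, 609, 12003]
    Vs = [2448, 2024, 1412, 155, 295, 1283]
    print(fit_heaps(ns, Vs))  # roughly K = 9, beta = 0.52 from these six books alone;
                              # fitting all 66 books gave K = 10.64, beta = 0.518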

Here’s a sample of how the actual vocabulary size and predicted vocabulary size compare.

    |------------+------+-------|
    | Book       | True | Model |
    |------------+------+-------|
    | Genesis    | 2448 |  2538 |
    | Exodus     | 2024 |  2335 |
    | Leviticus  | 1412 |  2013 |
                    ...
    | III John   |  155 |   203 |
    | Jude       |  295 |   296 |
    | Revelation | 1283 |  1387 |
    |------------+------+-------|

Here’s a visual representation of the results.

[Figure: KJV Bible, vocabulary size vs. total words, linear scale]

It looks like the predictions are more accurate for small books, and that’s true on an absolute scale. But the relative error is actually smaller for large books as we can see by plotting again on a log-log scale.

[Figure: KJV Bible, vocabulary size vs. total words, log-log scale]

Jane Austen novels

It’s a little surprising that Heaps’ law applies well to books of the Bible since the books were composed over centuries and in two different languages. On the other hand, the same committee translated all the books at the same time. Maybe Heaps’ law applies to translations better than it applies to the original texts.

I expect Heaps’ law would fit more closely if you looked at, say, all the novels by a particular author, especially if the author wrote all the books in his or her prime. (I believe I read that someone did a vocabulary analysis of Agatha Christie’s novels and detected a decrease in her vocabulary in her latter years.)

To test this out I looked at Jane Austen’s novels on Project Gutenberg. Here’s the data:

    |-----------------------+--------+------|
    | Novel                 |      n |    V |
    |-----------------------+--------+------|
    | Northanger Abbey      |  78147 | 5995 |
    | Persuasion            |  84117 | 5738 |
    | Sense and Sensibility | 120716 | 6271 |
    | Pride and Prejudice   | 122811 | 6258 |
    | Mansfield Park        | 161454 | 7758 |
    | Emma                  | 161967 | 7092 |
    |-----------------------+--------+------|

The parameters in Heaps’ law work out to K = 121.3 and β = 0.341, a much larger K than before, and a smaller β.
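
As a sanity check, plugging the word count of Northanger Abbey into the fitted formula reproduces, up to rounding of the displayed parameters, the model value in the comparison table below:

V(78147) = 121.3 × 78147^0.341 ≈ 5656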

Here’s a comparison of the actual and predicted vocabulary sizes in the novels.

    |-----------------------+------+-------|
    | Novel                 | True | Model |
    |-----------------------+------+-------|
    | Northanger Abbey      | 5995 |  5656 |
    | Persuasion            | 5738 |  5799 |
    | Sense and Sensibility | 6271 |  6560 |
    | Pride and Prejudice   | 6258 |  6598 |
    | Mansfield Park        | 7758 |  7243 |
    | Emma                  | 7092 |  7251 |
    |-----------------------+------+-------|

If a suspected posthumous manuscript of Jane Austen were to appear, a possible test of authenticity would be to look at its vocabulary size to see if it is consistent with her other works. One could also look at the number of words used only once, as we discuss next.

Hapax legomenon

In linguistics, a hapax legomenon is a word that only appears once in a given context. The term comes from a Greek phrase meaning something said only once. The term is often shortened to just hapax.

I thought it would be interesting to look at the number of hapax legomena in each book since I could do it with a minor tweak of the code I wrote for the first part of this post.
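
The tweak is essentially to count word frequencies and keep only the words that occur exactly once. Here is a minimal sketch in the same spirit as the earlier word-count code; again the details are mine, not necessarily the original code.

    import re
    from collections import Counter

    def hapax_count(text):
        """Number of words that appear exactly once in the text."""
        words = re.sub(r"[^a-z]", " ", text.lower()).split()
        counts = Counter(words)
        return sum(1 for c in counts.values() if c == 1)

    print(hapax_count("In the beginning God created the heaven and the earth."))  # 7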

Normally if someone were speaking of hapax legomena in the context of the Bible, they’d be looking at unique words in the original languages, i.e. Hebrew and Greek, not in English translation. But I’m going with what I have at hand.

Here’s a plot of the number of hapax legomena in each book of the KJV compared to the total number of words in the book.

[Figure: hapax legomena vs. total words for each book of the KJV, linear scale]

This looks a lot like the plot of vocabulary size and total words, suggesting that the number of hapax legomena also follows a power law like Heaps’ law. This is evident when we plot the data again on a log-log scale and see a linear relation.

[Figure: hapax legomena vs. total words, log-log scale]

Just to be clear on the difference between the two analyses in this post, in the first we looked at vocabulary size, the number of distinct words in each book. In the second we looked at words that only appear once. In both cases we’re counting unique words, but unique in different senses. In the first analysis, unique means that each word only counts once, no matter how many times it’s used. In the second, unique means that a word appears only once.

Writing down an unwritten language

In this post I interview Greg Greenlaw, a friend of mine who served as a missionary to the Nakui tribe in Papua New Guinea and developed their writing system. (Nakui is pronounced like “knock we.”)

JC: When you went to PNG to learn Nakui was there any writing system?

GG: No, they had no way of writing words or numbers. They had names for only seven numbers — that was the extent of their counting system — but they could coordinate meetings more than a few days in the future by tying an equal number of knots in two vines. Each party would take a vine with them and loosen a knot each morning until they counted down to the appointed time — like an advent calendar, but without numbers!

JC: I believe you said that Nakui has a similar grammar to other languages in PNG but completely different vocabulary. Were you able to benefit from any other translation work in the area? Will your work serve as a starting point for translating another language?

GG: Yes. Missionaries had moved into our general area back in the ’80s and the languages they studied had mostly the same grammar features that we saw in the Nakui language. This sped the process, but in truth, the slowest part of learning to speak was not analyzing the grammar; it was training yourself to use it. Each Nakui verb could conjugate 300 different ways depending on person, recipient, tense, aspect, direction, and number.

JC: Do any of the Nakui speak a pidgin or any other language that served as a bridge for you to learn their language?

GG: We were blessed with one very good pidgin speaker when we moved into the village and several others who spoke the Melanesian Tok Pisin at a moderate level. That was a help in eliciting phrases and getting correction, but there were limits to how much they understood about their own speech. They had never thought about their language analytically and could not state a reason for the different suffixes used in the language; they just knew it was the right way to say something. Most peculiar was that they couldn’t even tell where certain word breaks were. Their language had never been “visible” to them. Nouns were easy to separate out, but it was hard to know if certain modifiers were affixes or separate function words.

JC: What are a few things that English speakers would find surprising about the language?

GG: Nakui has very few adjectives but innumerable verbs. They do the majority of their describing by using a very specific verb. There is a different verb for cutting something in the middle as opposed to cutting off a small section or cutting it long ways or cutting something to fall downward or cutting something to bits. The correct verb for the situation will also depend on the object being cut: is it log-like, leaf-like, or flesh-like?

JC: Do you try to learn a new language aurally first before starting on a writing system, or do you start developing a writing system early on so you can take notes?

GG: It’s language, so you always want your ears and tongue to lead your eyes. We use the International Phonetic Alphabet to write down notes, but we encourage language learners to hold off on writing anything for almost a month. After they are memorizing and pronouncing correctly a host of nouns and some noun modifiers, we encourage them to start writing down their data with the phonetic alphabet. About halfway to fluency they usually have enough perspective on the variety of sounds and where those sounds occur that they can narrow down a useful alphabet.

JC: How long did it take to learn the language? To create the writing system? How many people were working on the project?

GG: Primarily Tim Askew and myself, but there were three other missionaries that worked among the Nakui in the early days and each had some contribution to our understanding. The biggest credit really belongs to God, by whose grace we were allowed to live there and by whose strength we were able to chip away for three years until the job was done.

JC: Can you describe the process you went through to create the writing system?

GG: Well, after collecting thousands of words written phonetically, we could start a process of regularizing all that data. Looking at the language as a whole, we had to decide the sleekest, most uniform way to consistently spell those sounds phonetically. After cleaning up our data we could take a hard look at which of the sounds were unique and which were just modified by surrounding sounds. Phonetically the English language has three different ‘t’ sounds: an aspirated ‘t’, an unaspirated ‘t’, and an unreleased ‘t’. We only think of one ‘t’ sound, though, and it would be confusing (and irritating) if we had three ‘t’ symbols in our alphabet. Our ‘t’s just become unreleased automatically when they are at the end of a word and they become unaspirated when they blend with an ‘s’. You need to narrow down the number of sounds to just the ones that are significant and meaningful. It is only those unique sounds that are given letters in the alphabet.

JC: What is the alphabet you settled on? Do the letters represent sounds similar to what an English speaker would expect?

GG: We found 10 vowel sounds and 10 consonant sounds were needful in the Nakui language. The 10 consonants are symbolized the same as English, but the vowels had to be augmented. We borrowed the letter ‘v’ to become the short ‘u’ sound and used accent marks over other vowels to show their lowered and more nasalized quality.

JC: Could you give an example of something written in Nakui?

GG: This is John 3:16. Músvilv Sisasvyv me iyemvi, “Kotvlv asini niyanú nonú mvli mvlei ufau. Múmvimvilv, tuwani aluwalv siyv tuwayv itii, asinosv. Mvimeni notalv me svmvleliyelv, ‘Sisasvni yáfu kokuimvisv,’ múmeni nonv ba itinvini, i tínonvminvini.

JC: Is it unusual for a language to have as many vowel symbols as consonant symbols?

GG: Good question. I’m not sure what normal would be worldwide, but I know we have a lot more than five vowel sounds in English! All our English vowel symbols have double or triple allophones (long sounds, short sounds, alternate sounds), making it very challenging to instruct my 6-year-old in the skill of reading.

JC: Have you published anything on your work, say in a linguistics journal?

GG: No, there is nothing unique in what we did. Truth is, we organized our findings and reported our work in an informal, non-academic, pragmatic way because our goal was not a PhD in anthropology, but a redeeming ministry in the lives of tribal men and women and ultimately a commendation from the Maker of men.

JC: How many of the Nakui can read their language?

GG: All the young men in our village are readers now, with the exception of a dyslexic man who attempted our literacy course three times without result. About 1/3 of the young women can read. They have less available time for study, and in the early days the village leaders did not allow the girls to be in class with the boys. As a result, literacy for ladies had a slower start and fewer participants. There were older men who tried to learn to read, but none that succeeded. The gray matter seemed to have stiffened.

JC: If English didn’t have a writing system and you were designing one from scratch, what would you do differently?

GG: Wow, we have six kids we’ve taught to read – we would do a lot differently! The ideal is to have one symbol for each meaningful sound in the language. English would need more vowels and fewer consonants. Life would actually be more pleasant without ‘c’, ‘j’, ‘q’ and ‘x’. If you removed those and added about six vowels we could come up with a spelling system that was predictable and easy to learn.

JC: Anything else you’d like to say?

GG: One of the most remarkable things about the Nakui language is how sophisticated and complex it is. Unblended with other languages, it is extremely consistent in its grammar. Although it is in a constant state of change (rapid change relative to written languages of the world), it remains organized. This order is not to the credit of the society that speaks it; it happens utterly in spite of them. They are completely unaware of the structure of their own grammar and are perhaps the most disorderly people on the planet. The aspects of life over which they hold sway are corrupt and miserable, but in this small, unmanaged area of their lives, a glimpse of the divine still peeks out. The God of order, who created them in His own image, gave them an innate ability to communicate complex ideas in an orderly way. Like Greek, they specify singular, dual and plural. They have six tenses (English has only three), and 11 personal pronouns (English has only six). Every verb is conjugated to carry this specificity with extreme consistency. The verb ‘to go’ is the only irregular verb in the whole Nakui language. English, although vast, is a disheveled mess compared to what is spoken by these stone age people in the swamps of interior New Guinea.

All languages equally complex

This post compares complexity in spoken languages and programming languages.

There is a theory in linguistics that all human languages are equally complex. Languages may distribute their complexity in different ways, but the total complexity is roughly the same across all spoken languages. One language may be simpler in some aspect than another but more complicated in some other respect. For example, Chinese has simple grammar but a complex tonal system.

Even if all languages are equally complex, that doesn’t mean all languages are equally difficult to learn. An English speaker might find French easier to learn than Russian, not because French is simpler than Russian in some objective sense, but because French is more similar to English.

All spoken languages are supposed to be equally complex because languages reach an equilibrium between at least two forces. Skilled adult speakers tend to complicate languages by looking for ways to be more expressive. But children must be able to learn their language relatively quickly, and less skilled speakers need to be able to use the language as well.

I wonder what this says about programming languages. There are analogous dynamics. Programming languages can be relatively simpler in some way while being relatively complex in another way. And programming languages become more complex over time due to the demands of skilled users.

But there are several important differences. Programming languages are part of a complex system of language, standard libraries, idioms, tools, etc. It may make more sense to speak of a programming “system” to make better comparisons, taking into account the language and its environment.

I do not think that all programming systems are equally complex. Some are better designed than others. Some are more appropriate for a given task than others. Some programming systems achieve simplicity by sacrificing efficiency. Some abstractions leak less than others.

On the other hand, I imagine the levels of complexity are more similar when comparing programming systems than when comparing programming languages alone. Larry Wall said something to the effect that Perl is ugly so you can write beautiful programs in it. I think there’s some truth to that. A language can always be small and elegant by simply not providing much functionality, forcing the user to implement that functionality in application code.

See Larry Wall’s article Natural Language Principles in Perl for more comparisons of spoken languages and programming languages.
