How do the frequencies of chemical element names in English text compare to the abundance of elements in Earth’s crust? Do we write most frequently about the elements that appear most frequently?
It turns out the answer is “not really.” The rarest elements rarely appear in writing. We don’t have much to say about dysprosium, thulium, or lutetium, for example. But overall there’s only a small correlation between word frequency and chemical frequency. (The rank correlation is substantially higher than ordinary linear correlation.)
We write often about things like oxygen and iron because they’re such a part of the human experience. On the other hand, we care about some things like silver and gold precisely because they are rare.
Here are the most common elements according to text usage.
|------------+--------+-----------+---------+------------|
| element    | word % | word rank | earth % | earth rank |
|------------+--------+-----------+---------+------------|
| lead       |  15.50 |         1 |   0.001 |         36 |
| gold       |  11.64 |         2 |   0.000 |         75 |
| iron       |  11.14 |         3 |   5.612 |          4 |
| silver     |   7.38 |         4 |   0.000 |         68 |
| carbon     |   5.15 |         5 |   0.012 |         17 |
| oxygen     |   5.13 |         6 |  45.956 |          1 |
| copper     |   4.61 |         7 |   0.006 |         26 |
| hydrogen   |   3.51 |         8 |   0.139 |         10 |
| sodium     |   3.38 |         9 |   2.352 |          6 |
| calcium    |   2.84 |        10 |   4.137 |          5 |
| nitrogen   |   2.79 |        11 |   0.002 |         34 |
| mercury    |   2.22 |        12 |   0.000 |         67 |
| tin        |   2.13 |        13 |   0.000 |         51 |
| potassium  |   1.94 |        14 |   2.083 |          8 |
| zinc       |   1.70 |        15 |   0.007 |         24 |
| silicon    |   1.12 |        16 |  28.112 |          2 |
| nickel     |   1.08 |        17 |   0.008 |         23 |
| phosphorus |   1.05 |        18 |   0.104 |         11 |
| magnesium  |   0.98 |        19 |   2.322 |          7 |
| sulfur     |   0.84 |        20 |   0.035 |         16 |
|------------+--------+-----------+---------+------------|
This is based on the Google book corpus summarized here. There’s some ambiguity; I imagine most uses of “lead” are the verb and not the element name. Some portion of the uses of “iron” refer to a device for smoothing wrinkles out of clothes.
Word percentage is relative to the set of chemical element names. Earth percentage is relative to the Earth’s crust.
The percentages above have been rounded for presentation; obviously the abundance of gold, silver, mercury, and tin is not zero, though it is when rounded to three decimal places. The full data for the first 111 elements is available here.
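To make the distinction between linear and rank correlation concrete, here is a minimal sketch that computes both on the 20 rows of the table above. It assumes only those rows, which are the 20 most-written-about elements rather than a random sample, so its output should not be expected to match the figures for the full set of 111 elements. The helper names `pearson`, `ranks`, and `spearman` are mine, written in plain Python so nothing beyond the standard library is needed.

```python
from statistics import mean

# (element, word %, earth %, earth rank) copied from the table above.
# Earth rank is the rank among all elements in the full data set.
rows = [
    ("lead", 15.50, 0.001, 36),
    ("gold", 11.64, 0.000, 75),
    ("iron", 11.14, 5.612, 4),
    ("silver", 7.38, 0.000, 68),
    ("carbon", 5.15, 0.012, 17),
    ("oxygen", 5.13, 45.956, 1),
    ("copper", 4.61, 0.006, 26),
    ("hydrogen", 3.51, 0.139, 10),
    ("sodium", 3.38, 2.352, 6),
    ("calcium", 2.84, 4.137, 5),
    ("nitrogen", 2.79, 0.002, 34),
    ("mercury", 2.22, 0.000, 67),
    ("tin", 2.13, 0.000, 51),
    ("potassium", 1.94, 2.083, 8),
    ("zinc", 1.70, 0.007, 24),
    ("silicon", 1.12, 28.112, 2),
    ("nickel", 1.08, 0.008, 23),
    ("phosphorus", 1.05, 0.104, 11),
    ("magnesium", 0.98, 2.322, 7),
    ("sulfur", 0.84, 0.035, 16),
]


def pearson(xs, ys):
    """Ordinary (linear) correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def ranks(values):
    """Ranks with 1 = smallest; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(xs, ys):
    """Rank correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


word_pct = [r[1] for r in rows]
earth_pct = [r[2] for r in rows]
# Negate earth rank so larger means more abundant. This avoids the ties
# that rounding to three decimals creates in the earth % column.
abundance_order = [-r[3] for r in rows]

print(f"Pearson (percentages): {pearson(word_pct, earth_pct):.3f}")
print(f"Spearman (ranks):      {spearman(word_pct, abundance_order):.3f}")
```

The rank version uses the earth rank column rather than the rounded percentages, because gold, silver, mercury, and tin all show as 0.000 and would otherwise be treated as tied.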
There is also the issue of letters such as o and l being treated as digits, plus sulphur being an element outside the US ;-)
In some cases Google books translated oxford as a hexadecimal literal:
https://shape-of-code.coding-guidelines.com/2011/02/28/365/
I dread to think what might have happened to oxygen!
We might also consider including other forms of elements; for example, chlorine probably ranks higher if you also include mentions of chloride. There’s some room for debate about what exactly should count. (Should “ferrous” count as a mention of iron? How about “rust” or “steel”?) In any case, it seems wrong that, by an accident of nomenclature, “sodium chloride” counts as a reference to sodium but not to chlorine.
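One way to experiment with this is to make the counting rule explicit as a table of word forms per element, then see how the tallies change as forms are added or removed. The sketch below is only an illustration: the `ALIASES` table and the `count_elements` function are hypothetical, and which forms belong in the table is exactly the open question.

```python
import re
from collections import Counter

# Hypothetical alias table: the word forms that count toward each element.
# Deciding what goes in here ("rust"? "steel"?) is the debatable part.
ALIASES = {
    "sodium": ["sodium"],
    "chlorine": ["chlorine", "chloride"],
    "iron": ["iron", "ferrous"],
    "sulfur": ["sulfur", "sulphur"],
}


def count_elements(text, aliases=ALIASES):
    """Count mentions of each element, crediting every listed word form."""
    form_to_element = {
        form: element for element, forms in aliases.items() for form in forms
    }
    counts = Counter()
    # Lowercase the text and split it into runs of letters.
    for token in re.findall(r"[a-z]+", text.lower()):
        element = form_to_element.get(token)
        if element is not None:
            counts[element] += 1
    return counts


sample = "Sodium chloride is table salt. Ferrous sulphate contains iron and sulphur."
counts = count_elements(sample)
print(counts)
```

With this table, “sodium chloride” credits both sodium and chlorine, and the British spelling “sulphur” is folded into sulfur. Note that this simple matching does nothing about ambiguous words: “lead” the verb and “iron” the appliance would still be counted.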