{"id":81963,"date":"2021-08-16T14:38:14","date_gmt":"2021-08-16T19:38:14","guid":{"rendered":"https:\/\/www.johndcook.com\/blog\/?p=81963"},"modified":"2023-02-06T08:27:49","modified_gmt":"2023-02-06T14:27:49","slug":"initial-letter-frequency","status":"publish","type":"post","link":"https:\/\/www.johndcook.com\/blog\/2021\/08\/16\/initial-letter-frequency\/","title":{"rendered":"Initial letter frequency"},"content":{"rendered":"<p>I needed to know the frequencies of letters at the beginning of words for a project. The overall frequency of letters, wherever they appear in a word, is well known. Initial frequencies are not so common, so I did a little experiment.<\/p>\n<p>I downloaded the <a href=\"https:\/\/corpus.canterbury.ac.nz\/descriptions\/\">Canterbury Corpus<\/a> and looked at the frequency of initial letters in a couple of the files in the corpus. I first tried a different approach, then realized a shell one-liner [1] would be simpler and less error-prone.<\/p>\n<pre>cat alice29.txt | lc | grep -o '\\b[a-z]' | sort | uniq -c | sort -rn<\/pre>\n<p>This shows that the letters in descending order of frequency at the beginning of a word are <em>t<\/em>, <em>a<\/em>, <em>s<\/em>, \u2026, <em>j<\/em>, <em>x<\/em>, <em>z<\/em>.<\/p>\n<p>The file <code>alice29.txt<\/code> is the text of <em>Alice&#8217;s Adventures in Wonderland<\/em>. Then for comparison I ran the same script on another file, <code>lcet10.txt<\/code>, a lengthy report from a workshop on electronic texts.<\/p>\n<p>This technical report&#8217;s initial letter frequencies order the alphabet <em>t<\/em>, <em>a<\/em>, <em>o<\/em>, \u2026, <em>y<\/em>, <em>z<\/em>, <em>x<\/em>. So starting with the third letter, the two files rank initial letters differently.<\/p>\n<p>I made the following plot to visualize how the frequencies differ. 
The horizontal axis is sorted by overall letter frequency (based on the Google corpus summarized <a href=\"http:\/\/norvig.com\/mayzner.html\">here<\/a>).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium\" src=\"https:\/\/www.johndcook.com\/alice_tech_initial.png\" width=\"640\" height=\"480\" \/><\/p>\n<p>I expected the initial letter frequencies to differ from overall letter frequencies, but I did not expect the two corpora to differ.<\/p>\n<p>Apparently initial letter frequencies vary more across corpora than overall letter frequencies. The following plot shows the overall letter frequencies for both corpora, with the horizontal axis again sorted by the frequency in the Google corpus.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium\" src=\"https:\/\/www.johndcook.com\/alice_tech_all.png\" width=\"640\" height=\"480\" \/><\/p>\n<p>Here the two corpora essentially agree with each other and with the Google corpus. The orange dashed line is mostly decreasing, which means the tech report ranks letters in essentially the same order as the Google corpus, though there is a curious kink in the graph at <em>c<\/em>.<\/p>\n<h2>Related posts<\/h2>\n<ul>\n<li class=\"link\"><a href=\"https:\/\/www.johndcook.com\/blog\/2016\/09\/02\/etaoin-shrdlu-and-all-that\/\">ETAOIN SHRDLU and all that<\/a><\/li>\n<li class=\"link\"><a href=\"https:\/\/www.johndcook.com\/blog\/2019\/10\/18\/chinese-character-entropy\/\">Chinese character frequency and entropy<\/a><\/li>\n<li class=\"link\"><a href=\"https:\/\/www.johndcook.com\/blog\/2021\/08\/14\/index-of-coincidence\/\">Index of coincidence<\/a><\/li>\n<\/ul>\n<p>[1] The <code>lc<\/code> function converts its input to lower case. 
See <a href=\"https:\/\/www.johndcook.com\/blog\/2021\/07\/22\/case-folding\/\">this post<\/a> for how to install and use the function.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I needed to know the frequencies of letters at the beginning of words for a project. The overall frequency of letters, wherever they appear in a word, is well known. Initial frequencies are not so common, so I did a little experiment. I downloaded the Canterbury Corpus and looked at the frequency of initial letters [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[17],"tags":[106],"class_list":["post-81963","post","type-post","status-publish","format-standard","hentry","category-statistics","tag-probability-and-statistics"],"acf":[],"aioseo_notices":[],"aioseo_head":"\n\t\t<!-- All in One SEO 4.9.8 - aioseo.com -->\n\t<meta name=\"description\" content=\"The frequencies of letters at the beginning of English words is different than the overall frequency, and the distribution changes a lot between corpora.\" \/>\n\t<meta name=\"robots\" content=\"max-image-preview:large\" \/>\n\t<meta name=\"author\" content=\"John\"\/>\n\t<meta name=\"keywords\" content=\"probability and statistics\" \/>\n\t<link rel=\"canonical\" href=\"https:\/\/www.johndcook.com\/blog\/2021\/08\/16\/initial-letter-frequency\/\" \/>\n\t<meta name=\"generator\" content=\"All in One SEO (AIOSEO) 4.9.8\" \/>\n\t\t<meta property=\"og:locale\" content=\"en_US\" \/>\n\t\t<meta property=\"og:site_name\" content=\"John D. 
Cook | Applied Mathematics Consulting\" \/>\n\t\t<meta property=\"og:type\" content=\"article\" \/>\n\t\t<meta property=\"og:title\" content=\"Frequency of letters at beginning of English words\" \/>\n\t\t<meta property=\"og:description\" content=\"The frequencies of letters at the beginning of English words is different than the overall frequency, and the distribution changes a lot between corpora.\" \/>\n\t\t<meta property=\"og:url\" content=\"https:\/\/www.johndcook.com\/blog\/2021\/08\/16\/initial-letter-frequency\/\" \/>\n\t\t<meta property=\"article:published_time\" content=\"2021-08-16T19:38:14+00:00\" \/>\n\t\t<meta property=\"article:modified_time\" content=\"2023-02-06T14:27:49+00:00\" \/>\n\t\t<meta name=\"twitter:card\" content=\"summary\" \/>\n\t\t<meta name=\"twitter:title\" content=\"Frequency of letters at beginning of English words\" \/>\n\t\t<meta name=\"twitter:description\" content=\"The frequencies of letters at the beginning of English words is different than the overall frequency, and the distribution changes a lot between corpora.\" \/>\n\t\t<meta name=\"twitter:image\" content=\"https:\/\/www.johndcook.com\/blog\/wp-content\/uploads\/2022\/05\/twittercard.png\" \/>\n\t\t<!-- All in One SEO -->\n\n","aioseo_head_json":{"title":"Frequency of letters at beginning of English words","description":"The frequencies of letters at the beginning of English words is different than the overall frequency, and the distribution changes a lot between corpora.","canonical_url":"https:\/\/www.johndcook.com\/blog\/2021\/08\/16\/initial-letter-frequency\/","robots":"max-image-preview:large","keywords":"probability and statistics","webmasterTools":{"miscellaneous":""},"schema":null,"og:locale":"en_US","og:site_name":"John D. 
Cook | Applied Mathematics Consulting","og:type":"article","og:title":"Frequency of letters at beginning of English words","og:description":"The frequencies of letters at the beginning of English words is different than the overall frequency, and the distribution changes a lot between corpora.","og:url":"https:\/\/www.johndcook.com\/blog\/2021\/08\/16\/initial-letter-frequency\/","article:published_time":"2021-08-16T19:38:14+00:00","article:modified_time":"2023-02-06T14:27:49+00:00","twitter:card":"summary","twitter:title":"Frequency of letters at beginning of English words","twitter:description":"The frequencies of letters at the beginning of English words is different than the overall frequency, and the distribution changes a lot between corpora.","twitter:image":"https:\/\/www.johndcook.com\/blog\/wp-content\/uploads\/2022\/05\/twittercard.png"},"aioseo_meta_data":{"post_id":"81963","title":"Frequency of letters at beginning of English words","description":"The frequencies of letters at the beginning of English words is different than the overall frequency, and the distribution changes a lot between 
corpora.","keywords":[],"keyphrases":{"focus":{"keyphrase":"","score":0,"analysis":{"keyphraseInTitle":{"score":0,"maxScore":9,"error":1}}},"additional":[]},"primary_term":null,"canonical_url":null,"og_title":null,"og_description":null,"og_object_type":"default","og_image_type":"default","og_image_url":null,"og_image_width":null,"og_image_height":null,"og_image_custom_url":null,"og_image_custom_fields":null,"og_video":"","og_custom_url":null,"og_article_section":null,"og_article_tags":[],"twitter_use_og":false,"twitter_card":"default","twitter_image_type":"default","twitter_image_url":null,"twitter_image_custom_url":null,"twitter_image_custom_fields":null,"twitter_title":null,"twitter_description":null,"schema":{"blockGraphs":[],"customGraphs":[],"default":{"data":{"Article":[],"Course":[],"Dataset":[],"FAQPage":[],"Movie":[],"Person":[],"Product":[],"ProductReview":[],"Car":[],"Recipe":[],"Service":[],"SoftwareApplication":[],"WebPage":[]},"graphName":"Article","isEnabled":true},"graphs":[],"defaultGraph":"Article","defaultPostTypeGraph":""},"schema_type":"default","schema_type_options":"{\"article\":{\"articleType\":\"BlogPosting\"},\"course\":{\"name\":\"\",\"description\":\"\",\"provider\":\"\"},\"faq\":{\"pages\":[]},\"product\":{\"reviews\":[]},\"recipe\":{\"ingredients\":[],\"instructions\":[],\"keywords\":[]},\"software\":{\"reviews\":[],\"operatingSystems\":[]},\"webPage\":{\"webPageType\":\"WebPage\"}}","pillar_content":false,"robots_default":true,"robots_noindex":false,"robots_noarchive":false,"robots_nosnippet":false,"robots_nofollow":false,"robots_noimageindex":false,"robots_noodp":false,"robots_notranslate":false,"robots_max_snippet":"-1","robots_max_videopreview":"-1","robots_max_imagepreview":"large","priority":null,"frequency":"default","location":null,"local_seo":null,"breadcrumb_settings":null,"limit_modified_date":false,"created":"2021-08-16 15:56:26","updated":"2025-06-04 
02:12:11","ai":null,"seo_analyzer_scan_date":null},"aioseo_breadcrumb":"<div class=\"aioseo-breadcrumbs\"><span class=\"aioseo-breadcrumb\">\n\t\t\t<a href=\"https:\/\/www.johndcook.com\/blog\" title=\"Home\">Home<\/a>\n\t\t<\/span><span class=\"aioseo-breadcrumb-separator\">&raquo;<\/span><span class=\"aioseo-breadcrumb\">\n\t\t\t<a href=\"https:\/\/www.johndcook.com\/blog\/category\/statistics\/\" title=\"Statistics\">Statistics<\/a>\n\t\t<\/span><span class=\"aioseo-breadcrumb-separator\">&raquo;<\/span><span class=\"aioseo-breadcrumb\">\n\t\t\tInitial letter frequency\n\t\t<\/span><\/div>","aioseo_breadcrumb_json":[{"label":"Home","link":"https:\/\/www.johndcook.com\/blog"},{"label":"Statistics","link":"https:\/\/www.johndcook.com\/blog\/category\/statistics\/"},{"label":"Initial letter frequency","link":"https:\/\/www.johndcook.com\/blog\/2021\/08\/16\/initial-letter-frequency\/"}],"_links":{"self":[{"href":"https:\/\/www.johndcook.com\/blog\/wp-json\/wp\/v2\/posts\/81963","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.johndcook.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.johndcook.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.johndcook.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.johndcook.com\/blog\/wp-json\/wp\/v2\/comments?post=81963"}],"version-history":[{"count":0,"href":"https:\/\/www.johndcook.com\/blog\/wp-json\/wp\/v2\/posts\/81963\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.johndcook.com\/blog\/wp-json\/wp\/v2\/media?parent=81963"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.johndcook.com\/blog\/wp-json\/wp\/v2\/categories?post=81963"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.johndcook.com\/blog\/wp-json\/wp\/v2\/tags?post=81963"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}