#71 IV̶V̶
What is the distribution of word length across Project Hail Mary?
The most common word length in terms of distinct words (types) is 6 with 1,651. In other words, there are 1,651 distinct words of length 6.
The most common word length in terms of actual occurrences (tokens) is 4 with 31,300. In other words, there are 31,300 occurrences in the text of words of length 4.
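The difference between the two counts can be sketched in a few lines of Python. This is only an illustration, not the script used for the figures above: the function name `length_distributions` and the sample words are made up, and it assumes the text has already been tokenized and that case is left as found.

```python
from collections import Counter

def length_distributions(tokens):
    """Return (type_lengths, token_lengths), each a Counter keyed by word length."""
    tokens = list(tokens)
    # Tokens: every occurrence in the text counts once.
    token_lengths = Counter(len(word) for word in tokens)
    # Types: each distinct word counts once, however often it occurs.
    type_lengths = Counter(len(word) for word in set(tokens))
    return type_lengths, token_lengths

# Hypothetical sample, not taken from the book's actual counts.
sample = ["rock", "rock", "good", "amaze", "question", "good"]
type_lengths, token_lengths = length_distributions(sample)
print(type_lengths.most_common(1))
print(token_lengths.most_common(1))
```

In the sample, "rock" and "good" each count once as types but twice as tokens, which is why the two distributions can peak at different lengths.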
The longest words (including hyphens) include:
two-million-kilogram
xenonite-to-xenonite
(20 characters)
high-tensile-strength
second-from-the-right
radiation-containment
(21 characters)
righty-tighty-lefty-loosey
(26 characters)
Three-thousand-five-hundredth
(29 characters)
Splitting hyphens, the longest words are:
misunderstandings
astronavigational
compartmentalized
intergovernmental
(17 letters)
anthropomorphizing
(18 letters)
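As a rough sketch of how the two lists above could be produced from a set of distinct words, the helper below groups words by length, optionally splitting on hyphens first. The name `longest_words` and its parameters are hypothetical, and it assumes length is measured as the plain character count of each word or part.

```python
from collections import defaultdict

def longest_words(types, split_hyphens=False, top=2):
    """Group words by length and return the `top` greatest lengths with their words."""
    by_length = defaultdict(set)
    for word in types:
        # Either keep the hyphenated word whole or measure each part separately.
        parts = word.split("-") if split_hyphens else [word]
        for part in parts:
            if part:
                by_length[len(part)].add(part)
    lengths = sorted(by_length, reverse=True)[:top]
    return {n: sorted(by_length[n]) for n in lengths}

sample = {"xenonite-to-xenonite", "anthropomorphizing", "misunderstandings", "rock"}
whole = longest_words(sample)
split = longest_words(sample, split_hyphens=True)
print(whole)
print(split)
```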
The chart below illustrates the distribution:
For tokenization, we split on whitespace, em dashes, en dashes, ellipses, and solidi (slashes), and stripped all punctuation other than apostrophes and hyphens. We considered only words written in Latin letters (including those with diacritics) and so eliminated Chinese and Russian words, abbreviations, email addresses, numbers, times, quantities, music notation, and Eridian numbers. Hyphenated words were treated as a single unit. Note that there is still an unusually high proportion of shorter words because of units (where they contain only letters) or partial words.
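A minimal Python sketch of these tokenization rules follows. It is an approximation under stated assumptions, not the code behind the numbers in this post: the names `tokenize` and `is_latin_letter` are invented, curly apostrophes are folded to straight ones, apostrophes and hyphens at the edges of a word are trimmed, email addresses are detected with a simple "@" check, and no attempt is made to filter abbreviations, music notation, or Eridian numbers.

```python
import re
import unicodedata

# Split on whitespace, em dash, en dash, ellipsis (single character or dots), and solidus.
SPLIT_RE = re.compile(r"(?:[\s\u2014\u2013\u2026/]|\.{3,})+")

def is_latin_letter(ch):
    """True for letters whose Unicode name marks them as Latin, diacritics included."""
    return ch.isalpha() and "LATIN" in unicodedata.name(ch, "")

def tokenize(text):
    # Compose accents so "i" + combining diaeresis becomes a single letter.
    text = unicodedata.normalize("NFC", text).replace("\u2019", "'")
    words = []
    for chunk in SPLIT_RE.split(text):
        if "@" in chunk:
            continue  # assumption: treat anything containing "@" as an email address
        # Strip all punctuation other than apostrophes and hyphens.
        cleaned = "".join(ch for ch in chunk if ch.isalnum() or ch in "'-")
        cleaned = cleaned.strip("'-")  # assumption: drop quote marks and stray hyphens at the edges
        if not cleaned:
            continue
        # Keep only words made of Latin letters, apostrophes, and hyphens.
        if all(is_latin_letter(ch) or ch in "'-" for ch in cleaned):
            words.append(cleaned)
    return words

print(tokenize("Fist my bump\u2014good/bad\u2026 don't, 42 high-tensile-strength"))
```

Hyphenated words survive as a single token because the hyphen is neither a split character nor stripped as punctuation.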