WARNING!

This website contains spoilers for Andy Weir’s Project Hail Mary.
It is recommended you read the book before exploring this site.

#71 IV̶V̶

What is the distribution of word length across Project Hail Mary?

Counting distinct words (types), the most common word length is 6: there are 1,651 distinct words of length 6.

Counting actual occurrences (tokens), the most common word length is 4: words of length 4 occur 31,300 times in the text.

The longest words, counting each hyphenated compound as a single word, are:

two-million-kilogram
xenonite-to-xenonite
(20 characters)

high-tensile-strength
second-from-the-right
radiation-containment
(21 characters)

righty-tighty-lefty-loosey
(26 characters)

Three-thousand-five-hundredth
(29 characters)

Splitting hyphens, the longest words are:

misunderstandings
astronavigational
compartmentalized
intergovernmental
(17 letters)

anthropomorphizing
(18 letters)
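The two rankings above differ only in whether hyphenated compounds are kept whole or split into their parts. A minimal Python sketch of that comparison follows. The function name `longest_words` and its parameters are hypothetical, chosen for illustration, and are not taken from the site's own tooling.

```python
def longest_words(words, n=5, split_hyphens=False):
    """Return the n longest distinct words, longest first.

    Hypothetical helper for illustration only. With split_hyphens=True,
    each hyphenated compound is broken into its parts before ranking.
    """
    if split_hyphens:
        # "radiation-containment" -> "radiation", "containment"
        words = [part for w in words for part in w.split("-") if part]
    # Rank distinct words by length (descending), then alphabetically for ties.
    return sorted(set(words), key=lambda w: (-len(w), w))[:n]


if __name__ == "__main__":
    sample = [
        "righty-tighty-lefty-loosey",
        "anthropomorphizing",
        "radiation-containment",
        "misunderstandings",
    ]
    print(longest_words(sample, n=2))
    print(longest_words(sample, n=2, split_hyphens=True))
```

Length here is simply the character count, so hyphens contribute to the length of an unsplit compound, which matches the character counts listed above.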

The chart below illustrates the distribution:

[Chart: number of types and tokens for each word length, covering lengths 1 to 21, 26, and 29]

For tokenization purposes, we split on whitespace, em dashes, en dashes, ellipses, and the solidus (/), and stripped all punctuation other than apostrophes and hyphens. We considered only words written in Latin letters (including letters with diacritics), which eliminated Chinese and Russian words, abbreviations, email addresses, numbers, times, quantities, music notation, and Eridian numbers. Hyphenated words were treated as a single unit. Note that there is still an unusually high proportion of shorter words because of units (where they contain only letters) and partial words.
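The procedure described above can be sketched roughly as follows in Python. This is an approximation under stated assumptions, not the code used to produce the figures on this page. In particular, it assumes that punctuation is stripped only from the edges of each chunk, that words are lowercased before distinct words are counted, and that apostrophes and hyphens count toward a word's length. The names `tokenize`, `is_latin_word`, and `length_distribution` are hypothetical.

```python
import re
import unicodedata
from collections import Counter

# Separators named in the text: whitespace, em dash, en dash, ellipsis, solidus.
SPLIT_RE = re.compile(r"[\s\u2014\u2013\u2026/]+|\.{3,}")
# Anything that is not a letter or digit at the edge of a chunk (assumption:
# punctuation is stripped from the edges only, so "a@b.com" stays intact
# and is rejected later rather than being collapsed into a fake word).
EDGE_RE = re.compile(r"^[\W_]+|[\W_]+$")


def is_latin_word(word):
    """True if word has only Latin letters, apostrophes, and hyphens."""
    has_letter = False
    for ch in word:
        if ch in "'-":
            continue
        # Diacritics pass too, e.g. "LATIN SMALL LETTER E WITH ACUTE".
        if not ch.isalpha() or "LATIN" not in unicodedata.name(ch, ""):
            return False
        has_letter = True
    return has_letter


def tokenize(text):
    """Split text into lowercase word tokens (lowercasing is an assumption)."""
    tokens = []
    for chunk in SPLIT_RE.split(text):
        chunk = chunk.replace("\u2019", "'")  # normalize curly apostrophes
        word = EDGE_RE.sub("", chunk)
        if word and is_latin_word(word):
            tokens.append(word.lower())
    return tokens


def length_distribution(tokens):
    """Return (type_counts, token_counts), each keyed by word length."""
    token_counts = Counter(len(t) for t in tokens)
    type_counts = Counter(len(t) for t in set(tokens))
    return type_counts, token_counts


if __name__ == "__main__":
    sample = "Fist my bump\u2026 the two-million-kilogram ship/hull is 3.5 km away!"
    types, toks = length_distribution(tokenize(sample))
    for length in sorted(toks):
        print(length, types[length], toks[length])
```

Note that a unit such as "km" survives this filter because it contains only letters, which illustrates why short words remain somewhat over-represented.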