WARNING!

This website contains spoilers for Andy Weir’s Project Hail Mary.
It is recommended you read the book before exploring this site.

#7 II

There are 10,402 unique words in Project Hail Mary.

Of these, 23 occur more than a thousand times; 232 occur more than one hundred times; 1,513 occur more than ten times; 5,815 occur more than once; and 4,587 occur only once.
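The frequency breakdown above can be sketched with a simple counter. This is a minimal illustration, not the code used for this site: the function name `frequency_buckets` and the sample sentence are made up, and whether words were case-folded before counting is not stated here, so the sketch counts tokens exactly as given.

```python
from collections import Counter

def frequency_buckets(tokens):
    """Count unique words by how often each one occurs (hypothetical helper)."""
    counts = Counter(tokens)
    freqs = list(counts.values())
    return {
        "unique": len(counts),
        "more_than_1000": sum(1 for c in freqs if c > 1000),
        "more_than_100": sum(1 for c in freqs if c > 100),
        "more_than_10": sum(1 for c in freqs if c > 10),
        "more_than_once": sum(1 for c in freqs if c > 1),
        "only_once": sum(1 for c in freqs if c == 1),
    }

# Made-up sample text, already split into tokens.
sample_tokens = "the ship and the star and the crew".split()
buckets = frequency_buckets(sample_tokens)
print(buckets["unique"], buckets["more_than_once"], buckets["only_once"])
```

The "more than" buckets are nested, so each one includes the buckets above it. The last two are complementary: words occurring more than once plus words occurring only once add up to the total, just as 5,815 + 4,587 = 10,402.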

These counts rely on the tokenization performed by spaCy, an NLP library.

If we instead split the text on whitespace alone, we get 19,114 unique words from 148,630 tokens.
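A whitespace-only split might look like the following sketch. The function name `whitespace_counts` and the sample sentence are illustrative assumptions, not taken from the analysis itself.

```python
def whitespace_counts(text):
    """Return (token count, unique word count) for a whitespace-only split."""
    tokens = text.split()  # str.split() with no argument splits on any run of whitespace
    return len(tokens), len(set(tokens))

# Made-up sample; note the em-dash joining two words into one token.
ws_sample = "What? I said what! What? Nothing—nothing at all."
ws_total, ws_unique = whitespace_counts(ws_sample)
print(ws_total, ws_unique)
```

Here "Nothing—nothing" stays a single token because the em-dash is not whitespace.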

Splitting on em-dashes, en-dashes, slashes, and ellipses as well, we get 18,409 unique words from 149,585 tokens.
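One way to add those extra delimiters is a regular-expression split, sketched below. The pattern is an assumption about what counts as each delimiter (for example, it treats both the single-character ellipsis and runs of three or more periods as an ellipsis); the exact rules used for the figures above are not given.

```python
import re

# Delimiters: whitespace, em-dash, en-dash, slash, the one-character
# ellipsis, and runs of three or more periods (assumed ellipsis forms).
SPLIT_PATTERN = re.compile(r"\s+|—|–|/|…|\.{3,}")

def split_tokens(text):
    """Split on the delimiters above, dropping empty strings between adjacent delimiters."""
    return [t for t in SPLIT_PATTERN.split(text) if t]

# Made-up sample text.
dash_sample = "Nothing—nothing at all... and/or yes"
dash_tokens = split_tokens(dash_sample)
print(len(dash_tokens), len(set(dash_tokens)))
```

A plausible reading of the numbers above is that compounds such as "Nothing—nothing" break into several common words, so the token count rises while the unique count falls.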

But this approach leaves punctuation attached to words, so that

  • “What
  • What!
  • What?

are counted as three distinct words.

spaCy avoids this problem by splitting punctuation off into separate tokens.
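The idea can be approximated with the standard library alone. The sketch below is a rough stand-in, not spaCy's tokenizer: spaCy applies its own language-specific rules (it splits contractions such as "don't" into two tokens, for instance), whereas this pattern simply keeps words whole and emits every punctuation character as its own token. The names `TOKEN_PATTERN`, `tokenize`, and `unique_words` are hypothetical.

```python
import re

# A word (with optional internal apostrophes), or any single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")

def tokenize(text):
    """Break punctuation off into separate tokens (approximation only)."""
    return TOKEN_PATTERN.findall(text)

def unique_words(text):
    """Unique tokens that contain at least one letter or digit."""
    return {t for t in tokenize(text) if any(ch.isalnum() for ch in t)}

# The three variants from the list above.
punct_sample = "“What What! What?"
print(tokenize(punct_sample))
print(unique_words(punct_sample))
```

With punctuation separated, the three variants collapse to the single word "What", where a whitespace split would count three.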

We may refine these counts as manual corrections are made to this analysis.