#7 II
There are 10,402 unique words in Project Hail Mary.
Of these, 23 occur more than a thousand times; 232 occur more than one hundred times; 1,513 occur more than ten times; 5,815 occur more than once; and 4,587 occur only once.
These counts rely on the tokenization performed by spaCy, an NLP library.
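As a rough illustration, frequency bands like these can be tallied from any list of tokens. This is a minimal sketch, not the code behind the figures above: the function name `frequency_bands` and the toy token list are invented for the example, and the tokens would in practice come from whichever tokenizer is in use.

```python
from collections import Counter

def frequency_bands(tokens, thresholds=(1000, 100, 10, 1)):
    """Count distinct words occurring more than each threshold, plus singletons.

    The bands are cumulative: a word that occurs more than 100 times
    is also counted among those occurring more than 10 times.
    """
    counts = Counter(tokens)
    bands = {
        f"more than {t}": sum(1 for c in counts.values() if c > t)
        for t in thresholds
    }
    bands["exactly 1"] = sum(1 for c in counts.values() if c == 1)
    bands["unique"] = len(counts)
    return bands

# Toy token list for illustration only; not taken from the novel.
sample = ["rocky"] * 12 + ["fist"] * 3 + ["bump"] * 2 + ["amaze"]
print(frequency_bands(sample, thresholds=(10, 1)))
# {'more than 10': 1, 'more than 1': 3, 'exactly 1': 1, 'unique': 4}
```

Note that the "more than 1" and "exactly 1" bands always sum to the number of unique words, which is a handy sanity check on reported figures (here, 5,815 + 4,587 = 10,402).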
If we instead split the text on whitespace alone, we get 19,114 unique words from 148,630 tokens.
Splitting on em-dashes, en-dashes, slashes, and ellipses as well, we get 18,409 unique words from 149,585 tokens.
But either way, punctuation remains attached to words, so that
- “What
- What!
- What?
are counted as three distinct words.
spaCy avoids this problem by breaking off punctuation as a separate token.
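The difference between these splitting strategies can be sketched with the standard library alone. The third function below is a simple regex stand-in for the idea of breaking punctuation off as its own token; it is not spaCy's tokenizer and will not reproduce spaCy's output or the counts above. All function names and the sample sentence are invented for illustration.

```python
import re

def whitespace_tokens(text):
    # Split on runs of whitespace only; punctuation stays attached to words.
    return text.split()

def dash_split_tokens(text):
    # Also split on em-dashes, en-dashes, slashes, and ellipses
    # (both the single character "…" and three literal dots).
    parts = re.split(r"\s+|—|–|/|…|\.{3}", text)
    return [p for p in parts if p]

def punct_separated_tokens(text):
    # Regex stand-in (NOT spaCy): a run of word characters, optionally with
    # internal apostrophes, is one token; every other non-space character
    # becomes a token of its own.
    return re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", text)

def what_forms(tokens):
    # Distinct tokens that contain the string "What".
    return {t for t in tokens if "What" in t}

# Invented sample text for illustration only.
sample = "“What is that?” What! What? Stop/go—now"

print(len(whitespace_tokens(sample)), len(dash_split_tokens(sample)))
print(sorted(what_forms(whitespace_tokens(sample))))
print(sorted(what_forms(punct_separated_tokens(sample))))
```

On this sample, the extra delimiters raise the token count from 6 to 8, the whitespace-based split yields three distinct forms of "What", and the punctuation-separating split yields one.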
We may refine these counts as manual corrections are made to this analysis.