#193 V̶VI
A word that appears only once in a corpus is called a hapax legomenon (plural: hapax legomena). Project Hail Mary has 2,760 hapax legomena, or 40.2% of its vocabulary of 6,862 distinct words.
Which characters speak the most of these unique words?
In absolute terms, the narrator contributes the most (2,030), followed by Grace (205) and Stratt (204). But who speaks the most hapax legomena as a proportion of their speech?
Here are the proportions (per 1,000 tokens) for characters with at least 500 tokens of speech:
| Character | Hapax | Tokens | Per 1,000 |
|---|---|---|---|
| Leclerc | 58 | 1,034 | 56.1 |
| DuBois | 18 | 623 | 28.9 |
| Dimitri | 23 | 849 | 27.1 |
| Steve Hatch | 28 | 1,049 | 26.7 |
| Stratt | 204 | 7,842 | 26.0 |
| Bob Redell | 25 | 1,047 | 23.9 |
| Dr. Lokken | 35 | 1,913 | 18.3 |
| Grace | 205 | 16,951 | 12.1 |
| Dr. Browne | 3 | 340 | 8.8 |
| Rocky | 38 | 5,501 | 6.9 |
Leclerc has the highest hapax rate of any major character at 56.1 per 1,000, nearly double the next-highest rate (DuBois, 28.9) and more than double everyone else's. His scenes are full of distinctive vocabulary.
Rocky has the lowest rate at 6.9 per 1,000, a little over half of Grace’s rate and about one-eighth of Leclerc’s. Rocky’s English vocabulary is deliberately limited, and it would be surprising if he used words not otherwise seen in the book.
Grace sits toward the low end at 12.1, the third-lowest rate in the table. He has plenty of vocabulary, but as the character with the most speech, his words tend to recur.
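For readers who want to check the arithmetic, here is a minimal sketch that recomputes the per-1,000 rate from the counts in the table above. The function name `rate_per_1000` is just an illustrative choice, and only three rows are copied over.

```python
# (hapax count, speech tokens) copied from three rows of the table above.
counts = {
    "Leclerc": (58, 1034),
    "Grace": (205, 16951),
    "Rocky": (38, 5501),
}


def rate_per_1000(hapax: int, tokens: int) -> float:
    """Hapax legomena per 1,000 tokens of speech."""
    return 1000 * hapax / tokens


# Round to one decimal place, as in the table.
rates = {name: round(rate_per_1000(h, t), 1) for name, (h, t) in counts.items()}
print(rates)  # {'Leclerc': 56.1, 'Grace': 12.1, 'Rocky': 6.9}
```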
Hapax legomena are identified at the lemma level across the entire text (not just speech). A word counts as a hapax if its lemmatized form appears exactly once in the whole book.
Punctuation tokens are excluded from all counts.
Characters with fewer than 500 tokens were excluded from the rate table.
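The method described above can be sketched roughly as follows. This is not the code used for the numbers in this post; it is an illustration under stated assumptions. The input is assumed to be a list of `(speaker, token)` pairs covering the whole text, with narration labelled as its own speaker, and the `LEMMAS` table is a hypothetical toy stand-in for a real lemmatizer.

```python
from collections import Counter

# Hypothetical toy lemma table; a real pipeline would use a proper lemmatizer.
LEMMAS = {"ships": "ship", "drifted": "drift"}


def lemma(token: str) -> str:
    """Lowercase the token and map it to its lemma if the toy table knows it."""
    word = token.lower()
    return LEMMAS.get(word, word)


def is_word(token: str) -> bool:
    """Treat a token with no letters or digits as punctuation and exclude it."""
    return any(ch.isalnum() for ch in token)


def hapax_rates(tokens, min_tokens=500):
    """Return {speaker: (hapax, tokens, per_1000)} for speakers above the cutoff.

    `tokens` is a list of (speaker, token) pairs for the entire text, so a
    lemma counts as a hapax only if it appears once in the whole book.
    """
    words = [(s, lemma(t)) for s, t in tokens if is_word(t)]
    lemma_counts = Counter(l for _, l in words)
    hapax = {l for l, c in lemma_counts.items() if c == 1}

    tokens_by_speaker = Counter(s for s, _ in words)
    hapax_by_speaker = Counter(s for s, l in words if l in hapax)
    return {
        s: (hapax_by_speaker[s], n, 1000 * hapax_by_speaker[s] / n)
        for s, n in tokens_by_speaker.items()
        if n >= min_tokens
    }


# Tiny made-up sample, with the cutoff lowered so every speaker is kept.
sample = [
    ("Narrator", "The"), ("Narrator", "ship"), ("Narrator", "drifted"),
    ("Narrator", "."), ("Grace", "Ships"), ("Grace", "drift"),
    ("Grace", "!"), ("Rocky", "Question"), ("Rocky", "?"),
]
result = hapax_rates(sample, min_tokens=1)
print(result)
```

In the sample, "ship"/"Ships" and "drifted"/"drift" collapse to the same lemmas and so are not hapax legomena, while "The" and "Question" each occur once. Counting at the lemma level rather than the surface-form level is what keeps inflected variants from inflating the hapax total.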