Types, Tokens, and Hapaxes: A New Heap's Law
Heaps' Law states that in a large enough text corpus, the number of types N grows with the number of tokens M as N = KM^β, for some free parameters K and β. Much has been written about how this result and its various generalizations can be derived from Zipf's Law. Here we derive from first principles a completely novel expression for the type-token curve and prove its superior accuracy on real text. This expression generalizes naturally to equally accurate estimates of the number of hapaxes and higher n-legomena.
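To make the quantities concrete, here is a minimal Python sketch that counts tokens, types, and n-legomena in a token list and fits the classical power law N = KM^β by ordinary least squares in log-log space. The function names (`type_token_curve`, `n_legomena`, `fit_heaps`), the whitespace tokenizer, and the toy sentence are all assumptions made for illustration. The fit shown is the standard power-law baseline, not the new expression derived in the paper.

```python
import math
from collections import Counter


def type_token_curve(tokens):
    """Return a list of (M, N) pairs: N distinct types among the first M tokens."""
    seen = set()
    curve = []
    for m, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((m, len(seen)))
    return curve


def n_legomena(tokens, n):
    """Count the types that occur exactly n times (n=1 gives the hapaxes)."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == n)


def fit_heaps(curve):
    """Fit log N = log K + beta * log M by least squares; return (K, beta).

    Illustrative baseline only: a plain log-log regression, assumed here
    as one common way to estimate the two free parameters.
    """
    xs = [math.log(m) for m, _ in curve]
    ys = [math.log(n) for _, n in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


# Toy input (hypothetical, far too small for a meaningful fit).
tokens = "the cat sat on the mat and the dog sat on the log".split()
curve = type_token_curve(tokens)

print(curve[-1])              # (13, 8): 13 tokens, 8 types
print(n_legomena(tokens, 1))  # 5 hapaxes: cat, mat, and, dog, log
print(n_legomena(tokens, 2))  # 2 dis legomena: sat, on
print(fit_heaps(curve))       # estimated (K, beta) for this toy curve
```

On a real corpus, the same functions would be run on a properly tokenized text, and the fitted curve could then be compared point by point against the observed type-token curve.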