HyperMinHash: Jaccard index sketching in LogLog space
In this extended abstract, we describe and analyse a streaming probabilistic sketch, HYPERMINHASH, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets A and B. HyperMinHash can be thought of as a compression of standard MinHash by building off of a HyperLogLog count-distinct sketch. Given Jaccard index δ, using k buckets of size O((l) + (|A ∪ B|)) (in practice, typically 2 bytes) per set, HyperMinHash streams over A and B and generates an estimate of the Jaccard index δ with error O(1/l + √(k/δ)). This improves on the best previously known sketch, MinHash, which requires the same number of storage units (buckets), but using O((|A ∪ B|)) bit per bucket. For instance, our new algorithm allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10^19 with relative error of around 5 64KiB of memory; the previous state-of-the-art MinHash can only estimate Jaccard indices for cardinalities of 10^10 with the same memory consumption. Alternately, one can think of HyperMinHash as an augmentation of b-bit MinHash that enables streaming updates, unions, and cardinality estimation (and thus intersection cardinality by way of Jaccard), while using extra bits.
READ FULL TEXT