Computing all-vs-all MEMs in grammar-compressed text
We describe a compression-aware method to compute all-vs-all maximal exact matches (MEMs) among the strings of a repetitive collection 𝒯. The key concept in our work is the construction of a fully-balanced grammar 𝒢 from 𝒯 that meets a property we call fix-free: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., a set that is both prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of 𝒯 incrementally over 𝒢 using a standard suffix-tree-based MEM algorithm, which runs on a subset of the grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. (2020), we show how to build 𝒢 from 𝒯 in linear time and space. We also demonstrate that our MEM algorithm runs on top of 𝒢 in O(G + occ) time and uses O((G + occ) log G) bits, where G is the grammar size and occ is the number of MEMs in 𝒯. In the conclusions, we discuss how our idea can be modified to implement approximate pattern matching in compressed space.
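The two definitions the abstract relies on, a fix-free set and a maximal exact match, can be made concrete with a small sketch. The code below is only an illustration of the definitions on plain, uncompressed strings; it is a quadratic brute force and not the grammar-based algorithm described above. The function names `is_fix_free` and `naive_mems`, and the `min_len` parameter, are hypothetical and chosen for this example.

```python
from itertools import permutations


def is_fix_free(strings):
    """Return True if no string in the set is a prefix or a suffix
    of a different string in the set (prefix-free and suffix-free)."""
    items = set(strings)  # duplicates collapse; only distinct strings are compared
    for a, b in permutations(items, 2):
        if b.startswith(a) or b.endswith(a):
            return False
    return True


def naive_mems(x, y, min_len=1):
    """Brute-force MEMs between two plain strings (illustration only).

    Returns triples (i, j, l) such that x[i:i+l] == y[j:j+l] and the match
    cannot be extended to the left or to the right.
    """
    mems = []
    for i in range(len(x)):
        for j in range(len(y)):
            # A match that can be extended to the left is not maximal.
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            l = 0
            while i + l < len(x) and j + l < len(y) and x[i + l] == y[j + l]:
                l += 1
            if l >= min_len:
                mems.append((i, j, l))
    return mems


if __name__ == "__main__":
    print(is_fix_free({"ab", "ba", "cc"}))      # no string is a prefix/suffix of another
    print(is_fix_free({"ab", "abc"}))           # "ab" is a prefix of "abc"
    print(naive_mems("abcab", "cabc", min_len=2))
```

For the inputs `"abcab"` and `"cabc"` with `min_len=2`, the sketch reports the matches `abc` (at positions 0 and 1) and `cab` (at positions 2 and 0). An all-vs-all computation over a collection would apply the same definition to every pair of strings, which is the cost the compressed approach aims to avoid.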