Practical KMP/BM Style Pattern-Matching on Indeterminate Strings
In this paper we describe two simple, fast, space-efficient algorithms for finding all matches of an indeterminate pattern p = p[1..m] in an indeterminate string x = x[1..n], where both p and x are defined on a "small" ordered alphabet Σ - say, σ = |Σ| ≤ 9. Both algorithms depend on a preprocessing phase that replaces Σ by an integer alphabet Σ_I of size σ_I = σ which (reversibly, in time linear in string length) maps both x and p into equivalent regular strings y and q, respectively, on Σ_I, whose maximum (indeterminate) letter can be expressed in a 32-bit word (for σ≤ 4, thus for DNA sequences, an 8-bit representation suffices). We first describe an efficient version KMP Indet of the venerable Knuth-Morris-Pratt algorithm to find all occurrences of q in y (that is, of p in x), but, whenever necessary, using the prefix array, rather than the border array, to control shifts of the transformed pattern q along the transformed string y. We go on to describe a similar efficient version BM Indet of the Boyer- Moore algorithm that turns out to execute significantly faster than KMP Indet over a wide range of test cases. A noteworthy feature is that both algorithms require very little additional space: Θ(m) words. We conjecture that a similar approach may yield practical and efficient indeterminate equivalents to other well-known pattern-matching algorithms, in particular the several variants of Boyer-Moore.
READ FULL TEXT