Gapped String Indexing in Subquadratic Space and Sublinear Query Time
In Gapped String Indexing, the goal is to compactly represent a string S of length n such that given queries consisting of two strings P_1 and P_2, called patterns, and an integer interval [α, β], called gap range, we can quickly find occurrences of P_1 and P_2 in S with distance in [α, β]. Due to the many applications of this fundamental problem in computational biology and elsewhere, there is a great body of work for restricted or parameterised variants of the problem. However, for the general problem statement, no improvements upon the trivial 𝒪(n)-space 𝒪(n)-query time or Ω(n^2)-space Õ(|P_1| + |P_2| + occ)-query time solutions were known so far. We break this barrier obtaining interesting trade-offs with polynomially subquadratic space and polynomially sublinear query time. In particular, we show that, for every 0≤δ≤ 1, there is a data structure for Gapped String Indexing with either Õ(n^2-δ/3) or Õ(n^3-2δ) space and Õ(|P_1| + |P_2| + n^δ· (occ+1)) query time, where occ is the number of reported occurrences. As a new fundamental tool towards obtaining our main result, we introduce the Shifted Set Intersection problem: preprocess a collection of sets S_1, …, S_k of integers such that given queries consisting of three integers i,j,s, we can quickly output YES if and only if there exist a ∈ S_i and b ∈ S_j with a+s = b. We start by showing that the Shifted Set Intersection problem is equivalent to the indexing variant of 3SUM (3SUM Indexing) [Golovnev et al., STOC 2020]. Via several steps of reduction we then show that the Gapped String Indexing problem reduces to polylogarithmically many instances of the Shifted Set Intersection problem.
READ FULL TEXT