Gapped String Indexing in Subquadratic Space and Sublinear Query Time

11/30/2022
by   Philip Bille, et al.
0

In Gapped String Indexing, the goal is to compactly represent a string S of length n such that given queries consisting of two strings P_1 and P_2, called patterns, and an integer interval [α, β], called gap range, we can quickly find occurrences of P_1 and P_2 in S with distance in [α, β]. Due to the many applications of this fundamental problem in computational biology and elsewhere, there is a great body of work for restricted or parameterised variants of the problem. However, for the general problem statement, no improvements upon the trivial 𝒪(n)-space 𝒪(n)-query time or Ω(n^2)-space Õ(|P_1| + |P_2| + occ)-query time solutions were known so far. We break this barrier obtaining interesting trade-offs with polynomially subquadratic space and polynomially sublinear query time. In particular, we show that, for every 0≤δ≤ 1, there is a data structure for Gapped String Indexing with either Õ(n^2-δ/3) or Õ(n^3-2δ) space and Õ(|P_1| + |P_2| + n^δ· (occ+1)) query time, where occ is the number of reported occurrences. As a new fundamental tool towards obtaining our main result, we introduce the Shifted Set Intersection problem: preprocess a collection of sets S_1, …, S_k of integers such that given queries consisting of three integers i,j,s, we can quickly output YES if and only if there exist a ∈ S_i and b ∈ S_j with a+s = b. We start by showing that the Shifted Set Intersection problem is equivalent to the indexing variant of 3SUM (3SUM Indexing) [Golovnev et al., STOC 2020]. Via several steps of reduction we then show that the Gapped String Indexing problem reduces to polylogarithmically many instances of the Shifted Set Intersection problem.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset