Sketching and Sequence Alignment: A Rate-Distortion Perspective

07/09/2021
by   Ilan Shomorony, et al.
0

Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. A standard approach to speed up this task is to compute "sketches" of the DNA reads (typically via hashing-based techniques) that allow the efficient computation of pairwise alignment scores. We propose a rate-distortion framework to study the problem of computing sketches that achieve the optimal tradeoff between sketch size and alignment estimation distortion. We consider the simple setting of i.i.d. error-free sources of length n and introduce a new sketching algorithm called "locational hashing." While standard approaches in the literature based on min-hashes require B = (1/D) · O( log n ) bits to achieve a distortion D, our proposed approach only requires B = log^2(1/D) · O(1) bits. This can lead to significant computational savings in pairwise alignment estimation.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset

Sign in with Google

×

Use your Google Account to sign in to DeepAI

×

Consider DeepAI Pro