Minimum Segmentation for Pan-genomic Founder Reconstruction in Optimal Time
Given a threshold L and a set R = {R_1, ..., R_m} of m haplotype sequences, each having length n, the minimum segmentation problem for founder reconstruction is to partition the sequences into disjoint segments R[i_1+1,i_2], R[i_2+1, i_3], ..., R[i_r-1+1, i_r], where 0 = i_1 < ... < i_r = n and R[i_j-1+1, i_j] is the set {R_1[i_j-1+1, i_j], ..., R_m[i_j-1+1, i_j]}, such that the length of each segment, i_j - i_j-1, is at least L and K = _j{ |R[i_j-1+1, i_j]| } is minimized. The distinct substrings in the segments R[i_j-1+1, i_j] represent founder blocks that can be concatenated to form K founder sequences representing the original R such that crossovers happen only at segment boundaries. We give an optimal O(mn) time algorithm to solve the problem, improving over earlier O(mn^2). This improvement enables to exploit the algorithm on a pan-genomic setting of haplotypes being complete human chromosomes, with a goal of finding a representative set of references that can be indexed for read alignment and variant calling.
READ FULL TEXT