Towards Better Bounds for Finding Quasi-Identifiers
We revisit the problem of finding small ϵ-separation keys introduced by Motwani and Xu (2008). In this problem, the input is m-dimensional tuples x_1,x_2,…,x_n. The goal is to find a small subset of coordinates that separates at least (1-ϵ)n 2 pairs of tuples. They provided a fast algorithm that runs on Θ(m/ϵ) tuples sampled uniformly at random. We show that the sample size can be improved to Θ(m/√(ϵ)). Our algorithm also enjoys a faster running time. To obtain this result, we provide upper and lower bounds on the sample size to solve the following decision problem. Given a subset of coordinates A, reject if A separates fewer than (1-ϵ)n 2 pairs, and accept if A separates all pairs. The algorithm must be correct with probability at least 1-δ for all A. We show that for algorithms based on sampling: - Θ(m/√(ϵ)) samples are sufficient and necessary so that δ≤ e^-m and - Ω(√(log m/ϵ)) samples are necessary so that δ is a constant. Our analysis is based on a constrained version of the balls-into-bins problem. We believe our analysis may be of independent interest. We also study a related problem that asks for the following sketching algorithm: with given parameters α,k and ϵ, the algorithm takes a subset of coordinates A of size at most k and returns an estimate of the number of unseparated pairs in A up to a (1±ϵ) factor if it is at least αn 2. We show that even for constant α and success probability, such a sketching algorithm must use Ω(mk logϵ^-1) bits of space; on the other hand, uniform sampling yields a sketch of size Θ(mk log m/αϵ^2) for this purpose.
READ FULL TEXT