Instance-by-instance optimal identity testing

Gregory Valiant; Paul Valiant

Electronic Colloquium on Computational Complexity

Under the auspices of the Computational Complexity Foundation (CCF)

REPORTS > DETAIL:

Paper:

TR13-111 | 17th August 2013 07:42

Instance-by-instance optimal identity testing

TR13-111 Authors: Gregory Valiant, Paul Valiant
Publication: 17th August 2013 16:02
Downloads: 4225

Keywords:

distribution testing, Holder`s inequality, identity testing, lower bounds, Sample Complexity

Abstract:

We consider the problem of verifying the identity of a distribution: Given the description of a distribution over a discrete support p=(p_1,p_2,\ldots,p_n), how many samples (independent draws) must one obtain from an unknown distribution, q, to distinguish, with high probability, the case that p=q from the case that the total variation distance (L_1 distance) ||p-q||_1 \ge \epsilon? We resolve this question, up to constant factors, on an instance by instance basis: there exist universal constants c,c' and a function f(p,\epsilon) on distributions and error parameters, such that our tester distinguishes p=q from ||p-q||_1\ge \epsilon using f(p,\epsilon) samples with success probability >2/3, but no tester can distinguish p=q from ||p-q||_1\ge c\cdot \epsilon when given c'\cdot f(p,\epsilon) samples. The function f(p,\epsilon) is upper-bounded by a multiple of \frac{||p||_{2/3}}{\epsilon^2}, but is more complicated, and is significantly smaller in cases when p has many small domain elements. This result significantly generalizes and tightens previous results: since distributions of support at most n have L_{2/3} norm bounded by \sqrt{n}, this result immediately shows that for such distributions, O(\sqrt{n}/{\epsilon^2}) samples suffice, tightening the previous bound of O(\frac{\sqrt{n} polylog n}{\epsilon^4}) for this class of distributions, and matching the (tight) results for the case that p is the uniform distribution over support n.

The analysis of our very simple testing algorithm involves several hairy inequalities. To facilitate this analysis, we give a complete characterization of a general class of inequalities---generalizing Cauchy-Schwarz, Holder's inequality, and the monotonicity of L_p norms. Specifically, we characterize the set of sequences a=a_1,\ldots,a_m, b=b_1,\ldots,b_m, c=c_1\ldots,c_m, for which it holds that for all finite sequences of positive numbers x=x_1,\ldots and y=y_1,\ldots,

\prod_i \left( \sum_j x_j^{a_i} y_j^{b_i}\right)^{c_i} \ge 1.

The characterization is of a perhaps non-traditional nature in that it uses linear programming to compute a derivation that may otherwise have to be sought through trial and error, by hand. We do not believe such a characterization has appeared in the literature, and hope its computational nature will facilitate analyses like the one here.

ISSN 1433-8092 | Imprint