Sample-Optimal Identity Testing with High Probability

Ilias Diakonikolas; Themis Gouleakis; John Peebles; Eric Price

Electronic Colloquium on Computational Complexity

Under the auspices of the Computational Complexity Foundation (CCF)

REPORTS > DETAIL:

Paper:

TR17-133 | 7th September 2017 08:24

Sample-Optimal Identity Testing with High Probability

TR17-133 Authors: Ilias Diakonikolas, Themis Gouleakis, John Peebles, Eric Price
Publication: 8th September 2017 14:01
Downloads: 1284

Keywords:

distribution testing, identity testing

Abstract:

We study the problem of testing identity against a given distribution (a.k.a. goodness-of-fit) with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0< \epsilon, \delta < 1, we wish to distinguish, {\em with probability at least 1-\delta}, whether
the distributions are identical versus \epsilon-far in total variation (or statistical) distance. Existing work has focused on the constant confidence regime, i.e., the case that \delta = \Omega(1), for which the sample complexity of identity testing is known to be \Theta(\sqrt{n}/\epsilon^2).

Typical applications of distribution property testing require small values of the confidence parameter \delta (which correspond to small ``p-values'' in the statistical hypothesis testing terminology). Prior work achieved arbitrarily small values of \delta via black-box amplification, which multiplies the required number of samples by \Theta(\log(1/\delta)). We show that this upper bound is suboptimal for any \delta = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is

\Theta\left( \frac{1}{\epsilon^2}\left(\sqrt{n \log(1/\delta)} + \log(1/\delta) \right)\right)

for any n, \epsilon, and \delta. For the special case of uniformity testing, where the given distribution is the uniform distribution U_n over the domain, our new tester is surprisingly simple: to test whether p = U_n versus \dtv (p, U_n) \geq \epsilon, we simply threshold \dtv(\wh{p}, U_n), where \wh{p} is the empirical probability distribution. We believe that our novel analysis techniques may be useful for other distribution testing problems as well.

ISSN 1433-8092 | Imprint