REPORTS > DETAIL:

### Revision(s):

Revision #1 to TR10-179 | 11th May 2011 03:28

#### A CLT and tight lower bounds for estimating entropy

Revision #1
Authors: Gregory Valiant, Paul Valiant
Accepted on: 11th May 2011 03:28
Keywords:

Abstract:

We prove two new multivariate central limit theorems; the first relates the sum of independent distributions to the multivariate Gaussian of corresponding mean and covariance, under the earthmover distance matric (also known as the Wasserstein metric). We leverage this central limit theorem to prove a stronger but more specific central limit theorem for generalized multinomial" distributions---a large class of discrete distributions, parameterized by matrices, that generalize binomial and multinomial distributions, and describe many distributions encountered in computer science. This central limit theorem relates a generalized multinomial distribution to a multivariate Gaussian distribution, discretized by rounding to the nearest lattice points. In contrast to the metric of our first central limit theorem, this bound is in terms of statistical distance, which immediately implies that any algorithm with input drawn from a generalized multinomial distribution behaves essentially as if the input were drawn from a discretized Gaussian with the same mean and covariance. Such tools in the multivariate setting are rare, and we hope this new tool will be of use to the community.

In the second part of the paper, we employ this central limit theorem to establish a lower bound of $\Omega(\frac{n}{\log n})$ on the sample complexity of additively estimating the entropy or support size of a distribution (where $1/n$ is a lower bound on the probability of any element in the domain). Together with the canonical estimator constructed in the companion paper [33], this settles the longstanding open question of the sample complexities of estimating these properties to additive constants, up to constant factors. In particular, for any constants $c_1 c_1,$ and whose support sizes differ by at least $c_2 n$, such that no algorithm on $o(\frac{n}{\log n})$ samples can distinguish $D$ from $D'$ with probability greater than $2/3.$ For the problem of estimating entropy, we also provide a bound on the rate of convergence of an optimal estimator, showing that the sample complexity of estimating entropy to within additive $c$ is $\Omega\left(\frac{n}{c \log n}\right).$ The previous lower-bounds on these sample complexities were $n/2^{\Theta(\sqrt{\log n})}$, for constant $c$, from [34]. We explicitly exhibit such a family of pairs of distributions $D,D'$ via a Laguerre polynomial construction that may be of independent interest.

Changes to previous version:

Fixed many typos. The most significant change is to the statement of Theorem 1, where we removed some excess wording that had confusing and incorrect asymptotics.

### Paper:

TR10-179 | 18th November 2010 09:55

#### A CLT and tight lower bounds for estimating entropy

TR10-179
Authors: Gregory Valiant, Paul Valiant
Publication: 18th November 2010 16:11
We prove two new multivariate central limit theorems; the first relates the sum of independent distributions to the multivariate Gaussian of corresponding mean and covariance, under the earthmover distance matric (also known as the Wasserstein metric). We leverage this central limit theorem to prove a stronger but more specific central limit theorem for generalized multinomial" distributions---a large class of discrete distributions, parameterized by matrices, that generalize binomial and multinomial distributions, and describe many distributions encountered in computer science. This central limit theorem relates a generalized multinomial distribution to a multivariate Gaussian distribution, discretized by rounding to the nearest lattice points. In contrast to the metric of our first central limit theorem, this bound is in terms of statistical distance, which immediately implies that any algorithm with input drawn from a generalized multinomial distribution behaves essentially as if the input were drawn from a discretized Gaussian with the same mean and covariance. Such tools in the multivariate setting are rare, and we hope this new tool will be of use to the community.
In the second part of the paper, we employ this central limit theorem to establish a lower bound of $\Omega(\frac{n}{\log n})$ on the sample complexity of additively estimating the entropy or support size of a distribution (where $1/n$ is a lower bound on the probability of any element in the domain). Together with the canonical estimator constructed in the companion paper [33], this settles the longstanding open question of the sample complexities of these estimation problems, up to constant factors. In particular, for any constant $c$, there is a pair of distributions $D,D'$ each of whose elements occurs with probability at least $1/n,$ and whose entropies satisfy $H(D)-H(D') > c,$ such that no algorithm on $o(\frac{1}{c\log (1/c)} \frac{n}{\log n})$ samples can distinguish $D$ from $D'$ with probability greater than $2/3,$ and analogously for the problem of estimating the support size. The previous lower-bounds on these sample complexities were $n/2^{\Theta(\sqrt{\log n})}$, for constant $c$, from [34]. We explicitly exhibit such a pair of distributions via a Laguerre polynomial construction that may be of independent interest.