
Let $\mathbf{x}\in\Bbb{R}^n$ be a multivariate normal vector, i.e., $\mathbf{x}\sim\mathcal{N}(\bar{\mathbf{x}},\Sigma)$, where the mean vector $\bar{\mathbf{x}}$ and the covariance matrix $\Sigma\in\Bbb{S}_{++}^n$ are given. Note that $\Bbb{S}_{++}^n$ denotes the set of symmetric positive definite $n\times n$ real matrices.

Also, let $h\colon\Bbb{R}^n\to\Bbb{R}$ be a real-valued function of $\mathbf{x}\sim\mathcal{N}(\bar{\mathbf{x}},\Sigma)$. Since $\mathbf{x}$ is a random vector, we can find the mean value of $h$ as $$ \bar{h} = \int_{\Bbb{R}^n}\! h(\mathbf{x})f(\mathbf{x}) \,\mathrm{d}\mathbf{x}, $$ where $f$ denotes the probability density function of $\mathbf{x}$.

In the simple case where $h$ is the identity function, i.e., $h(\mathbf{x})=\mathbf{x}$, the mean value of $h$ is just the mean value of $\mathbf{x}$; that is, $$ \bar{h} = \int_{\Bbb{R}^n}\! \mathbf{x}f(\mathbf{x}) \,\mathrm{d}\mathbf{x} = \bar{\mathbf{x}}. $$

Now, let's assume that $h$ is an arbitrary function and we can generate the following samples from $\mathcal{N}(\bar{\mathbf{x}},\Sigma)$: $\mathbf{x}_i\in\Bbb{R}^n$, $i=1,\ldots,N$.

I have the following questions:

A. Does it hold true that, as $N$ tends to infinity, the mean value of $h$ can be estimated by the following quantity? If so, is this a consequence of the central limit theorem? $$ \tilde{h} = \frac{1}{N}\sum_{i=1}^{N}h(\mathbf{x}_i) $$
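The estimator $\tilde{h}$ is easy to try numerically. Below is a minimal Monte Carlo sketch in Python; the mean vector, covariance matrix, and the choice $h(\mathbf{x})=\lVert\mathbf{x}\rVert^2$ are illustrative assumptions, picked because the exact answer $\operatorname{tr}(\Sigma)+\lVert\bar{\mathbf{x}}\rVert^2$ is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 3                               # dimensionality of x
x_bar = np.array([1.0, -2.0, 0.5])  # mean vector (illustrative)
Sigma = np.diag([1.0, 2.0, 0.5])    # covariance matrix (SPD, illustrative)

def h(x):
    """Example scalar function of x; here the squared Euclidean norm."""
    return np.sum(x**2, axis=-1)

# Draw N i.i.d. samples x_i ~ N(x_bar, Sigma) and average h over them.
for N in (100, 10_000, 1_000_000):
    samples = rng.multivariate_normal(x_bar, Sigma, size=N)
    h_tilde = h(samples).mean()
    print(N, h_tilde)

# For h(x) = ||x||^2 the exact mean is tr(Sigma) + ||x_bar||^2,
# so h_tilde should approach this value as N grows.
exact = np.trace(Sigma) + x_bar @ x_bar
print("exact:", exact)  # 3.5 + 5.25 = 8.75
```

As $N$ increases, the printed estimates settle around the exact value, which is the law-of-large-numbers behaviour asked about in A.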

B. How many samples, $N$, should we draw from the distribution so that we have a good (with some given error $\epsilon$?) estimation of $\bar{h}$? How is this related with the dimensionality of the input space?

For instance, in the case of $h(\mathbf{x})=\mathbf{x}$, how many samples do we need to have a good estimation of the mean $\bar{\mathbf{x}}$?

In general, in the case of an arbitrary function $h$, can we have any error bounds for choosing the sampling size $N$? Is there a method for finding such bounds based on the explicit form of $h$?


EDIT: Based on the excellent answer of @Batman below, I tried the following (work in progress):

First attempt (Failed)

McDiarmid’s inequality (a.k.a. the bounded-difference inequality). For completeness' sake, I copy the following theorem from this monograph by Raginsky and Sason (Sect. 2.2.3, pp. 18-19):

Let $\mathcal{X}$ be a set, and let $h\colon\mathcal{X}^n\to\Bbb{R}$ be a function that satisfies the bounded difference assumption:

$$ \sup_{x_1,\ldots,x_n,x_i^\prime} \lvert h(x_1,\ldots,x_{i-1},x_i,x_{i+1},\ldots,x_n) -h(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_n) \rvert\leq d_i $$ for every $1\leq i\leq n$, where $d_i$ are arbitrary non-negative real constants. This is equivalent to saying that, for every given $i$, the variation of the function $h$ with respect to its $i$-th coordinate is upper bounded by $d_i$.

Theorem (McDiarmid’s inequality). Let $\{X_k\}_{k=1}^{n}$ be independent (not necessarily identically distributed) random variables taking values in a measurable space $\mathcal{X}$. Consider a random variable $U = h(X_1,\ldots,X_n)$, where $h\colon\mathcal{X}^n\to\Bbb{R}$ is a measurable function satisfying the bounded difference assumption. Then, for every $r\geq0$, $$ P\left(\lvert U-\Bbb{E}U\rvert\geq r\right) \leq 2\exp \left(-\frac{2r^2}{\sum_{k=1}^{n}d_k^2}\right) $$

The function $h$ I am interested in is the so-called "hinge loss", i.e., $h(\mathbf{x})=\max(0, 1-y(\mathbf{w}^\top\mathbf{x}+b))$, where $\mathbf{w}$, $b$, and $y$ are given parameters.

It seems that McDiarmid’s inequality is not appropriate here, since the hinge loss does not satisfy the bounded difference assumption (it is unbounded when the input is Gaussian).

So, now I'm looking for another such inequality appropriate for $h(\mathbf{x})=\max(0, 1-y(\mathbf{w}^\top\mathbf{x}+b))$.

However, besides this, what I still don't understand is how the sampling size $N$ (for estimating $\tilde{h} = \frac{1}{N}\sum_{i=1}^{N}h(\mathbf{x}_i)$) can be related to the "error" $r$ and the dimensionality $n$. Can you help on this particular issue?
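For what it's worth, here is how $N$ enters when McDiarmid's inequality does apply: view the sample mean $\tilde{h}(\mathbf{x}_1,\ldots,\mathbf{x}_N)=\frac{1}{N}\sum_{i}h(\mathbf{x}_i)$ itself as a function of the $N$ independent samples. If $h$ is bounded in $[0,B]$ (e.g. a truncated hinge loss $\min(h,B)$ -- an assumption, since the true hinge loss is unbounded), then changing one sample changes $\tilde{h}$ by at most $d_i=B/N$, so $\sum_{k}d_k^2=B^2/N$ and the bound becomes $2\exp(-2Nr^2/B^2)$. Solving for $N$ gives the sketch below; the values of $B$, $r$, $\delta$ are illustrative:

```python
import math

def mcdiarmid_sample_size(B, r, delta):
    """Smallest N with 2*exp(-2*N*r**2 / B**2) <= delta, i.e.
    P(|h_tilde - E[h_tilde]| >= r) <= delta for the empirical mean
    of a function bounded in [0, B] (McDiarmid with d_i = B/N)."""
    return math.ceil(B**2 * math.log(2.0 / delta) / (2.0 * r**2))

# e.g. a hinge loss truncated at B = 5, error r = 0.1, confidence 99%:
print(mcdiarmid_sample_size(B=5.0, r=0.1, delta=0.01))  # -> 6623
```

Note that the dimensionality $n$ of the input space does not appear at all: for a bounded $h$, this bound is dimension-free.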

Second attempt (Needs review)

Lipschitz functions of Gaussian variables

Let's first recall that a function $f\colon\Bbb{R}^n\to\Bbb{R}$ is $\mathcal{L}$-Lipschitz with respect to the Euclidean norm if $$ \lvert f(\mathbf{x})-f(\mathbf{y})\rvert\leq\mathcal{L}\lVert\mathbf{x}-\mathbf{y}\rVert $$ for all $\mathbf{x},\mathbf{y}\in\Bbb{R}^n$.

Theorem: Let $\mathbf{x}=(x_1,\ldots,x_n)$ be a random vector of $n$ i.i.d. standard Gaussian variables, and let $f\colon\Bbb{R}^n\to\Bbb{R}$ be $\mathcal{L}$-Lipschitz with respect to the Euclidean norm $\lVert\cdot\rVert$. Then the variable $f(\mathbf{x})-\Bbb{E}[f(\mathbf{x})]$ is sub-Gaussian with parameter at most $\mathcal{L}$, and hence $$ P\left(\lvert f(\mathbf{x})-\Bbb{E}[f(\mathbf{x})] \rvert \geq r\right) \leq 2\exp\left(-\frac{1}{2}\left(\frac{r}{\mathcal{L}}\right)^2\right). $$

We are interested in the function $h(\mathbf{x})=\max(0, 1-y(\mathbf{w}^\top\mathbf{x}+b))$, where $y\in\{\pm1\}$, and $\mathbf{w}$, $b$ are given parameters.

We can easily show that $h$ is $\mathcal{L}$-Lipschitz with respect to the Euclidean norm, i.e., $$ \lvert h(\mathbf{x})-h(\mathbf{y})\rvert\leq\mathcal{L}\lVert\mathbf{x}-\mathbf{y}\rVert, $$ where $\mathcal{L}=\lVert\mathbf{w}\rVert$. This means that $$ P\left(\lvert h(\mathbf{x})-\Bbb{E}[h(\mathbf{x})] \rvert \geq r\right) \leq 2\exp\left(-\frac{1}{2}\left(\frac{r}{\lVert\mathbf{w}\rVert}\right)^2\right). $$
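If this is right, the sample size enters as follows: the average $\tilde{h}$ of $N$ i.i.d. copies is sub-Gaussian with parameter $\mathcal{L}/\sqrt{N}$, so $P(\lvert\tilde{h}-\Bbb{E}[h]\rvert\geq r)\leq 2\exp\!\big(-Nr^2/(2\mathcal{L}^2)\big)$, and setting the right-hand side equal to a target failure probability $\delta$ yields $N\geq\frac{2\mathcal{L}^2}{r^2}\ln\frac{2}{\delta}$. A sketch under these assumptions (the values of $\mathbf{w}$, $r$, $\delta$ are illustrative; also, the theorem as stated is for standard Gaussian input, so for a general $\Sigma$ the effective constant would be $\lVert\Sigma^{1/2}\mathbf{w}\rVert$ rather than $\lVert\mathbf{w}\rVert$):

```python
import math
import numpy as np

def subgaussian_sample_size(lip, r, delta):
    """Smallest N with 2*exp(-N*r**2 / (2*lip**2)) <= delta for the
    average of N i.i.d. copies of a `lip`-Lipschitz function of a
    standard Gaussian vector (sketch; constants vary by source)."""
    return math.ceil(2.0 * lip**2 * math.log(2.0 / delta) / r**2)

w = np.array([0.5, -1.0, 2.0])   # illustrative weight vector
lip = np.linalg.norm(w)          # Lipschitz constant of the hinge loss in x
print(subgaussian_sample_size(lip, r=0.1, delta=0.01))
```

As with the bounded case, the dimensionality $n$ enters only through $\lVert\mathbf{w}\rVert$, not explicitly: Gaussian concentration for Lipschitz functions is dimension-free.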

1 Answer


The fact that the $x_i$'s and therefore the $h(x_i)$'s are i.i.d. is what's important (though the Gaussianity of $x_i$ can be helpful in some techniques shown in the linked references). The material below can be extended to the non-i.i.d. setting in some cases, but it is much more painful.

A) Under mild conditions (basically, just $E[h(X)]$ exists and some moments or other requirements depending on the variant you're looking at), you can find a (weak/strong) law of large numbers that says that $\tilde{h} \to E[h(X)]$ (in probability/mean square/almost surely/whatever). A central limit theorem would say that under mild conditions, replacing $N$ in the denominator with $\sqrt{N}$ and subtracting the mean would give you convergence in distribution to a Gaussian.

B) You can use concentration inequalities (e.g. a Chernoff bound being the classic one from which most others are derived) to bound the deviation of $\tilde{h}$ from the mean as a function of sample size. Which concentration inequality to use depends on the context. These give you results like $P(| \tilde{h} - E[h] | \geq \epsilon) \leq f(\epsilon,N)$. Many machine learning theory textbooks/note sets cover this sort of material (such as this or this or this). One reference is Concentration Inequalities: A Nonasymptotic Theory of Independence, by Boucheron, Lugosi and Massart. Another nice reference is the book by Sason and Raginsky. And another nice reference is Concentration of Measure for the Analysis of Randomized Algorithms by Dubhashi and Panconesi. This book called High Dimensional Probability For Mathematicians and Data Scientists by Roman Vershynin is also interesting.

You can also find a large deviations principle (LDP) in some cases (e.g. Cramer's Theorem) to see the asymptotic scaling of the deviation-from-mean probability with sample size. These give you results like $\lim_{n \to \infty} \frac{1}{n}\log P(\tilde{h} \geq E[h] + \delta) = - I(\delta)$, where $I$ is a function known as a rate function. The standard reference these days is Dembo & Zeitouni's Large Deviations: Techniques and Applications.
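As a quick numerical illustration of the LDP scaling for standard Gaussians (where Cramér's theorem gives the rate function $I(\delta)=\delta^2/2$), one can check by simulation that $\frac{1}{n}\log P(\bar{X}_n\geq\delta)$ drifts toward $-I(\delta)$ as $n$ grows; the convergence is slow because of polynomial prefactors, so the trend, not the exact value, is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
delta = 0.5          # deviation above the (zero) mean
trials = 200_000     # Monte Carlo repetitions per sample size

# For X_i ~ N(0,1), Cramer's theorem gives rate function
# I(delta) = delta**2 / 2, so (1/n) log P(sample mean >= delta) -> -I(delta).
for n in (5, 20, 50):
    means = rng.standard_normal((trials, n)).mean(axis=1)
    p_emp = (means >= delta).mean()        # empirical tail probability
    rate_emp = np.log(p_emp) / n           # empirical exponential rate
    print(n, rate_emp)

print("limit -I(delta):", -delta**2 / 2)   # -0.125
```

The printed empirical rates decrease in magnitude toward the theoretical limit $-0.125$ as $n$ increases.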

  • Thank you very much for your excellent, excellent answer! I edited my question based on this. I think that the most appropriate inequality is McDiarmid’s, but I still have some issues. If you wish, could you take a look? Thank you so much, your help is priceless! (2017-01-20)
  • And just a clarification: you wrote $P(\lvert \tilde{h}-\Bbb{E}[h]\rvert\geq r)$. Is this correct? In the theorem it's $P(\lvert h-\Bbb{E}[h]\rvert\geq r)$. Thanks :) (2017-01-20)
  • It's correct as written -- $E[\tilde{h}] = E[h]$ (just use linearity of expectation). Hinge loss (or the identity) isn't going to give you bounded differences with Gaussian data, so McDiarmid isn't appropriate (at least directly; truncation may help depending on the application). As for $h$ being the identity, the Chernoff bound is easy to work with in 1 dimension: $P(X \geq x) = P(s X \geq s x) = P(e^{sX} \geq e^{sx}) \leq E[e^{sX}]/e^{sx}$ for $s>0$. (2017-01-20)
  • In most cases, you'll need to do some work to get a concentration inequality to work out for your application. See, for example, Chapter 3, Section 1 of the Vershynin book I updated the answer with. (2017-01-20)
  • Thank you very much for the clarification! Btw, I mentioned the identity function above just as a simple example, which however is wrong, since it's not real-valued (my bad -- I'm going to correct the question later). What I want to use is the "hinge loss" of some affine function $\mathbf{w}^\top\mathbf{x}+b$. So, it seems that McDiarmid's inequality does not work (due to the bounded differences condition), and I'm going to look for another one (now I'm going to study Sect. 3). But there is again an issue I don't understand: how is the sampling size $N$ involved in the process? Thank you! (2017-01-20)
  • I have just edited the original post to describe what I have already tested and what remains an open issue. Actually, the trickiest part for me is how to introduce the sampling size $N$ into this process. The aim is to be able to decide on the sampling size, based on the desired error and the given dimensionality. This is to decide whether it's better to sample from the Gaussian or to compute the mean from a formula that I have developed for it. (2017-01-20)
  • Just once again: I didn't get how $P(\lvert\tilde{h}-\Bbb{E}[h]\rvert\geq r)$ relates to $P(\lvert h-\Bbb{E}[h]\rvert\geq r)$. Could you please help me a bit on this? The theorem statement clearly uses the latter form; how could I use the former? I just remind you that $\tilde{h}=\frac{1}{N}\sum_{i=1}^{N}h(\mathbf{x}_i)$; I can't see how the linearity of expectation would help here -- I see no expectation. Many thanks for your time and help :) (2017-01-23)