
Let $X, Y$ be two random variables, with $X$ taking values in $\Bbb R^n$ and $Y$ taking values in $\Bbb R$.

Then we can look at the function $h: \Bbb R^n \to \Bbb R$ given by $$\beta \mapsto \Bbb E[(Y-X^T\beta)^2].$$ It is claimed that the gradient of $h$ is given by $$\nabla h = \Bbb E[2X(X^T\beta-Y)].$$
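As a quick numerical sanity check (the Gaussian $X$, the linear-plus-noise $Y$, and the particular $\beta$ below are purely illustrative choices, not part of the claim), the claimed formula can be compared against a finite-difference gradient of a Monte Carlo estimate of $h$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 200_000                                        # dimension, sample size
X = rng.normal(size=(m, n))                              # illustrative draws of X
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=m)  # illustrative draws of Y
beta = np.array([0.3, 0.1, -0.7])

def h(b):
    """Monte Carlo estimate of E[(Y - X^T b)^2] on the fixed sample."""
    r = Y - X @ b
    return np.mean(r ** 2)

# Claimed gradient E[2 X (X^T beta - Y)], estimated on the same sample.
grad_claim = 2 * np.mean(X * (X @ beta - Y)[:, None], axis=0)

# Central finite differences of the same sample average; on a fixed sample,
# differentiation commutes with the finite sum, so the two must agree.
eps = 1e-6
grad_fd = np.array([(h(beta + eps * e) - h(beta - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

print(np.allclose(grad_claim, grad_fd, atol=1e-4))  # True
```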

The claimed formula seems like a special case of the identity

$$\nabla \Bbb E[f]=\Bbb E [\nabla f]$$

where the expectation is taken over the joint distribution of the random variables involved.

Formally, we want the following: Suppose $X_1,\dots,X_m$ are random variables taking values in some sets $A_i$, with a given joint probability distribution. Then for every function $f: \Bbb R^n \times \prod A_i \to \Bbb R$ and every $\beta \in \Bbb R^n$, we can form the random variable $f(\beta, X_1,\dots,X_m)$ and take its expectation; letting $\beta$ vary gives a function $\Bbb R^n \to \Bbb R$. We claim that its gradient equals the expectation of the random vector obtained by first fixing the values of $X_1,\dots,X_m$ and taking the gradient of the resulting function $\Bbb R^n \to \Bbb R$ (a random variable taking values in $\Bbb R^n$).
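In symbols: writing $F(\beta) = \Bbb E[f(\beta, X_1,\dots,X_m)]$, the claim reads $$ \nabla F(\beta) = \Bbb E\bigl[\nabla_\beta f(\beta, X_1,\dots,X_m)\bigr], $$ where the gradient inside the expectation is taken with the values of $X_1,\dots,X_m$ held fixed.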

  • Essentially this is differentiating under the integral. I know that this doesn't always work, but I'm an algebraist so I don't have any examples handy. (2017-02-08)
  • I was under the impression that the cases where it doesn't work are pathological. (2017-02-08)

2 Answers


That is not the most general version; it is just convenient for the proof. It basically says that you may differentiate under the expectation.

Consider $f:\Omega \times U \to \mathbb R$, where $(\Omega, \mathcal A, P)$ is some probability space (or a $\sigma$-finite measure space if you like) and $U\subseteq \mathbb R^n$ is open and contains $0$, and define $F(z) = \int f(\;.\;, z) \, dP$.

  1. Assume for (almost) every $\omega\in\Omega$ the function $z \mapsto f(\omega, z)$ is differentiable and define $g(\omega, z) = D_{z} f(\omega, z)$.

  2. Assume there is a $P$-integrable function $S:\Omega \to \mathbb R$ with $\|g(\;.\;, z)\| \le S$ (almost) everywhere for every $z\in U$, and define $G(z) = \int g(\;.\;, z) \, dP$.

  3. Assume $G$ is continuous at $0$.

Then $F$ is differentiable at $0$ with $$ DF(0) = G(0). $$

Proof:

For (almost) every $\omega\in\Omega$ we have $\|g(\omega,\;.\;)\| \le S(\omega) < \infty$. That is, $f(\omega, \;.\;)$ is Lipschitz continuous and the fundamental theorem of calculus applies. Thus, for $z\in U$ we have \begin{align*} F(z) - F(0) - G(0)z &= \int f(\omega, z) - f(\omega,0) - g(\omega,0)z \; dP(\omega) \\ &= \int \int_{0}^1 g(\omega, tz)z - g(\omega, 0)z \; dt\, dP(\omega) \\ &=\biggl(\underbrace{\int \int_{0}^1 g(\omega, tz) - g(\omega, 0) \; dt \, dP(\omega)}_{R(z)} \biggr) z. \end{align*} Since $$ \int\int_0^1 \| g(\omega, tz) - g(\omega, 0) \| \; dt \, dP(\omega) \le \int\int_0^1 2 S(\omega) \; dt \, dP(\omega) = 2 \int S \, dP < \infty, $$ Fubini's theorem applies, and for $z\to 0$ we have $$ R(z) = \int_0^1 \int g(\omega, tz) - g(\omega, 0)\; dP(\omega) \, dt = \int_0^1 G(tz) - G(0) \, dt \to 0. $$ That is, $F$ is differentiable at $0$ with $$ DF(0) = G(0). $$

Notes:

  1. The assumptions are satisfied in your case as soon as $\Bbb E\|X\|^2 < \infty$ and $\Bbb E[Y^2] < \infty$; see the sketch after these notes.

  2. We can relax assumptions 1 and 2 to: assume for almost every $\omega\in\Omega$ the function $f(\omega,\;.\;)$ is locally absolutely continuous with $$ \sup_{z\in U} \int \| D_z f(\;.\;, z) \| \, dP < \infty. $$ However, one needs to be careful about the domain of $D_z f$.

  3. We can replace the Fubini-type argument with a dominated-convergence-type argument under a slightly stronger assumption 3. This would allow arbitrary measures instead of only $\sigma$-finite ones.
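For concreteness, here is a sketch of how the assumptions can be checked for the question's $h$ (assuming $\Bbb E\|X\|^2 < \infty$ and $\Bbb E[Y^2] < \infty$, which the question does not state explicitly). Take $f(\omega, \beta) = (Y(\omega) - X(\omega)^T\beta)^2$ and let $U$ be a bounded neighborhood of the point of interest, say $\|\beta\| \le M$ on $U$. Then $$ D_\beta f(\omega, \beta) = 2\bigl(X(\omega)^T\beta - Y(\omega)\bigr)X(\omega)^T, \qquad \|D_\beta f(\omega, \beta)\| \le 2\|X(\omega)\|\bigl(M\|X(\omega)\| + |Y(\omega)|\bigr) =: S(\omega), $$ and $S$ is $P$-integrable by Cauchy–Schwarz, since $\Bbb E[\|X\|\,|Y|] \le (\Bbb E\|X\|^2)^{1/2}(\Bbb E[Y^2])^{1/2} < \infty$. Continuity of $G$ then follows from dominated convergence with the same dominating function.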


$\beta$ is not a random variable, so you can expand the expression and take $\beta$ out as a factor. Then differentiate that expression with respect to $\beta$ and check that the claimed equality holds; it is not more complicated than that. It would be a problem if you couldn't factor $\beta$ out (e.g. $e^{X^{\top}\beta}$): then you would indeed have to justify interchanging integration and differentiation.
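To spell out the expansion (assuming the second moments below are finite, which is needed for $h$ to be defined anyway): $$ h(\beta) = \Bbb E[(Y - X^T\beta)^2] = \Bbb E[Y^2] - 2\,\Bbb E[YX^T]\,\beta + \beta^T\,\Bbb E[XX^T]\,\beta, $$ an ordinary deterministic quadratic in $\beta$, so $$ \nabla h(\beta) = -2\,\Bbb E[XY] + 2\,\Bbb E[XX^T]\,\beta = \Bbb E[2X(X^T\beta - Y)], $$ as claimed.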

  • So I factor out $\beta$ by rewriting $\Bbb E[(X^T\beta)^2]=\Bbb E[\beta^T(XX^T)\beta]=\beta^T \Bbb E[XX^T]\beta$, and now by the product rule $\nabla (\beta^TA\beta)=(A+A^T)\beta=2A\beta$, since $A=\Bbb E[XX^T]$ is symmetric. Thank you! (2017-02-10)
  • @Leullame Yes, exactly. You are welcome. (2017-02-10)