Let $X, Y$ be two random variables, with $X$ taking values in $\Bbb R^n$ and $Y$ taking values in $\Bbb R$.
Then we can consider the function $h: \Bbb R^n \to \Bbb R$ given by $$\beta \mapsto \Bbb E[(Y-X^T\beta)^2].$$ It is claimed that the gradient of $h$ is given by $$\nabla h(\beta) = \Bbb E[2X(X^T\beta-Y)].$$
This seems like a special case of the identity
$$\nabla \Bbb E[f]=\Bbb E [\nabla f]$$
where the expectation is taken with respect to the joint distribution of some random variables.
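For the quadratic example, the claimed gradient can at least be checked by hand. The following is only a sketch, and it assumes that $X$ and $Y$ have finite second moments, so that every expectation below exists. Expanding the square and using linearity of expectation gives

$$\begin{aligned}
h(\beta) &= \Bbb E[Y^2] - 2\beta^T\Bbb E[XY] + \beta^T\Bbb E[XX^T]\beta,\\
\nabla h(\beta) &= 2\,\Bbb E[XX^T]\beta - 2\,\Bbb E[XY] = \Bbb E[2X(X^T\beta-Y)].
\end{aligned}$$

On the other hand, for fixed values of $X$ and $Y$ the gradient of $\beta \mapsto (Y-X^T\beta)^2$ is $2X(X^T\beta-Y)$, so in this case the gradient of the expectation does agree with the expectation of the gradient.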
Formally, we want the following. Suppose $X_1,\dots,X_m$ are random variables taking values in some sets $A_i$, with a given joint probability distribution. Then for every function $f: \Bbb R^n \times \prod A_i \to \Bbb R$ and every $\beta \in \Bbb R^n$ we can form the random variable $f(\beta, X_1,\dots,X_m)$ and take its expectation. Letting $\beta$ vary gives a function $\Bbb R^n \to \Bbb R$. We claim that the gradient of this function equals the vector obtained as follows: first fix the values of $X_1,\dots,X_m$ and take the gradient of the resulting function $\Bbb R^n \to \Bbb R$ with respect to $\beta$. This gives a random variable taking values in $\Bbb R^n$, and we then take its expectation.
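As a sanity check of the quadratic case, here is a minimal numerical sketch in Python with NumPy. It replaces the true expectation by an average over a finite sample, so it only illustrates the identity for an empirical distribution and proves nothing in general. The data-generating model, the sample size, and the names `h`, `expected_gradient` and `finite_difference_gradient` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_samples = 3, 10_000

# Hypothetical joint distribution: Y is a linear function of X plus noise.
X = rng.normal(size=(num_samples, n))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=num_samples)


def h(beta):
    """Empirical version of E[(Y - X^T beta)^2]."""
    return np.mean((Y - X @ beta) ** 2)


def expected_gradient(beta):
    """Empirical version of E[2 X (X^T beta - Y)]."""
    residual = X @ beta - Y
    return 2.0 * (X * residual[:, None]).mean(axis=0)


def finite_difference_gradient(func, beta, eps=1e-5):
    """Central finite-difference approximation of the gradient of func."""
    grad = np.zeros_like(beta)
    for i in range(beta.size):
        step = np.zeros_like(beta)
        step[i] = eps
        grad[i] = (func(beta + step) - func(beta - step)) / (2 * eps)
    return grad


beta = np.array([0.3, 0.1, -0.7])
grad_of_expectation = finite_difference_gradient(h, beta)
expectation_of_grad = expected_gradient(beta)

# Largest componentwise disagreement between the two computations.
max_abs_diff = np.max(np.abs(grad_of_expectation - expectation_of_grad))
print(max_abs_diff)
```

For a finite sample the expectation is a finite sum, so here the interchange is just linearity of the gradient; the question is what conditions make it valid for a general joint distribution.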