
I have difficulty understanding when minimizing expected squared prediction error:

$$\operatorname{EPE}(\beta)=\int (y-x^T \beta)^2 \Pr(dx, dy),$$

how to reach the solution that

$$\operatorname{E}[yx]-\operatorname{E}[xx^{T}\beta]=0.$$

From A Solution Manual and Notes for the Text: The Elements of Statistical Learning (page 2), I noticed the formula (2):

$$\frac{\partial \operatorname{EPE}}{\partial \beta}=\int 2(y-x^T\beta)(-1) x \Pr(dx, dy).$$

I understand that the chain rule is used to obtain this derivative, but why is the last factor $x$ rather than $x^T$? Since the integrand is a scalar and $x$ is a column vector, the Jacobian of a scalar with respect to $\beta$ should be a row vector:

$$\frac{\partial x^T\beta}{\partial \beta}=x^T.$$

What is missing in my analysis?

Under the Jacobian (numerator-layout) convention, shouldn't $\frac{\partial x^T\beta}{\partial \beta}$ be a row vector instead of a column vector?

$$\frac{\partial \operatorname{EPE}}{\partial \beta}= \left (\frac{\partial \operatorname{EPE}}{\partial \beta_{1}}, \frac{\partial \operatorname{EPE}}{\partial \beta_{2}}, \dots, \frac{\partial \operatorname{EPE}}{\partial \beta_{N}} \right )$$
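For what it's worth, here is a quick numerical sanity check I ran (my own sketch, not from the text): whichever layout convention you adopt, the componentwise gradient of the squared error is $-2(y-x^T\beta)x_i$, which finite differences confirm.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)      # feature (column) vector
beta = rng.normal(size=3)   # parameter vector
y = 1.5                     # scalar target

def sq_err(b):
    """Squared error (y - x^T b)^2 for a single observation."""
    return (y - x @ b) ** 2

# Central finite-difference gradient of the squared error w.r.t. beta.
eps = 1e-6
grad_fd = np.array([
    (sq_err(beta + eps * e) - sq_err(beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

# Analytic components: -2 (y - x^T beta) x_i.
grad_analytic = -2 * (y - x @ beta) * x

assert np.allclose(grad_fd, grad_analytic, atol=1e-5)
```

Whether those three numbers are stacked as a row or a column is purely a layout convention; the components themselves are the same.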

2 Answers


I think a transpose is missing in the original solution, but it does not affect the final result.

The aim is to minimize the expected prediction error given by: $$\mathrm{EPE}(\beta)=\int (y-x^{T}\beta)^{2}\,\mathrm{Pr}(dx, dy),$$ where $x$ is a vector of random variables, $y$ is the target random variable, and $\beta$ is the parameter vector. Differentiating $\mathrm{EPE}(\beta)$ w.r.t. $\beta$ under the Jacobian (numerator-layout) convention gives:

$$\frac{\partial \mathrm{EPE}(\beta)}{\partial \beta}=\int 2(y-x^{T}\beta)(-1)x^{T}\mathrm{Pr}(dx, dy).$$

Because $(y-x^{T}\beta)$ is a scalar ($1\times1$), it equals its own transpose, so this can equivalently be written as:

$$\frac{\partial \mathrm{EPE}(\beta)}{\partial \beta}=\int 2(y-x^{T}\beta)^{T}(-1)x^{T}\mathrm{Pr}(dx, dy).$$

Solving $\frac{\partial \mathrm{EPE}(\beta)}{\partial \beta}=0$ gives:

$$\begin{aligned}
\operatorname{E}\left[yx^{T}-\beta^{T}xx^{T}\right] &= 0\\
\operatorname{E}\left[\beta^{T}xx^{T}\right] &= \operatorname{E}\left[yx^{T}\right]\\
\operatorname{E}\left[xx^{T}\beta\right] &= \operatorname{E}\left[xy\right] && \text{(transposing both sides)}\\
\beta &= \operatorname{E}\left[xx^{T}\right]^{-1}\operatorname{E}\left[xy\right].
\end{aligned}$$
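You can verify this closed form numerically by replacing the expectations with sample averages (a quick sketch of mine, not from the solution manual); it agrees with ordinary least squares on the same data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10_000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Sample analogues of E[x x^T] and E[x y].
Exx = X.T @ X / n
Exy = X.T @ y / n

# beta = E[x x^T]^{-1} E[x y]
beta_hat = np.linalg.solve(Exx, Exy)

# Same as ordinary least squares on the sample.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_ols)
```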

A more detailed explanation can be found here.


Suppose $x^T=\begin{pmatrix} x_1 & x_2\end{pmatrix}$ and $\beta=\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$.

Then $x^T\beta= x_1\beta_1 + x_2\beta_2$.

Treating the gradient as a column vector, the derivative w.r.t. $\beta$ is

$$\frac{\partial (x^T\beta) }{\partial \beta}=\begin{pmatrix} \frac{\partial (x^T\beta)}{\partial \beta_1} \\ \frac{\partial (x^T\beta)}{\partial \beta_2} \end{pmatrix}=\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}=x$$
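The two-dimensional computation above is easy to check with finite differences (a small sketch of my own, not part of the answer): the components of the gradient of $x^T\beta$ w.r.t. $\beta$ are exactly $x_1$ and $x_2$.

```python
import numpy as np

x = np.array([2.0, 5.0])
beta = np.array([0.3, -0.7])

def f(b):
    return x @ b  # x^T b, a scalar

# Central finite differences, one coordinate of beta at a time.
eps = 1e-6
grad = np.array([
    (f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

# The gradient's components are x1 and x2, i.e. the vector x itself.
assert np.allclose(grad, x)
```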

  • According to the Jacobian matrix, shouldn't the result be a row vector instead of a column vector? Please see my edit. (2017-01-30)