4

I have a basic question that I can't seem to find an answer for -- perhaps I'm not wording it correctly. Suppose that we have an $n$-by-$d$ matrix $X$ that represents input features, and an $n$-by-$1$ vector of output labels, $y$. Furthermore, let these labels be a noisy linear transformation of the input features:

$$ y = X \cdot w + \epsilon$$

where $w$ is a $d$-dimensional vector of true weights and $\epsilon$ is i.i.d. zero-mean Gaussian noise. I am interested in inferring $w$ using ordinary least squares (OLS) linear regression.

I would like to prove that, as the number of data points $n$ increases, the weights estimated by OLS converge in probability to the true weights $w$ -- say, in $\ell_2$-norm.

Can anyone help me to go about proving this, or point me to references? Thanks!

As pointed out in the comments, it is important to keep in mind how $X$ is constructed. Let $X$ be a random matrix.
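For concreteness, here is a minimal sketch of the setup in Python/NumPy; the particular distributions chosen for $X$ and $\epsilon$ below are arbitrary choices of mine, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 5

w = rng.normal(size=d)               # true weights (unknown to the estimator)
X = rng.normal(size=(n, d))          # random n-by-d feature matrix (arbitrary choice)
eps = rng.normal(scale=0.1, size=n)  # i.i.d. zero-mean Gaussian noise
y = X @ w + eps                      # noisy linear labels, y = Xw + eps

# OLS estimate of w; the question is whether this converges to w as n grows.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(w_hat - w))
```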

  • 0
    Are you aware that OLS is equivalent to maximum likelihood estimation on the coefficients of $w$?2017-01-07
  • 0
    @πr8 : It's equivalent because of the assumption of i.i.d. Gaussians, but one should be able to prove this with far weaker assumptions.2017-01-07
  • 0
    @MichaelHardy Fair - I began working through an answer and actually realised part-way through that the Gaussian assumption wasn't totally necessary; it was just a gut feeling / a suggestion for a way of starting on the question.2017-01-07
  • 1
    Increasing the sample size $n$ means increasing the number of rows in the matrix $X$. This leaves a big question: In what way does the new row of $X$ vary within the $d$-dimensional row space as more and more new rows keep getting added? Without that, you don't have a well-defined question. In fact, one may ask how best to do this in order to make the least-squares estimate as accurate as possible; that is the problem of design of experiments.2017-01-07
  • 0
    Thank you @MichaelHardy for pointing that out. What I had in mind was that $X$ is a random matrix.2017-01-08

2 Answers

2

Here is a probabilistic approach to the proof, assuming that $X$ is a random matrix whose rows $x_i^T$ are i.i.d. with $E\|x_i\|^2 < \infty$ (and $Ey_i^2 < \infty$), and that $E[x_i x_i^T]$ is invertible.

Your minimization problem is $$\min_{w\in \mathbb{R}^d} \| Xw - y \|_2^2. $$ It is straightforward to show that the OLS estimator (the "weights", in your terminology) is given by $$ \hat{w}=(X'X)^{-1}X'y, $$ or equivalently, writing $x_i$ for the $i$th row of $X$ viewed as a column vector, $$ \hat{w}= \left( \frac 1 n \sum_{i=1}^n x_i x_i^T\right)^{-1} \left( \frac 1 n \sum_{i=1}^n x_i y_i\right). $$ Note that convergence in probability in the $\ell_2$ norm, i.e. $\|\hat w - w\|_2 \xrightarrow{p} 0$, is equivalent to componentwise convergence in probability, so it suffices to show $\hat w \xrightarrow{p} w$. By the WLLN, $$ \frac 1 n \sum_{i=1}^n x_i x_i^T \xrightarrow{p} E[x_1 x_1^T] $$ as $n\to \infty$, and $$ \frac 1 n \sum_{i=1}^n x_i y_i \xrightarrow{p} E[x_1 y_1]. $$ Hence, by the continuous mapping theorem applied to $f(A,b)=A^{-1}b$ (which is continuous at every invertible $A$), $$ \hat{w}\xrightarrow{p} \left(E[x_1 x_1^T]\right)^{-1} E[x_1 y_1]. $$ Finally, since $y_1 = x_1^T w + \epsilon_1$ with $\epsilon_1$ zero-mean and independent of $x_1$, we have $E[x_1 y_1] = E[x_1 x_1^T]\, w$, so the limit above is exactly the population ("true") weight vector $w$.
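To see the argument numerically, here is a small sketch (my own illustration, with an arbitrary choice of distributions) that computes $\hat{w}$ from the sample-moment form above and shows $\|\hat w - w\|_2$ shrinking as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
w_true = rng.normal(size=d)  # population ("true") weights

def ols_error(n):
    """Return ||w_hat - w_true||_2 for an OLS fit on n simulated observations."""
    X = rng.normal(size=(n, d))                     # i.i.d. rows with finite second moments
    y = X @ w_true + rng.normal(scale=0.5, size=n)  # y = Xw + eps, zero-mean noise
    S_xx = (X.T @ X) / n   # (1/n) sum_i x_i x_i^T  ->  E[x x^T]
    S_xy = (X.T @ y) / n   # (1/n) sum_i x_i y_i    ->  E[x y]
    w_hat = np.linalg.solve(S_xx, S_xy)
    return np.linalg.norm(w_hat - w_true)

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}   error = {ols_error(n):.4f}")  # shrinks roughly like 1/sqrt(n)
```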

  • 0
    You are right. I've deleted the sentence.2017-01-07
  • 0
    See my comment on sample size under the question. How are you dealing with that?2017-01-07
  • 0
    Assuming that $E\|x\|^2 < \infty$ is not enough?2017-01-07
  • 0
    Actually, it seems you're assuming the entries in $X$ are themselves random. I took the question, mentioning "Gaussian noise", to be attributing randomness only to the noise.2017-01-07
  • 1
    Ok. Thanks. I guess I'll add these assumptions to my answer...2017-01-07
  • 0
    Thank you @V.V. I am following your work -- I want to confirm that the estimate for the weights should be $\hat{w}= \left( \frac 1 n \sum_{i=1}^n x_i^T x_i\right)^{-1} \left( \frac 1 n \sum_{i=1}^n x_i^T y_i\right)$, correct? (I switched $x_i^T$ and $x_i$ in the first summation.)2017-01-08
  • 0
    It depends on whether $x_i$ is a row or a column vector: $x_i^T x_i$ usually stands for a scalar (inner) product, whereas $X'X = \sum_{i=1}^n x_i x_i^T$, where $x_i x_i^T$ is the "design" matrix of the $i$th observation.2017-01-08
  • 0
    Hmm, but just to be sure $x_i^Tx_i$ should result in a $d$-by-$d$ matrix, not a $1$-by-$1$ scalar, correct?2017-01-08
  • 0
    For each $i$ you have $x_i x_i' = \begin{pmatrix} x_1^2 & x_1 x_2 & \dots & x_1 x_d \\ \vdots & & & \vdots \\ x_n x_1 & \dots & \dots & x_n^2 \end{pmatrix}$
  • 0
    Hmm, I think you mean $\begin{pmatrix} x_1^2 & x_1 x_2 & \dots & x_1 x_d \\ \vdots & & & \vdots \\ x_d x_1 & \dots & \dots & x_d^2 \end{pmatrix}$ ?2017-01-08
  • 0
    Yes, thanks, Mistyped. :-)2017-01-08
2

Here I'm going to weaken the assumptions: rather than "Gaussian" we'll say the errors have expected value $0$ and finite variance; rather than "identically distributed" we'll say they all have the same variance; and rather than "independent" we'll say "uncorrelated", i.e. all the covariances are $0$.

The least-squares estimator of $w$ is $$ \hat w = (X^T X)^{-1} X^T y. $$ (BTW, one must have $\operatorname{rank}(X) = d$; otherwise the matrix inverse above makes no sense and the least-squares estimator is not unique.) Then we have $$ \operatorname{var} (\hat w) = (\text{a }d\times d\text{ matrix }) = \operatorname{E}\big( (\hat w - w)(\hat w - w)^T\big) = \sigma^2(X^T X)^{-1} \tag 1 $$ where $\sigma^2$ is the noise variance; we get this via the identity $$ \operatorname{var}(Av) = A\Big(\operatorname{var}(v) \Big) A^T \text{ for any random vector } v \text{ and any conformable constant matrix } A, $$ applied with $v = y$ and $A = (X^T X)^{-1} X^T$.
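For completeness, here is the short computation behind $(1)$, a sketch under the stated assumptions ($\operatorname{E}\epsilon = 0$ and $\operatorname{var}(\epsilon) = \sigma^2 I$): since $y = Xw + \epsilon$, $$ \hat w - w = (X^T X)^{-1} X^T (Xw + \epsilon) - w = (X^T X)^{-1} X^T \epsilon, $$ and therefore $$ \operatorname{var}(\hat w) = \operatorname{E}\big( (\hat w - w)(\hat w - w)^T \big) = (X^T X)^{-1} X^T \operatorname{E}(\epsilon \epsilon^T)\, X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}. $$ In particular $\operatorname{E}\hat w = w$, i.e. the estimator is unbiased; this is used together with Chebyshev's inequality below.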

Increasing the sample size $n$ means increasing the number of rows in the matrix $X$. This leaves a big question: In what way does the new row of $X$ vary within the $d$-dimensional row space as more and more new rows keep getting added? Without that, you don't have a well-defined question. So here I'll go out on a limb and say that we will increase the sample size by stacking $k$ copies of $X$ on top of each other: $$ \begin{bmatrix} X \\ \vdots \\ X \end{bmatrix} = \text{an } (nk)\times d \text{ matrix} \tag 2 $$ and we'll let $k$ keep growing; the $nk$ errors will all be uncorrelated with each other and otherwise satisfy the assumptions. This just means we're doing replications with the same set of $X$ values. In some contexts, this would be considered repetition rather than replication: we get new $y$ values for each row of $X$, rather than getting new rows of $X$ different from those we already had.

  • Show that $(1) \to 0$ as $k\to\infty$ (except that instead of $X$ in $(1)$ we'll have the matrix $(2)$.)
  • Observe that the variances of the components of $\hat w$ are the diagonal elements of $(1).$
  • Finally, apply Chebyshev's inequality; a worked sketch of these three steps follows.
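Carrying out these steps (a sketch, writing $\hat w_k$ for the least-squares estimator computed from the $k$ stacked copies in $(2)$): the stacked design satisfies $$ \begin{bmatrix} X \\ \vdots \\ X \end{bmatrix}^T \begin{bmatrix} X \\ \vdots \\ X \end{bmatrix} = k\, X^T X, $$ so by $(1)$, $$ \operatorname{var}(\hat w_k) = \sigma^2 (k\, X^T X)^{-1} = \frac{\sigma^2}{k} (X^T X)^{-1} \longrightarrow 0 \quad\text{entrywise as } k \to \infty. $$ In particular each diagonal entry $\operatorname{var}(\hat w_{k,j}) \to 0$. Since $\operatorname{E}\hat w_k = w$, Chebyshev's inequality gives, for every $\delta > 0$ and each component $j$, $$ \Pr\big( |\hat w_{k,j} - w_j| > \delta \big) \le \frac{\operatorname{var}(\hat w_{k,j})}{\delta^2} \longrightarrow 0, $$ so each component of $\hat w_k$ converges in probability to the corresponding component of $w$, and hence $\|\hat w_k - w\|_2 \xrightarrow{p} 0$.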