I have a basic question that I can't seem to find an answer for -- perhaps I'm not wording it correctly. Suppose we have an $n$-by-$d$ matrix $X$ representing input features, and an $n$-by-$1$ vector of output labels $y$. Furthermore, let the labels be a noisy linear transformation of the input features:
$$ y = X \cdot w + \epsilon$$
where $w$ is a $d$-dimensional vector of true weights and $\epsilon$ is i.i.d. zero-mean Gaussian noise. I am interested in estimating $w$ using ordinary least squares (OLS) linear regression.
I would like to prove that as the number of data points $n$ increases, the OLS estimate $\hat{w}$ converges in probability to the true weights $w$ -- say, in $\ell_2$-norm.
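To frame where I'm stuck: substituting the model $y = Xw + \epsilon$ into the closed form of the OLS estimator (assuming $X^\top X$ is invertible) gives

$$\hat{w} = (X^\top X)^{-1} X^\top y = w + (X^\top X)^{-1} X^\top \epsilon,$$

so the claim seems to reduce to showing that the error term $(X^\top X)^{-1} X^\top \epsilon$ converges to $0$ in probability, but I'm not sure how to make that rigorous.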
Can anyone help me to go about proving this, or point me to references? Thanks!
Edit: As pointed out in the comments, the answer depends on how $X$ is constructed. For this question, let $X$ be a random matrix.
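For what it's worth, a quick simulation does suggest the convergence holds. The sketch below (my own setup, not from any reference: standard Gaussian i.i.d. rows for $X$, unit-variance noise, and a fixed arbitrary $w$) fits OLS at increasing $n$ and tracks the $\ell_2$ error:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5
w_true = rng.normal(size=d)  # arbitrary "true" weight vector for the experiment

def ols_error(n):
    """Draw n samples from the model y = Xw + eps, fit OLS,
    and return the l2 distance between the estimate and w_true."""
    X = rng.normal(size=(n, d))           # random design: i.i.d. standard Gaussian rows
    y = X @ w_true + rng.normal(size=n)   # i.i.d. zero-mean Gaussian noise
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.linalg.norm(w_hat - w_true)

errors = {n: ols_error(n) for n in (10**2, 10**4, 10**6)}
for n, err in errors.items():
    print(f"n = {n:>8}: ||w_hat - w||_2 = {err:.5f}")
```

The error shrinks roughly like $1/\sqrt{n}$ in this setup, which is consistent with the convergence-in-probability claim, but of course a simulation is not a proof.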