I am studying Machine Learning, but I believe you guys should be able to help me with this!
Basically, we are given a set of training data $\{(x_1,y_1), (x_2,y_2), \ldots, (x_N, y_N)\}$, and we need to train a perceptron to fit the data as well as possible. A perceptron here is a simple model consisting of a weight vector $w$, and the perceptron outputs $w^Tx$ for an input $x$.
We define an error function $E(w) = \frac{1}{2N} \sum_{d=1}^{N}(t_d - o_d)^2$, where $t_d - o_d$ is simply the difference between the ideal target value $t_d = y_d$ and our output $o_d = w^T x_d$.
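To make this concrete, here is a minimal NumPy sketch of how I compute $E(w)$ (the data, variable names, and shapes here are all made up by me for illustration):

```python
import numpy as np

# Toy data, made up for illustration: N = 3 examples with 2 features each.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0]])      # row d is the input x_d
t = np.array([1.0, -1.0, 2.0])  # target values t_d
w = np.zeros(X.shape[1])        # weight vector

def error(w, X, t):
    """E(w) = 1/(2N) * sum_d (t_d - o_d)^2, with o_d = w^T x_d."""
    o = X @ w                   # outputs o_d for all examples at once
    return np.mean((t - o) ** 2) / 2.0
```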
One way to minimize the error is to compute the gradient, and we obtain
$\frac{\partial E}{\partial w_i} = \frac{1}{N} \sum_{d=1}^{N}(t_d - o_d)(-x_{id})$,
where $x_{id}$ is the $i$-th component of the input $x_d$.
Now, in a computer algorithm, I can, on each iteration, update each $w_i$ to $w_i - \eta \frac{\partial E}{\partial w_i}$, but the problem is that this is slow to compute, since the sum ranges over the whole training set.
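For reference, this is how I currently implement one full-batch update (again just a sketch, reusing the made-up `X`, `t`, `w` from above; `eta` is the learning rate $\eta$):

```python
def batch_gd_step(w, X, t, eta=0.1):
    """One full-batch gradient descent step: w <- w - eta * dE/dw.

    Since dE/dw_i = (1/N) * sum_d (t_d - o_d) * (-x_{id}),
    the whole gradient vector is -(1/N) * X^T (t - o).
    """
    o = X @ w                          # o_d = w^T x_d for every example
    grad = -(X.T @ (t - o)) / len(t)   # averages over the entire training set
    return w - eta * grad
```

The slow part is exactly the `X @ w` and `X.T @ (t - o)` products, which touch every training example on every single iteration.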
So what has been invented is the LMS (Least Mean Squares) rule, which claims that
$\frac{1}{N} \sum_{d=1}^{N}(t_d - o_d)(-x_{id}) \approx (t_d - o_d)(-x_{id}),$
where $d$ on the right-hand side is the index of the current training example. This means I can perform my gradient descent update using just the current training example.
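As I understand it, the LMS version would then look like this (a sketch under the same made-up setup; one sweep over the examples is one epoch):

```python
def lms_step(w, x_d, t_d, eta=0.1):
    """LMS / stochastic update using a single example (x_d, t_d).

    The per-example quantity (t_d - o_d) * (-x_d) stands in for the
    full average over the training set.
    """
    o_d = w @ x_d
    return w - eta * (t_d - o_d) * (-x_d)   # i.e. w + eta * (t_d - o_d) * x_d

# One epoch: sweep through the examples one at a time.
for x_d, t_d in zip(X, t):
    w = lms_step(w, x_d, t_d)
```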
Now, after this intro, I would like to ask for a bit more intuition and formality behind this LMS rule, and why it is a good enough approximation. I guess I would like a bit more explanation of the $\approx$ part of the above equation: when does it hold, and how closely? Thanks for any help.