My question is at the bottom. (Most of the wording comes from Christopher Bishop's *Neural Networks for Pattern Recognition*.)
Let $w$ be the weight vector of the neural network and $E$ the error function.
According to the Robbins-Monro algorithm, the sequence $w_{kj}^{(r+1)}=w_{kj}^{(r)}-\eta\left.\frac{\partial E}{\partial w_{kj}}\right|_{w^{(r)}}$ converges to a limit at which $\frac{\partial E}{\partial w_{kj}}=0.$
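For concreteness, here is a minimal sketch of this full-gradient (batch) update on a toy linear least-squares problem. The data `X`, `t`, the quadratic error, and the learning rate are my own illustrative choices, not from the book:

```python
import numpy as np

# Toy setup (purely illustrative): a linear model with sum-of-squares error
# E(w) = sum_n (x_n . w - t_n)^2 over the whole training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # training patterns x_n
t = X @ np.array([1.0, -2.0, 0.5])     # targets t_n

def grad_E(w):
    """Gradient of the full error E(w) = sum_n E^n(w), evaluated at w."""
    return 2 * X.T @ (X @ w - t)

# Batch update: evaluate dE/dw at w^(r) and take a step against it.
w = np.zeros(3)
eta = 1e-3
for r in range(1000):
    w = w - eta * grad_E(w)
```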
In general the error function is a sum of terms, each of which is computed from one of the patterns in the training set, so that $E=\sum_n E^n(w)$. In practice we update the weight vector using one pattern at a time: $w_{kj}^{(r+1)}=w_{kj}^{(r)}-\eta\left.\frac{\partial E^n}{\partial w_{kj}}\right|_{w^{(r)}}$
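Continuing the same toy setup, the per-pattern (sequential) version replaces the full gradient with the gradient of a single $E^n$ at each step; I keep the fixed $\eta$ from the formula above:

```python
def grad_En(w, n):
    """Gradient of the single-pattern error E^n(w) = (x_n . w - t_n)^2."""
    return 2 * (X[n] @ w - t[n]) * X[n]

# Sequential update: present one training pattern per step, as in the last formula.
w = np.zeros(3)
eta = 1e-3
for r in range(10000):
    n = r % len(X)                     # one pattern at a time, cycling through the set
    w = w - eta * grad_En(w, n)
```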
My question is: why does the algorithm converge when the last formula is used? Once we use it to update $w$, the value of $w$ changes, so the single-pattern gradients are evaluated at different weight vectors, and I cannot prove convergence via $\frac{\partial E}{\partial w_{kj}}=\sum_n \frac{\partial E^n}{\partial w_{kj}}$.