My question is at the bottom. (Most of the notation and wording comes from Chris Bishop's Neural Networks for Pattern Recognition.)
Let $w$ be the weight vector of the neural network and $E$ the error function.
According to the Robbins-Monro argument, the sequence $$w_{kj}^{(r+1)}=w_{kj}^{(r)}-\eta\left.\frac{\partial E}{\partial w_{kj}}\right|_{w^{(r)}}$$ converges to a limit at which $$\frac{\partial E}{\partial w_{kj}}=0.$$
In general, the error function is a sum of terms, one for each pattern in the training set, so that $$E=\sum_n E^n(w).$$ In practice, however, the weight vector is updated using one pattern at a time: $$w_{kj}^{(r+1)}=w_{kj}^{(r)}-\eta\left.\frac{\partial E^n}{\partial w_{kj}}\right|_{w^{(r)}}.$$ (A small numerical sketch of the two update rules follows below.)
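To make the two update schemes concrete, here is a minimal sketch on a toy least-squares problem; the data, step size, and function names are just for illustration and are not from the book:

```python
import numpy as np

# Toy least-squares problem: E(w) = sum_n E^n(w), with E^n(w) = 0.5 * (w . x_n - t_n)^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # input patterns x_n
t = X @ np.array([1.0, -2.0, 0.5])       # targets generated from a known weight vector

def grad_pattern(w, n):
    """Gradient of the single-pattern error E^n at w."""
    return (X[n] @ w - t[n]) * X[n]

eta = 0.005

# Batch rule: one step per pass, using the full gradient dE/dw = sum_n dE^n/dw
w_batch = np.zeros(3)
for _ in range(200):
    w_batch -= eta * sum(grad_pattern(w_batch, n) for n in range(len(X)))

# Sequential (per-pattern) rule: a step after every individual pattern
w_seq = np.zeros(3)
for _ in range(200):
    for n in range(len(X)):
        w_seq -= eta * grad_pattern(w_seq, n)

print(w_batch, w_seq)   # both end up close to [1, -2, 0.5]
```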
My question is: why does the algorithm converge when the per-pattern update is used? After each per-pattern step the value of $w$ has already changed, so I cannot see how to prove convergence from $$\frac{\partial E}{\partial w_{kj}}=\sum_n \frac{\partial E^n}{\partial w_{kj}}.$$