Having gained an informal appreciation for optimization a while back, but only now starting to work through the stepping stones toward a formal understanding of it, I have some contextualization issues I'm hoping someone can help me resolve.
I'm reading a book on dynamical systems and also taking a statistics class, so I now find myself looking back at mean squared error and gradient descent from both a dynamical-systems and a statistical perspective.
- We update the weight matrix $\mathbf{W}$ according to: $$\frac{d}{dt}\mathbf{W} = -\frac{\partial E(\mathbf{W})}{\partial \mathbf{W}}$$ or $$\mathbf{\dot{W}} = -E'(\mathbf{W})$$ which is symbolically equivalent to the dynamical system $\dot x = f(x)$
- However, something about the process is probabilistic, although I cannot formally put my finger on anything in particular with certainty. I see a bunch of elements from statistics going on, but I am having trouble making a one-to-one correspondence between the statements made by the learning equations and concepts elsewhere in statistics. For example, the definition of mean squared error looks suspiciously like a variance: $$E(\mathbf{W}) = \frac{1}{n}\sum^n_{i=1} (\hat{Y}_i - Y_i)^2$$ I mean, I can even imagine myself writing something like $Y_i = E[\hat{Y}_i]$, i.e. the expectation.
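To make the dynamical-systems reading concrete to myself, I tried a quick numerical sketch: gradient descent is just the forward-Euler discretization of the gradient flow $\dot{\mathbf{W}} = -E'(\mathbf{W})$. (The linear model, data, and step size below are my own illustrative choices, not from any particular source.)

```python
import numpy as np

def grad_E(W, X, Y):
    """Gradient of E(W) = mean((X @ W - Y)**2) with respect to W."""
    n = X.shape[0]
    return 2.0 / n * X.T @ (X @ W - Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W_true = np.array([1.0, -2.0, 0.5])
Y = X @ W_true  # noiseless targets, just for illustration

W = np.zeros(3)
dt = 0.1  # Euler step size; plays the role of the learning rate
for _ in range(500):
    # Forward-Euler step for dW/dt = f(W), with f(W) = -E'(W):
    W = W - dt * grad_E(W, X, Y)

print(np.allclose(W, W_true, atol=1e-4))
```

So the iteration $W_{k+1} = W_k + \Delta t \, f(W_k)$ flows toward a fixed point of $\dot{\mathbf{W}} = -E'(\mathbf{W})$, i.e. a stationary point of $E$.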
I'm having trouble figuring out which distribution this variance would belong to: the distribution of the weights, or of the outputs?
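For what it's worth, here is the small numerical check I ran while thinking about this (the names and data here are my own, purely illustrative): the MSE is the second moment of the residuals $r_i = \hat{Y}_i - Y_i$ about zero, and it equals the population variance of the residuals exactly when their mean is zero, via the identity $\mathrm{MSE} = \mathrm{Var}(r) + \bar{r}^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=1000)
yhat = y + rng.normal(scale=0.3, size=1000)  # predictions with zero-mean error

r = yhat - y                # residuals
mse = np.mean(r**2)         # the "variance-like" quantity in the loss
var_r = np.var(r)           # population variance: mean((r - r.mean())**2)

# Exact identity: MSE = Var(r) + mean(r)**2
assert np.isclose(mse, var_r + np.mean(r)**2)
```

So if it is a variance of anything, it seems to be a variance of the residuals (conditional on the data), which is part of what I'm trying to pin down.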