
Consider f, a function whose minimum we want to find. A simple gradient descent algorithm would use

x_new = x_old - step * df/dx

If x is measured in [input] units and f(x) is measured in [output] units, the above equation suggests the step is measured in [input**2] / [output].
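For concreteness, the update rule can be sketched like this (a minimal example of my own, using f(x) = x², with an arbitrarily chosen step value):

```python
# Minimal gradient descent sketch on f(x) = x**2, whose derivative is 2x.
# The function name and parameter values are illustrative, not canonical.
def gradient_descent(df_dx, x0, step, n_iters):
    x = x0
    for _ in range(n_iters):
        x = x - step * df_dx(x)  # x_new = x_old - step * df/dx
    return x

x_min = gradient_descent(lambda x: 2 * x, x0=10.0, step=0.1, n_iters=100)
# x_min converges toward the minimum at x = 0
```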

How should one interpret this? I previously (naively) thought the step was dimensionless, but apparently not.

Now the equation looks a bit odd to me, reading like

x_new = x_old - small_change_in_y

Which it certainly isn't.

The equation still ticks a number of boxes, though (it moves x in the right direction and slows down as the optimum nears).

Thanks!

1 Answer


Perhaps this will help. This answer assumes that step refers to the step length.

You are correct that x_new and x_old are measured in the input space, and f(x) is measured in the output space. However, we are considering $\frac{df}{dx}$ and not $f$, so let's look closer at $\frac{df}{dx}$.

Just like $f$, $\frac{df}{dx}$ is a function on the input space. Therefore, let me write it as $\frac{df}{dx}(w)$, where $w$ is some point in the input space. Intuitively, if we are currently at $w$, then $\frac{df}{dx}(w)$ is the direction in the input space in which $f$ increases the most. More precisely, $\frac{df}{dx}(w)$ is the gradient of $f$ at $w$, written $\nabla f(w)$.

As an example, consider $f:\mathbb{R}^2\to \mathbb{R}$ defined by $$f\begin{pmatrix}x\\y\end{pmatrix}= x^2+y.$$ The gradient is $$\nabla f\begin{pmatrix}w\\z\end{pmatrix}= \begin{pmatrix}2w\\1\end{pmatrix}.$$ So this says that at the point $\begin{pmatrix}w\\z\end{pmatrix}$, the direction that provides the steepest ascent is $\begin{pmatrix}2w\\1\end{pmatrix}$. For example, if we are currently at the iterate $x_{old} = \begin{pmatrix}0\\0\end{pmatrix}$, then the direction we want to steer our algorithm if we want to ascend the quickest is $$\nabla f\begin{pmatrix}0\\0\end{pmatrix} = \begin{pmatrix}0\\1\end{pmatrix}.$$

Therefore, back to your question, the update formula reads like this:

x_old and x_new are points in the input space

df/dx (better written as $\frac{df}{dx}(x_{old})$) is a direction in the input space

step is just a scalar telling you how far to step in that direction.
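Putting those three pieces together, the answer's worked example $f(x,y)=x^2+y$ can be sketched as follows (the function names are mine, and the step value is arbitrary):

```python
# Gradient of f(x, y) = x**2 + y from the example above.
def grad_f(p):
    x, y = p
    return (2 * x, 1.0)  # the gradient (2x, 1): a direction in the input space

x_old = (0.0, 0.0)       # a point in the input space
step = 0.5               # a scalar step length
g = grad_f(x_old)        # at (0, 0) this is (0, 1)

# x_new = x_old - step * grad_f(x_old): again a point in the input space
x_new = (x_old[0] - step * g[0], x_old[1] - step * g[1])
```

Since we subtract the gradient, the update moves against the direction of steepest ascent, i.e. downhill in $y$ here.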

  • Then would it be fair to say the equation is a lazy version of `x_new = x_old - step × grad(f)/norm(grad(f))`, where step is measured in [input]? That would mean the step is literally how much you want to move by, and the other term is the direction of the movement. – 2017-01-27
  • Correct. Also, the value 'step' is typically positive: intuitively, you want to step forward in some direction. Mathematically, this is because the term -grad(f) is the direction of steepest descent. So we are stepping in the direction of steepest descent (i.e. -grad(f)), and the amount is dictated by step. – 2017-01-27
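The normalized variant discussed in these comments can be sketched like this (my own wording of it; the names are illustrative). Here `step` really is the distance moved, in input units, because `grad/norm(grad)` is a unit-length direction:

```python
import math

def normalized_update(grad, x_old, step):
    """Move exactly `step` input-units in the direction of -grad(x_old)."""
    g = grad(x_old)
    norm = math.hypot(*g)  # Euclidean length of the gradient
    return tuple(xi - step * gi / norm for xi, gi in zip(x_old, g))

# Using f(x, y) = x**2 + y from the answer: at (1, 0) the gradient is (2, 1).
x_new = normalized_update(lambda p: (2 * p[0], 1.0), (1.0, 0.0), step=0.1)
```

With this version the distance from `x_old` to `x_new` is exactly 0.1, whatever the gradient's magnitude, whereas the plain update in the question scales the move by the gradient's size as well.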