
Let $f(x_1,x_2,\dots,x_n):\mathbb{R}^n \rightarrow \mathbb{R}$. The gradient is defined as $\frac{\partial f}{\partial x_1}\hat{e}_1 + \dots + \frac{\partial f}{\partial x_n}\hat{e}_n,$

which is a vector.

Reading this definition, I understand that each component of the gradient gives the rate of change of my objective function if I move along the direction $\hat{e}_i$.

But I can't see why this vector, as defined, has anything to do with steepest descent.

Why do I get the maximal rate of change if I move along the direction of the gradient?

10 Answers


Each component of the gradient tells you how fast your function is changing with respect to the standard basis. It's not too far-fetched, then, to wonder how fast the function might be changing with respect to some arbitrary direction. Letting $\vec v$ denote a unit vector, we can project along this direction in the natural way, namely via the dot product $\text{grad}(f(a))\cdot \vec v$. This is a fairly common definition of the directional derivative.

We can then ask: in what direction is this quantity maximal? You'll recall that $\text{grad}(f(a))\cdot \vec v = |\text{grad}(f(a))||\vec v|\cos(\theta).$

Since $\vec v$ is a unit vector, this reduces to $|\text{grad}(f(a))|\cos(\theta)$, which is maximal when $\cos(\theta)=1$, i.e. when $\vec v$ points in the same direction as $\text{grad}(f(a))$.
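As a quick numerical sketch of this (using a made-up example $f(x,y) = x^2 + 3y$ and a made-up point; neither is from the answer), we can sweep unit vectors $\vec v$ and confirm that the directional derivative peaks when $\vec v$ points along the gradient:

```python
import math

# Illustrative example: f(x, y) = x^2 + 3y, with gradient (2x, 3).
def grad_f(x, y):
    return (2 * x, 3.0)

def directional_derivative(x, y, vx, vy, h=1e-6):
    # Central-difference estimate of the rate of change along the unit vector (vx, vy).
    f = lambda a, b: a ** 2 + 3 * b
    return (f(x + h * vx, y + h * vy) - f(x - h * vx, y - h * vy)) / (2 * h)

x, y = 1.0, 2.0
gx, gy = grad_f(x, y)
gnorm = math.hypot(gx, gy)
grad_theta = math.atan2(gy, gx)  # angle of the gradient direction

# Sweep 100 unit directions and pick the angle with the largest rate of change.
best_theta = max(
    (k * 2 * math.pi / 100 for k in range(100)),
    key=lambda t: directional_derivative(x, y, math.cos(t), math.sin(t)),
)
```

The winning angle agrees with the gradient's angle up to the sweep resolution, and the maximal rate of change agrees with $|\text{grad}(f(a))|$, as the $\cos(\theta)$ argument predicts.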


Other answers are correct in using the directional derivative to show that the gradient is the direction of steepest ascent/descent. However, I think it is instructive to look at the definition of the directional derivative from first principles to understand why this is so (it is not arbitrarily defined to be the dot product of the gradient and the directional vector).

Let $f(\mathbf{x}):\mathbb{R}^n \rightarrow \mathbb{R}$. The partial derivatives of $f$ are the rates of change along the basis vectors of $\mathbf{x}$:

$\textrm{rate of change along }\mathbf{e}_i = \lim_{h\rightarrow 0} \frac{f(\mathbf{x} + h\mathbf{e}_i)- f(\mathbf{x})}{h} = \frac{\partial f}{\partial x_i}$

Each partial derivative is a scalar. It is simply a rate of change.

The gradient of $f$ is then defined as the vector:

$\nabla f = \sum_{i} \frac{\partial f}{\partial x_i} \mathbf{e}_i$

We can naturally extend the concept of the rate of change along a basis vector to a (unit) vector pointing in an arbitrary direction. Let $\mathbf{v}$ be such a vector, i.e., $\mathbf{v} = \sum_{i} \alpha_i \mathbf{e}_i$ where $\sum_{i} \alpha_i^2 = 1$. Then:

$\textrm{rate of change along }\mathbf{v} = \lim_{h\rightarrow 0} \frac{f(\mathbf{x} + h\mathbf{v}) - f(\mathbf{x})}{h}$

Again, this quantity is a scalar.

Now, it can be proven that if $f$ is differentiable at $\mathbf{x}$, the limit above evaluates to: $(\nabla f) \cdot \mathbf{v}$. This is a dot product of two vectors, which returns a scalar.

We know from linear algebra that the dot product is maximized when the two vectors point in the same direction. This means that the rate of change along an arbitrary vector $\mathbf{v}$ is maximized when $\mathbf{v}$ points in the same direction as the gradient. In other words, the gradient corresponds to the rate of steepest ascent/descent.
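The differentiability claim is easy to check numerically. A minimal sketch (the function $f(\mathbf{x}) = \sin x_1 + x_2 x_3$, the point, and the random direction are all invented for illustration) compares the limit quotient against $(\nabla f) \cdot \mathbf{v}$:

```python
import math
import random

# Illustrative example: f(x1, x2, x3) = sin(x1) + x2*x3,
# with gradient (cos(x1), x3, x2).
def f(p):
    return math.sin(p[0]) + p[1] * p[2]

def grad(p):
    return [math.cos(p[0]), p[2], p[1]]

p = [0.5, -1.0, 2.0]

# A random unit direction v = sum_i alpha_i e_i with sum_i alpha_i^2 = 1.
random.seed(0)
v = [random.gauss(0, 1) for _ in range(3)]
norm = math.sqrt(sum(c * c for c in v))
v = [c / norm for c in v]

h = 1e-6
rate = (f([a + h * b for a, b in zip(p, v)]) - f(p)) / h  # the limit quotient
dot = sum(g * c for g, c in zip(grad(p), v))              # (grad f) . v
```

For small $h$ the two numbers agree to several decimal places, as the stated result promises.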


Consider a Taylor expansion of this function: $f({\bf r}+{\bf\delta r})=f({\bf r})+(\nabla f)\cdot{\bf\delta r}+\ldots$ The linear correction term $(\nabla f)\cdot{\bf\delta r}$ is maximized when ${\bf\delta r}$ points in the direction of $\nabla f$.
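A numerical sketch of this Taylor argument (the function $f(x,y) = e^x + xy$ and the base point are invented for illustration): evaluate $f$ on a small circle of radius $|{\bf\delta r}| = \varepsilon$ around a point and check that the largest value occurs in the gradient direction.

```python
import math

# Toy example: f(x, y) = exp(x) + x*y, with gradient (exp(x) + y, x).
def f(x, y):
    return math.exp(x) + x * y

x0, y0 = 0.3, 1.2
gx, gy = math.exp(x0) + y0, x0
grad_deg = math.degrees(math.atan2(gy, gx))  # angle of the gradient

eps = 1e-3  # fixed step length |delta r|
# Evaluate f at 360 points on the circle of radius eps and keep the best angle.
best = max(
    range(360),
    key=lambda d: f(x0 + eps * math.cos(math.radians(d)),
                    y0 + eps * math.sin(math.radians(d))),
)
```

To first order the winning angle matches the gradient's angle; any tiny mismatch comes from the higher-order terms in the expansion.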


The question you're asking can be rephrased as "In which direction is the directional derivative $\nabla_{\hat{u}}f$ a maximum?".

Assuming differentiability, $\nabla_{\hat{u}}f$ can be written as:

$\nabla_{\hat{u}}f = \nabla f(\textbf{x}) \cdot \hat{u} =|\nabla f(\textbf{x})||\hat{u}|\cos \theta = |\nabla f(\textbf{x})|\cos \theta$

which is a maximum when $\theta =0$: when $\nabla f(\textbf{x})$ and $\hat{u}$ are parallel.


Each component of the derivative $\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}$ tells you how fast your function is changing with respect to the standard basis. It is now possible to change to an orthogonal basis consisting of $n-1$ basis directions with $0$ ascent plus the gradient direction. In such a basis the gradient direction must be the steepest, since adding any of the other basis directions adds length but no ascent.

For a 3-dimensional vector space the basis could look like this: $ \left( \left( \begin{matrix} \partial x_2 \\ -\partial x_1 \\ 0 \end{matrix} \right) \left( \begin{matrix} \partial x_1 \\ \partial x_2 \\ -\dfrac{(\partial x_1)^2+(\partial x_2)^2}{\partial x_3} \end{matrix} \right) \left( \begin{matrix} \partial x_1 \\ \partial x_2 \\ \partial x_3 \end{matrix} \right) \right) $ By induction it can be shown that such a basis is constructible for an $n$-dimensional vector space: $ \left( \left( \begin{matrix} \partial x_2 \\ -\partial x_1 \\ 0 \\ 0 \end{matrix} \right) \left( \begin{matrix} \color{blue}{\partial x_1 \\ \partial x_2} \\ -\dfrac{(\partial x_1)^2+(\partial x_2)^2}{\partial x_3} \\ 0 \end{matrix} \right) \left( \begin{matrix} \color{blue}{\partial x_1 \\ \partial x_2} \\ \color{green}{\partial x_3} \\ -\dfrac{(\partial x_1)^2+(\partial x_2)^2+(\partial x_3)^2}{\partial x_4} \end{matrix} \right) \left(\begin{matrix} \color{blue}{\partial x_1 \\ \partial x_2} \\ \color{green}{\partial x_3} \\ \color{orange}{\partial x_4} \end{matrix} \right) \right) $ One can see here that the first basis vector forces, by orthogonality, the first two entries of the following basis vectors to be $\partial x_1$ and $\partial x_2$; similarly, the second vector forces the third entry of the following vectors to be $\partial x_3$, and the third vector forces the fourth entry to be $\partial x_4$.

If another dimension is added, the $(n+1)$-th entry of the $n$-th vector needs to be $-\dfrac{(\partial x_1)^2+\dots+(\partial x_n)^2}{\partial x_{n+1}}$ to meet the $0$-ascent condition, which in turn forces the new $(n+1)$-th vector to be of the form $\left(\begin{matrix}\partial x_1 \\ \vdots \\ \partial x_{n+1}\end{matrix}\right)$ for it to be orthogonal to the rest.
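A small numerical sanity check of this construction (a sketch for $n = 4$, with an arbitrary example gradient $g = (\partial x_1, \dots, \partial x_4)$ whose components are nonzero so the divisions are defined):

```python
# Example gradient for n = 4; the values are arbitrary but nonzero.
g = [1.0, 2.0, 3.0, 4.0]

def sq_sum(k):
    # (dx_1)^2 + ... + (dx_k)^2
    return sum(c * c for c in g[:k])

# The basis from the answer: n-1 zero-ascent directions plus the gradient itself.
basis = [
    [g[1], -g[0], 0.0, 0.0],
    [g[0], g[1], -sq_sum(2) / g[2], 0.0],
    [g[0], g[1], g[2], -sq_sum(3) / g[3]],
    list(g),
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

All pairs of distinct basis vectors come out orthogonal, and the first three are orthogonal to $g$, i.e. they have $0$ ascent, as claimed.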


Let $\vec v$ be an arbitrary unit vector. The rate of change of $f$ when moving in the direction $\vec v$, starting at the point $a$, is given by $\text{grad}(f(a)) \cdot \vec v$. We want to find a $\vec v$ for which this inner product is maximal. For the inner product we have the Cauchy–Schwarz inequality $\vec a \cdot \vec b \leq |\vec a||\vec b|$, with equality exactly when $\vec v = \lambda \, \text{grad}(f(a))$ for some $\lambda \geq 0$.


Let $v=\frac{s}{|s|}$ be a unit vector and assume that $v$ is a descent direction, i.e. $v^T\nabla f(x) < 0$. Then $f(x+\lambda v)$, as a function of $\lambda$, describes how $f$ changes along the direction $v$.

The rate of descent at $x$ along $v$ is given by: $ \frac{d}{d \lambda}f(x+\lambda v)\Big|_{\lambda=0} = v^T \nabla f(x) = \frac{s^T}{|s|}\nabla f(x) \equiv \frac{s^T}{|s|}g $ We want to find the extremum of this quantity as a function of $s$. Differentiating the above with respect to $s$ and setting it equal to zero, we get (noting that $\nabla_s|s| = \frac{s}{|s|}$): $g = (g^T v)v \equiv av$.

Taking the Euclidean norm: $|g|=|a||v|=|a| \Rightarrow a=\pm|g|$.

We choose the minus sign so that $v$ is a descent direction. Hence the direction of steepest descent is $ v = \dfrac{1}{a}g = -\dfrac{g}{|g|} $
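A quick numerical sketch of this conclusion (the gradient vector $g$ below is an arbitrary example): compare the descent rate $v^T g$ of random unit vectors against the claimed steepest direction $-g/|g|$.

```python
import math
import random

# Arbitrary example gradient g at the current point.
g = [2.0, -1.0, 0.5]
gnorm = math.sqrt(sum(c * c for c in g))

steepest = [-c / gnorm for c in g]                       # v = -g/|g|
steepest_rate = sum(a * b for a, b in zip(steepest, g))  # equals -|g|

def random_unit(n):
    # Normalize a Gaussian sample to get a uniformly random unit vector.
    v = [random.gauss(0, 1) for _ in range(n)]
    m = math.sqrt(sum(c * c for c in v))
    return [c / m for c in v]

random.seed(1)
rates = [sum(a * b for a, b in zip(random_unit(3), g)) for _ in range(1000)]
```

No sampled direction descends faster than $-g/|g|$, whose rate is exactly $-|g|$, matching the Cauchy–Schwarz bound.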


Just to further clarify why the gradient provides the steepest ascent (rather than descent) here: any differentiable $f$ can be approximated by its linear tangent plane, i.e., $f(\mathbf{x} + h \mathbf{v}) = f(\mathbf{x}) + h \, \nabla f(\mathbf{x})^T \mathbf{v}$ as $h \rightarrow 0$, for any direction $\mathbf{v}$ with $\|\mathbf{v}\| = 1$. As $h \downarrow 0$, consider the amount of change $f(\mathbf{x} + h \mathbf{v}) - f(\mathbf{x}) = h \, \nabla f(\mathbf{x})^T \mathbf{v} \in \left[ -h \, \|\nabla f(\mathbf{x})\|, \; h \, \|\nabla f(\mathbf{x})\| \right]$ by the Cauchy–Schwarz inequality. It reaches its maximum (increase) $h \, \|\nabla f(\mathbf{x})\|$ when $\mathbf{v} = \nabla f(\mathbf{x}) / \|\nabla f(\mathbf{x})\|$, and its minimum (i.e., maximum decrease) $-h \, \|\nabla f(\mathbf{x})\|$ when $\mathbf{v} = -\nabla f(\mathbf{x})/\|\nabla f(\mathbf{x})\|$ (the negative gradient direction).