
In optimization, gradient and Hessian information is used to intelligently find local minima. Could the derivative of the Hessian also be used in optimization algorithms?

  • 2
    An extremely broad question. So, a broad answer: the $(n+1)$-th Taylor term can be used to estimate the error of approximating the function using the first $n$ terms. (2017-02-07)
  • 0
    I'm interested in why most optimization algorithms only use gradients and Hessians, instead of exploring higher derivatives. (2017-02-07)

2 Answers

1

Taking a narrow view of your question: in gradient descent, you're using a few facts. One is that local minima occur at points where the gradient is zero. (They may also occur at boundaries, but in many problems where this type of algorithm is used, you have reason to believe the minimum is not at the boundary.) Another is that, at a point with zero gradient, a local extremum has a Hessian whose eigenvalues all have the same sign, and that sign tells you whether it's a min or a max. The derivative of the Hessian will not give you more information than that as far as characterizing the point you're currently evaluating. Third, if you're not already at a minimum, taking a "small enough" step in the direction of the negative gradient moves you to a point with a smaller function value. This last fact can be confirmed by taking a one-term Taylor expansion of $f(x-\eta \nabla f)$ around the point $x$ for "small" $\eta$.
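A minimal sketch of the iteration described above, using only the gradient; the target function, starting point, and step size $\eta$ here are my own illustrative choices:

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, tol=1e-8, max_iter=1000):
    """Plain gradient descent: repeatedly step along the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # gradient ~ 0: candidate local minimum
            break
        x = x - eta * g  # a "small enough" step decreases f
    return x

# Minimize f(x, y) = (x - 1)^2 + 2 y^2, whose gradient is (2(x - 1), 4y);
# the unique minimum is at (1, 0).
x_min = gradient_descent(lambda x: np.array([2 * (x[0] - 1), 4 * x[1]]),
                         x0=[5.0, -3.0])
```

Note that nothing here characterizes the stopping point beyond "gradient near zero"; that is exactly where the Hessian's eigenvalue signs come in.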

Taking a broader view: you presumably could use information about higher derivatives to move through the space more intelligently during your "descent". For example, you might be able to optimize the size of the step you take by bounding the error with higher-order terms of the Taylor series mentioned above. Perhaps someone will point to such an algorithm specifically. The practical question is whether the additional computation and complication of evaluating higher derivatives is worth the trouble, both in setting up the problem and in the actual execution of the algorithm.
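A small one-dimensional sketch of the error-bounding idea (the function and evaluation point are my own illustrative choices, not from any specific algorithm): the third-derivative term of the Taylor series gives the leading error of a quadratic model.

```python
import numpy as np

# Target f = cos and its first three derivatives, known in closed form.
f   = np.cos
df  = lambda x: -np.sin(x)
d2f = lambda x: -np.cos(x)
d3f = lambda x:  np.sin(x)

x0, h = 0.3, 0.1

# Second-order (quadratic) Taylor model of f around x0, evaluated at x0 + h.
quad = f(x0) + df(x0) * h + 0.5 * d2f(x0) * h**2

actual_err  = abs(f(x0 + h) - quad)     # true model error
cubic_term  = abs(d3f(x0)) * h**3 / 6   # leading (third-order) error term
```

For small $h$ the true error is on the order of the cubic term, which is what would let a line search pick $h$ adaptively.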

3

You often use a quadratic approximation of your target function and minimize that. That is, you minimize $f(x_k) + \nabla f(x_k)^T \Delta x + \frac{1}{2}\Delta x^T H(x_k)\Delta x$. This is easily expressed in matrix calculus, and setting its gradient to zero gives the direction $\Delta x = -H(x_k)^{-1}\nabla f(x_k)$. Of course you could use a third-degree polynomial to approximate your target function, but the third derivative is a third-order tensor, which is not as easy to write with matrices.

In addition, we can solve linear systems very efficiently, so the direction $\Delta x$ comes from one linear solve per step rather than an explicit matrix inverse.
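A sketch of that linear solve for one Newton step (the quadratic test problem is my own illustrative choice; assumes $H$ is symmetric positive definite):

```python
import numpy as np

def newton_step(grad, hess):
    """Solve H dx = -grad for the step, instead of forming H^{-1} explicitly."""
    return np.linalg.solve(hess, -grad)

# For the quadratic f(x) = 1/2 x^T H x - b^T x, the gradient is H x - b,
# so a single Newton step from the origin lands exactly on the minimizer.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.zeros(2)
x = x + newton_step(grad=H @ x - b, hess=H)  # x is now the minimizer of f
```

On a non-quadratic target the same step is applied repeatedly, re-evaluating the gradient and Hessian at each iterate.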