6
$\begingroup$

I am writing a program minimizing a real-valued non-linear function of around 90 real variables subject to around 30 non-linear constraints. I found handy explanation in CERN's Data Analysis BriefBook. I've implemented it and it works, but I am not able to derive how they obtained the equations at the bottom. Could anyone please explain how it can be achieved?

Copied from the link:

Trying to minimize $f(x_1,...,x_n)$ subject to $c_1(x_1,...,x_n) = ... = c_m(x_1,...,x_n) = 0$.

Reformulate as $$\partial F/\partial x_1 = \dots = \partial F/\partial x_n = \partial F/\partial\lambda_1 = \dots = \partial F/\partial\lambda_m = 0$$ for $$ F(x_1,\dots,x_n,\lambda_1,\dots,\lambda_m) = f(x_1,\dots,x_n) - \lambda^T c(x_1,\dots,x_n) $$

Using Lagrange multipliers $\lambda$ and Newton-Raphson, they arrive at (1): $$ A\Delta x - B^T\lambda = -a\\ B\Delta x = -c, $$ where $A$ is Hessian of $f$, $a = (\nabla f)^T$ is gradient of $f$ and $B=\nabla c$ is Jacobian of the constraints.

I can't seem to follow them. The way I understand it, they're applying Newton-Raphson to solve $\nabla F = 0$. I believe that $$ \nabla F = \left(a - B^T \lambda \atop -c \right) $$ Then, we need to take derivative of $\nabla F$ for Newton-Raphson. First off, it seems to me they're only taking the derivative w.r.t. $x$ and not to $\lambda$. Why is that?

Even so, this would lead to (1) if the derivative of $\nabla F$ was $\left(A \atop -B\right)$. However, while it's true that $\nabla a = A$ and $\nabla c = B$, I can't understand where the term $-B^T\lambda$ vanished to.

Many thanks to anyone who could shed some light on this.

2 Answers 2

3

I also was interested in a proof of the Newton-Raphson+Lagrange multiplier result from the CERN page. Here is my derivation:

enter image description here

enter image description here

  • 1
    See the [editing help](http://math.stackexchange.com/editing-help#latex) on how to type in answers. Images should be avoided for formulas.2013-12-18
  • 0
    +1. Thank you very much, I'm finally beginning to understand it. Could you please clarify why it's safe to neglect the second derivative of the constraints (which happens twice in your derivation, if I understand correctly)?2013-12-20
2

Check Sequential Quadratic Programming in any good optimization book. When there are only equality constraints, it really boils down to applying Newton's method to the system you gave. With $F(x,\lambda) := f(x) - \lambda^T c(x)$ (the Lagrangian), differentiating yields $$ \nabla F(x,\lambda) = \begin{bmatrix} \nabla f(x) - J(x)^T \lambda \\ - c(x) \end{bmatrix}, $$ where $$ J(x) = \begin{bmatrix} \nabla c_1(x)^T \\ \vdots \\ \nabla c_m(x)^T \end{bmatrix} $$ is the Jacobian of $c$ at $x$. Applying Newton's method to $\nabla F(x,\lambda) = 0$ requires that we differentiate one more time: $$ \begin{bmatrix} H(x,\lambda) & J(x)^T \\ J(x) & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ -\Delta \lambda \end{bmatrix} = - \begin{bmatrix} \nabla f(x) - J(x)^T \lambda \\ - c(x) \end{bmatrix}. $$ There are all sorts of difficulties that come in, ranging from the need to perform a linesearch for a merit function, to avoiding the Maratos effect. A good book such as Numerical Optimization (Nocedal & Wright, Springer) will explain everything.