
I am trying to solve the problem below using sequential quadratic programming (Newton's approach). The result is $x_{1}=5$, $x_{2}=5$, which cannot be the optimal point, since substituting $x_{1}=5$, $x_{2}=0$ gives a larger objective value than the one returned by the technique. Is this some kind of limitation of SQP, or is there an error on my end? I want to maximize the sum $\log_{2}\left(1+\dfrac{x_1}{x_2+0.1}\right)+\log_{2}\left(1+\dfrac{x_2}{x_1+0.1}\right)$, but the technique seems to be maximizing each term individually. \begin{align} \max_{x_1,x_2}\quad &\log_{2}\left(1+\dfrac{x_1}{x_2+0.1}\right)+\log_{2}\left(1+\dfrac{x_2}{x_1+0.1}\right)\\ \text{s.t.}\quad &x_1\geq0,\; x_2\geq0,\\ &x_1\leq5,\; x_2\leq5. \end{align} I'm not a mathematician and am new to the topic of SQP, so kindly guide me even if you don't know the exact answer. Anything you say might lead me in the right direction. Thank you.

1 Answer


Main issues with Newton's approach

Newton's approach only guarantees local convergence.

This method can diverge in a fair number of situations if no precaution is taken. In general, we only use it when the function to optimize is sufficiently well behaved, at least around the optimal points, and when the starting point is not too far from the actual optimum. It can be shown that if the optimal point is $x^*$, then the initial point $x_0$ has to be chosen such that

$$ \left|{\frac {f''(x_{0})}{f'(x_{0})}}\right|< 2\left|{\frac {f''(x^* )}{f'(x^* )}}\right| .$$

From the iteration $$x_{n+1}=x_{n}-{\frac {f(x_{n})}{f'(x_{n})}}$$ (stated here for root finding; in optimization one applies it to $f'$, giving $x_{n+1}=x_{n}-f'(x_{n})/f''(x_{n})$), we can also see that whenever the derivative in the denominator approaches zero, the numerical stability of the method suffers. The same holds if the second derivative (or Hessian matrix) is poorly conditioned.
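As a minimal 1-D illustration (the function $f(x)=\sqrt{1+x^2}$ is my own example, not taken from the question), applying the Newton iteration to the optimality condition $f'(x)=0$ shows how strongly the behavior depends on the starting point, even for a smooth convex function:

```python
import math

def newton_1d(df, d2f, x0, iters=20):
    """Newton iteration on f'(x) = 0: x <- x - f'(x)/f''(x)."""
    x = x0
    for _ in range(iters):
        x = x - df(x) / d2f(x)
    return x

# f(x) = sqrt(1 + x^2) has its minimum at x* = 0; here the Newton step
# simplifies to x_{n+1} = -x_n^3, so the method only converges for |x0| < 1.
df  = lambda x: x / math.sqrt(1 + x * x)   # f'(x)
d2f = lambda x: (1 + x * x) ** (-1.5)      # f''(x)

x_near = newton_1d(df, d2f, 0.5)           # converges to 0 very fast
x_far  = newton_1d(df, d2f, 1.1, iters=5)  # diverges: |x_n| grows without bound
```

Starting at $x_0 = 0.5$ the iterates collapse onto the minimizer, while from $x_0 = 1.1$ they blow up.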

Common alternatives for global convergence

Global convergence can be achieved by locally ensuring that the iterates keep the "right properties".

For non-linear functions like yours, one usually uses trust-region algorithms instead: the objective function is approximated by a descriptive model (typically a quadratic model derived from its Taylor expansion), and this construction guarantees global convergence.

The most commonly used ways to compute the descent direction of the next iterate are either

  • easy to implement and quick to compute: Cauchy's step (namely, minimizing the model along the steepest-descent direction within the trust region);
  • trickier and more costly: the Moré-Sorensen approach, which takes into account the curvature of the model function at the current iterate and is therefore more precise, often yielding the optimal step. This method uses a diagonal (eigenvalue) decomposition of the Hessian matrix, together with a Newton iteration for finding the zero of a nonlinear scalar function. The so-called hard case of this method is particularly tricky to handle.

The internet is full of good resources about these methods.

Last but not least, these methods only solve unconstrained problems on their own. They are incorporated into more general schemes, such as the augmented Lagrangian method, to solve constrained problems.


Solving a constrained problem using Cauchy's step method

1. Guaranteeing the global convergence of the algorithm

As mentioned above, the main problem with Newton's approach is that it can diverge in many fairly common situations. To solve this issue, we proceed by locally ensuring that we are still making progress. A good way to do so is to approximate the objective function by a model that behaves as we want around the current iterate. "Around" here means within a region of some radius $\Delta_k$ that we will have to compute. Let us take as our model, for instance, the second-order Taylor expansion of $f$ at the current iterate $x_k$:

$$ m_k(x_k + v) = q_k(v) = f(x_k) + \nabla f(x_k)^T v + \frac{1}{2} v^T \nabla^2 f(x_k) v.$$

As you can see, each iterate yields a new quadratic function, which we know behaves very well in a neighborhood of $x_k$.

Our main goal in this part is to determine a trust region for this model, or in other words the radius $\Delta_k$ of the region where we estimate that the model describes the objective function in an acceptable way. For this, let us define the approximation ratio

$$\rho_k = \frac{f(x_k) - f(x_k + v_k)}{m_k(x_k) - m_k(x_k + v_k)}.$$

Thanks to this ratio, we can evaluate the quality of the model and adapt the trust region accordingly:

  • if $\rho_k$ is close to $1$ (say $\rho_k \geq \eta_2$ for some threshold $\eta_2$ close to $1$), our model is close to $f$ and we can try to make bigger steps by widening the trust region (in practice we just dilate $\Delta_k$ by some factor $\gamma_2 > 1$);
  • if $\rho_k \in [\eta_1, \eta_2)$, we consider the approximation good enough at the current radius $\Delta_k$, but it would no longer be good if the trust region were wider. Thus we leave $\Delta_k$ unchanged in this case;
  • if $\rho_k < \eta_1$, the model is not close enough to $f$, so we shrink the trust region by some factor $\gamma_1 < 1$.

By defining $\Delta_{max} > 0$, $0 < \gamma_1 < 1 < \gamma_2$ and $0 < \eta_1 < \eta_2 < 1$, we can state the decision rule more rigorously as follows.

$$ \Delta_{k+1} = \begin{cases} \min \{\gamma_2 \Delta_k, \Delta_{max}\} & \text{if} & \rho_k \geq \eta_2 \\ \Delta_k & \text{if} & \rho_k \in [\eta_1, \eta_2] \\ \gamma_1 \Delta_k & \text{otherwise.} \end{cases}$$

We should step forward if and only if the model is close to the function at the current iterate, otherwise we should only shrink the trust region and not move before we reach a satisfying $\rho_k$. The decision rule for the next iterate is thus:

$$ x_{k+1} = \begin{cases} x_k + v_k & \text{if} & \rho_k \geq \eta_1 \\ x_k & \text{otherwise.}\end{cases}$$
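Both decision rules can be condensed into one small helper (a minimal sketch; the default values for $\eta_1, \eta_2, \gamma_1, \gamma_2, \Delta_{max}$ below are illustrative choices of mine, not prescribed by the method):

```python
def trust_region_update(rho, delta, eta1=0.1, eta2=0.75,
                        gamma1=0.5, gamma2=2.0, delta_max=10.0):
    """Return (Delta_{k+1}, whether to accept the step x_k + v_k)."""
    if rho >= eta2:
        delta = min(gamma2 * delta, delta_max)  # model very accurate: widen
    elif rho < eta1:
        delta = gamma1 * delta                  # model poor: shrink
    # for eta1 <= rho < eta2 the radius is left unchanged
    return delta, rho >= eta1                   # accept the step iff rho_k >= eta1

# e.g. trust_region_update(0.9, 1.0)  -> (2.0, True):  widen and accept;
#      trust_region_update(0.01, 1.0) -> (0.5, False): shrink, stay at x_k.
```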

2. Determining the next iterate

Now that we have decision rules ensuring that we keep converging no matter where we started, we need to determine the next iterate, which means computing $v_k$. There exist many methods to do this, but we will proceed with the Cauchy step method. It is an approximation method (and as such does not always provide the optimal step) which consists in restricting the search for the minimum of the model to the direction of the gradient. One can prove that, while not optimal, the decrease obtained this way is always sufficient.

The problem to solve is

$$ \begin{cases} \min & q_k(v) \\ \text{s.t.} & v = -t \cdot \nabla f(x_k) \\ & t > 0 \\ & \|v\| \leq \Delta_k.\end{cases}$$

Let $g = \nabla f(x_k)$ and $H = \nabla^2 f(x_k)$. This is equivalent to minimizing $\varphi(t) = q_k(-tg) = \frac{1}{2} t^2 g^T H g - t g^T g$, a polynomial of degree 2 in $t$. Its curvature is $c = g^T H g$. The maximal step we can take along the gradient without leaving the trust region is $t_{max} = \Delta_k / \|g\|$. Two cases are possible:

  • if $c \leq 0$, take $v = -t_{max} \cdot g$;
  • if $c > 0$, compute $t = \|g\|^2 / c$. If $t > t_{max}$, then $v = -t_{max} \cdot g$ as well (one must stay within the trust region); otherwise $v = -t \cdot g$ (the minimum is reached strictly inside the trust region).
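The two cases translate directly into code (a sketch assuming NumPy, with `g`, `H` and `delta` standing for $\nabla f(x_k)$, $\nabla^2 f(x_k)$ and $\Delta_k$):

```python
import numpy as np

def cauchy_step(g, H, delta):
    """Minimize q(v) = g.v + 0.5 v.H.v along v = -t*g, t > 0, ||v|| <= delta."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)         # gradient is zero: no step to take
    t_max = delta / gnorm               # largest feasible step length
    c = g @ H @ g                       # curvature of phi(t) along -g
    if c <= 0:
        t = t_max                       # non-positive curvature: go to the boundary
    else:
        t = min(gnorm ** 2 / c, t_max)  # interior minimizer, clipped to the region
    return -t * g

# Positive curvature, but the unconstrained minimizer lies outside the region:
v = cauchy_step(np.array([1.0, 0.0]), np.eye(2), 0.5)  # -> [-0.5, 0.0]
```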

Congratulations: you are now able to solve any unconstrained problem, provided you can compute the gradient and Hessian of the model you have chosen, without having to worry about the choice of the initial point $x_0$ (even though choosing it smartly would make your algorithm quicker).

3. Extending to constrained problems (with equalities only)

We will use the augmented Lagrangian method, which has the main advantage of being very numerically stable. It is derived from the so-called penalty methods. As you probably already know, the key idea behind the Lagrangian formalism is to transform constrained problems into unconstrained ones. In our case that would be sweet, because we could then use our brand new trust-region algorithm!

To achieve this, one can show that it suffices to iteratively solve the unconstrained problem

$$ \min_{x \in \mathbb{R}^n} L (x, \lambda_k, \mu_k) = f(x) + \lambda_k^T c(x) + \frac{\mu_k}{2} \| c(x) \|^2, \tag{P} $$

where $c$ gathers the equality constraints, stopping each inner solve when $\| \nabla_x L(\cdot, \lambda_k, \mu_k) \| \leq \varepsilon$ for some $\varepsilon > 0$.

If the algorithm converged (that is to say, if the constraint violation $\|c(x_k)\|$ approached zero), then $x_k$ is an approximate solution of the constrained problem.

I won't get into the theoretical details here, as you can find them elsewhere. Numerical analysis of the method has shown that setting $\alpha = 0.1$, $\beta = 0.9$, $\mu_0 > 0$, $\epsilon_0 = 1/\mu_0$, $\tau > 0$ and $\hat{\eta}_0 = 0.1258925$, so that $\eta_0 = \hat{\eta}_0 / \mu_0^{\alpha} = 0.1$, leads to the best results. Using those parameters, the algorithm is:

  1. Solve the unconstrained problem $(P)$ using a local algorithm (like Cauchy's step with trust region) with $x_k$ as the initial point (and whatever $x_0$ you like the first time), stopping when $\| \nabla_x L(\cdot, \lambda_k, \mu_k) \| \leq \epsilon_k$. If the constraint violation $\|c(x_{k+1})\|$ is close to zero as well, stop the whole algorithm: $x_{k+1}$ is an approximate solution of the constrained problem. Otherwise, go to step 2.
  2. If $\|c(x_{k+1})\| \leq \eta_k$, update the multipliers:

$$ \begin{cases} \lambda_{k+1} & = & \lambda_k + \mu_k \cdot c(x_{k+1}) \\ \mu_{k+1} & = & \mu_k \\ \varepsilon_{k+1} & = & \varepsilon_k / \mu_k \\ \eta_{k+1} & = & \eta_k / \mu_k^{\beta}.\end{cases}$$

Otherwise, update the penalty parameters:

$$ \begin{cases} \lambda_{k+1} & = & \lambda_k \\ \mu_{k+1} & = & \tau\mu_k \\ \varepsilon_{k+1} & = & \varepsilon_0 / \mu_{k+1} \\ \eta_{k+1} & = & \hat{\eta}_0 / \mu_{k+1}^{\alpha}.\end{cases}$$

This method only works for equality constraints.
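The update rules of step 2 can be sketched in plain Python (a sketch; the defaults reproduce the constants quoted above, with illustrative choices $\mu_0 = 1$, so $\epsilon_0 = 1$, and $\tau = 10$):

```python
def al_update(lmbda, mu, eps, eta, c_val,
              eps0=1.0, eta_hat0=0.1258925, alpha=0.1, beta=0.9, tau=10.0):
    """One outer update of the augmented Lagrangian method.

    lmbda: current multipliers; mu: penalty; eps/eta: inner tolerances;
    c_val: the constraint values c(x_{k+1}) after the inner solve.
    """
    c_norm = sum(ci * ci for ci in c_val) ** 0.5
    if c_norm <= eta:
        # feasibility improved enough: update the multipliers
        lmbda = [li + mu * ci for li, ci in zip(lmbda, c_val)]
        eps, eta = eps / mu, eta / mu ** beta
    else:
        # not enough progress on feasibility: increase the penalty
        mu = tau * mu
        eps, eta = eps0 / mu, eta_hat0 / mu ** alpha
    return lmbda, mu, eps, eta
```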

4. Extending to problems constrained with inequalities

The common way to solve this kind of problem is to introduce slack variables $z$ and define the square component-wise: $$z^2 := \begin{pmatrix}z_1^2 \\ \vdots \\ z_n^2 \end{pmatrix}.$$

This allows you to rewrite

$$ \begin{cases} \min & f(x) \\ s.t. & h(x) = 0 \\ & g(x) \leq 0 \end{cases} \tag{P'} $$

as

$$ \begin{cases} \min & f(x) \\ s.t. & h(x) = 0 \\ & g(x) + z^2 = 0 \end{cases} \tag{P''} $$

and solve $(P'')$ as a problem constrained with equalities.
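For instance, the box-constrained problem from the question (turned into a minimization) becomes, with one slack variable per inequality,

$$ \begin{cases} \min\limits_{x,\, z} & -\log_{2}\left(1+\dfrac{x_1}{x_2+0.1}\right)-\log_{2}\left(1+\dfrac{x_2}{x_1+0.1}\right) \\ \text{s.t.} & -x_1 + z_1^2 = 0, \quad -x_2 + z_2^2 = 0, \\ & x_1 - 5 + z_3^2 = 0, \quad x_2 - 5 + z_4^2 = 0, \end{cases} $$

which has equality constraints only, so the augmented Lagrangian scheme applies.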


Going further: Newton's method is not always bad

Very often, we actually have to deal with "good-looking" functions. In such cases, the main issue to overcome in practice is rather the dimension of the problem.

In Newton's method, one uses $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$. In trust regions, one minimizes $ f(x_k) + \nabla f(x_k)^T v + \frac{1}{2} v^T H_k v$ where $H_k$ approaches $\nabla^2 f(x_k)$.

For today's high-dimensional problems (between $10^6$ and $10^7$ variables), one must find efficient methods to approximate $\nabla^2 f(x_k)$ and to solve $\nabla^2 f(x_k) v = -\nabla f(x_k)$.

Quasi-Newton methods start from two points $x_0, x_1$ and use $\nabla f(x_0), \nabla f(x_1)$ together with an approximation $H_0 = \alpha I$ of $\nabla^2 f(x_0)$. Using the integral form of the Taylor remainder, one has

$$ \begin{align}\nabla f(x_1) - \nabla f(x_0) & = \int_0^1 \nabla^2 f(x_0 + v(x_1 - x_0))\, (x_1 - x_0)\, dv \\ & = \left[ \int_0^1 \nabla^2 f(x_0 + v(x_1 - x_0))\, dv \right] (x_1 - x_0). \end{align}$$

Writing $y = \nabla f(x_1) - \nabla f(x_0)$ and $v = x_1 - x_0$, we then try to solve the optimization problem

$$ \begin{cases} \min \| \Delta H \|_F \\ \Delta H = \Delta H^T \\ (H_0 + \Delta H)v = y. \end{cases} \Longrightarrow H_1 = H_0 + \Delta H_{opt}.$$

One can show that the solution of this problem is given by

$$ \Delta H_{opt} = \frac{(y - H_0 v)v^T + v(y - H_0 v)^T}{v^T v} - \frac{v^T(y - H_0 v)}{(v^Tv)^2}\, vv^T. $$

This is numerically cheap to compute, and we can directly deduce $x_2 = x_1 - \alpha H_1^{-1} \nabla f(x_1)$.
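This rank-two (Powell-symmetric-Broyden type) update is easy to check numerically: the result must be symmetric and satisfy the secant equation $H_1 v = y$. A sketch assuming NumPy, with illustrative values for `H0`, `v` and `y`:

```python
import numpy as np

def psb_update(H, v, y):
    """Symmetric rank-2 update of H so that the result maps v to y."""
    r = y - H @ v                       # residual of the secant equation
    vv = v @ v
    dH = (np.outer(r, v) + np.outer(v, r)) / vv \
         - (v @ r) / vv ** 2 * np.outer(v, v)
    return H + dH

H0 = np.eye(2)                          # initial approximation alpha * I
v = np.array([1.0, 2.0])                # v = x_1 - x_0
y = np.array([3.0, 1.0])                # y = grad f(x_1) - grad f(x_0)
H1 = psb_update(H0, v, y)               # H1 is symmetric and H1 @ v ~ y
```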

The last question to answer is: how do we compute $H_k^{-1}$ efficiently? The best way is to use the Sherman-Morrison formula:

$$(A+uv^{T})^{-1}=A^{-1}-{A^{-1}uv^{T}A^{-1} \over 1+v^{T}A^{-1}u}.$$
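A direct implementation (a sketch assuming NumPy; the matrices below are illustrative) confirms the formula against a freshly computed inverse:

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """(A + u v^T)^{-1} from A^{-1} via a rank-1 correction: O(n^2), no solve."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

A = 2.0 * np.eye(3)
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
B_inv = sherman_morrison(np.linalg.inv(A), u, v)
# matches np.linalg.inv(A + np.outer(u, v)) up to round-off
```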

For further reading, see

  • Thank you. Can you provide a link or an example of using Cauchy's step or the Moré-Sorensen approach? The links I found online are mostly full of theory (which is hard for me to understand). Is there an example where someone solved such a problem using these techniques? (2017-02-12)
  • The technique I am using is the following: let $a$ be the gradient of the second constraint, $H$ the Hessian of the Lagrangian, $A = \begin{bmatrix} H & a^T \\ a & 0 \end{bmatrix}$, $b$ the gradient of the Lagrangian w.r.t. $x_n$, and $c$ the gradient of the Lagrangian w.r.t. the Lagrange multiplier $\lambda$. Then, at each iteration, $[k\; j] = A^{-1}[b; c]$, $x_n^{i+1} = x_n^{i} + \delta\, k$ and $\lambda^{i+1} = \lambda^{i} + \delta\, j$, where $\delta$ is the step size. (2017-02-12)
  • As I said, these approaches are not sufficient to solve a constrained problem like yours. I will edit my post to give you the main ideas of each step of solving such problems (it's pretty long, so it might take some time). (2017-02-12)