
I want an efficient algorithm for solving the following minimization:

$\operatorname*{arg\,min}_{u} \sum_i \left(\max(0,\, u \cdot x_i) - r_i\right)^2$

Here $u$ and $x_i$ are vectors and $r_i$ is a scalar. The dimensionality of the vectors is modest, typically around 10. The index $i$ runs over a set whose size could be in the millions.

I can solve this by using the convex-concave procedure, reducing it to the following subproblem:

$\operatorname*{arg\,min}_{u} \sum_i \theta(u \cdot x_i)\left[(u \cdot x_i)^2 - r_i (u \cdot x_i)\right]$

where $\theta(y)$ is the Heaviside step function: 1 if $y$ is positive and 0 otherwise.

Note that if the $\theta$ function weren't there, the objective would just be a sum of quadratics, which has a well-known analytic solution. With the $\theta$, each term behaves like a quadratic on one side of a hyperplane but is identically zero on the other: once a term hits zero, it stays zero. Is there a similar analytic solution in this case?
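For reference, without the $\theta$ factor the minimizer follows directly from the normal equations. A minimal NumPy sketch (the function name is my own):

```python
import numpy as np

# Without the theta factor, the objective is sum_i (u . x_i - r_i)^2:
# ordinary least squares, minimized by solving the normal equations.
def least_squares_u(X, r):
    """X has rows x_i (shape n x d); r holds the targets r_i (shape n)."""
    G = X.T @ X          # Gram matrix: sum_i x_i x_i^T
    b = X.T @ r          # sum_i r_i x_i
    return np.linalg.solve(G, b)
```

The question is whether the one-sided ("clipped") quadratics admit anything comparably direct.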

Note that the function is convex, so any general-purpose convex solver would work, but I'm interested in something more efficient that exploits the structure of these functions, even if it requires some iteration. My intuition says there is a fixed-point algorithm or a divide-and-conquer approach that would work.

  • All of the functions inside the maximum are nonnegative, so the max is irrelevant. (2017-01-16)
  • Sorry for the typo. I have edited it. (2017-01-16)
  • OK, but in fact this is no longer convex unless $r_i < 0$, and I suspect that is not the case for "useful" forms of this problem. (2017-01-17)
  • Sorry again. I fixed the original problem but forgot to add how I reduced it to a convex subproblem. Now it should be right. (2017-01-17)
  • In case people are curious, this is the problem you get when training a one-hidden-layer neural network with ReLU link functions and a mean-square-error objective. I'm trying to use convex optimization and associated methods to replace gradient-based training. (2017-01-17)
  • Can you clarify this last comment? I'm a bit curious. Almost all methods in continuous convex optimization rely on GD as far as I know (including ADMM). Additionally, NNs perform well in the presence of massive amounts of data, for which computing a gradient (let alone anything more complex) is too expensive; that GD allows subsampling (i.e., SGD) and that NNs are amenable to SGD training is precisely why NNs managed to blow a lot of other methods out of the water. (2017-01-20)
  • In ADMM, gradient descent is done on the dual variable, but the subproblems are usually solved exactly, which speeds convergence. For example, the subproblems are often unconstrained quadratic minimizations with analytic solutions. GD can have difficulty with problems that are not differentiable everywhere, e.g. ReLUs. Features like ridges and plateaus cause trouble for methods relying only on local information, even when second derivatives are included. In short, if you know where the minimum of a function occurs, you want to exploit that rather than rely on GD and local information. (2017-01-21)

1 Answer


Here is one possibility I thought of, but I'd be interested if there is an even better one.

You can use the Alternating Direction Method of Multipliers (ADMM) in consensus form. That is, treat the $u$ in each term of the sum as an independent variable, but add the constraint that they are all equal. In each ADMM iteration you solve one subproblem per term: minimize that term plus a spherical (isotropic) quadratic penalty, which has a simple analytic solution. Then you average the local solutions and update a dual variable. The complexity is $O(N_x N_p N_A)$, where these are, respectively, the dimensionality of the vector space, the number of points, and the number of ADMM iterations required to converge within tolerance, often around 100 or so.
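A sketch of what this might look like, applied directly to the terms $(\max(0, u \cdot x_i) - r_i)^2$: each per-term subproblem depends on its local $u_i$ only through the scalar $s = u_i \cdot x_i$, so its minimizer lies on a line along $x_i$ and can be found by comparing the two branches of the max in closed form. Function names and the choice of $\rho$ here are mine, not from the answer.

```python
import numpy as np

def prox_term(v, x, r, rho):
    """argmin_u (max(0, u.x) - r)^2 + (rho/2)||u - v||^2.

    The first term depends on u only through s = u.x, so the minimizer
    has the form u = v + c*x; minimize over the scalar s on each branch
    of the max and keep the better candidate.
    """
    xn2 = x @ x
    s0 = v @ x
    a = rho / xn2
    # Inactive branch (s <= 0): objective is r^2 + (a/2)(s - s0)^2.
    s_neg = min(0.0, s0)
    val_neg = r**2 + 0.5 * a * (s_neg - s0)**2
    # Active branch (s >= 0): (s - r)^2 + (a/2)(s - s0)^2, a convex quadratic.
    s_pos = max(0.0, (2.0 * r + a * s0) / (2.0 + a))
    val_pos = (s_pos - r)**2 + 0.5 * a * (s_pos - s0)**2
    s_star = s_neg if val_neg <= val_pos else s_pos
    return v + ((s_star - s0) / xn2) * x

def consensus_admm(X, r, rho=1.0, iters=300):
    """Consensus-form ADMM (scaled duals): one local copy u_i per term,
    all constrained to equal the consensus variable z."""
    n, d = X.shape
    U = np.zeros((n, d))   # local copies u_i
    Y = np.zeros((n, d))   # scaled dual variables
    z = np.zeros(d)        # consensus variable
    for _ in range(iters):
        for i in range(n):
            U[i] = prox_term(z - Y[i], X[i], r[i], rho)
        z = (U + Y).mean(axis=0)   # averaging step
        Y += U - z                 # dual update
    return z
```

The per-term solve, the averaging, and the dual update match the three steps described above; in practice $\rho$ would need tuning and a stopping test on the primal/dual residuals would replace the fixed iteration count.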

You could accelerate it toward the end by guessing which quadratics will be activated (the active set eventually stops changing) and then using the quadratic minimization formula directly. Check that the activations haven't changed at the resulting solution; if they haven't, you're done, and if they have, you continue with ADMM iterations.
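This acceleration step might look like the following sketch (my own naming; it assumes the current active set spans the space so the normal equations are solvable):

```python
import numpy as np

def active_set_refine(X, r, u0, max_rounds=50):
    """Guess the active set from the current iterate, solve the resulting
    least-squares problem exactly, and accept if the activations agree."""
    u = np.asarray(u0, dtype=float)
    for _ in range(max_rounds):
        active = X @ u > 0
        # On the active set the objective is sum (u.x_i - r_i)^2; inactive
        # terms contribute the constant sum r_i^2, so solve LS on the rest.
        G = X[active].T @ X[active]
        b = X[active].T @ r[active]
        u_new = np.linalg.solve(G, b)
        if np.array_equal(X @ u_new > 0, active):
            return u_new  # activations consistent: a stationary point
        u = u_new
    return u  # guesses kept changing: fall back to ADMM iterations
```

If the activation pattern is guessed correctly, one round suffices; otherwise the consistency check fails and control returns to ADMM.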

I also wondered whether simply making assumptions about the activations and iterating would be a convergent algorithm. It probably is, but I'm not sure, and I don't know its rate of convergence even if it is.