3

Given the group lasso penalty defined as $$g(W)= \lambda \sum_j \|W_j\|_2,$$ is it possible to derive a smooth approximation of this group penalty? I looked at the following paper by Nesterov, but it was not easy to understand:

Smooth minimization of non-smooth functions

I would like to solve this optimization problem very efficiently: \begin{equation} \hat{W}=\operatorname*{arg\,min}_W \frac{1}{2}\|Y-XW\|^2 + \lambda \|W\|_1 + \gamma \sum_{i=1}^{k} \|W_i\|_2, \end{equation} where $W \in \mathbb R^{M \times T}$, $W_i$ is the $i$th row of $W$, and $Y \in \mathbb R^{N \times T}$. I would like to rewrite the original problem as \begin{equation} \hat{W}=\operatorname*{arg\,min}_W g(W) + \lambda \|W\|_1 + h(W), \end{equation} where $g(W) =\frac{1}{2}\|Y-XW\|^2$ and $h(W)$ is a smooth approximation of $\gamma \sum_{i=1}^{k} \|W_i\|_2$. Then $F(W)=g(W)+h(W)$ is a differentiable function, and I can easily compute its gradient. This would let me solve the original problem, $$F(W) + \lambda \|W\|_1,$$ using proximal gradient descent, since the proximal operator of $\|W\|_1$ is known and can be computed very quickly.
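As a minimal sketch of the plan described above (the function names and the choice of $h$ are my own, not from any reference): take $h(W) = \gamma\sum_i \sqrt{\|W_i\|_2^2+\epsilon}$ as the smooth surrogate, handle $g+h$ by a gradient step, and handle $\lambda\|W\|_1$ by its prox (entrywise soft-thresholding).

```python
import numpy as np

def soft_threshold(W, t):
    # Proximal operator of t*||W||_1: entrywise soft-thresholding.
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def smoothed_group_grad(W, eps):
    # Gradient of sum_i sqrt(||W_i||^2 + eps): row i maps to W_i / sqrt(||W_i||^2 + eps).
    row_norms = np.sqrt(np.sum(W**2, axis=1, keepdims=True) + eps)
    return W / row_norms

def prox_grad_sparse_group_lasso(X, Y, lam, gam, eps=1e-6, n_iter=500):
    # Proximal gradient descent on
    #   0.5||Y - XW||_F^2 + gam*sum_i sqrt(||W_i||^2 + eps) + lam*||W||_1,
    # gradient step on the smooth part, prox step on the l1 part.
    M, T = X.shape[1], Y.shape[1]
    W = np.zeros((M, T))
    # Lipschitz bound of the smooth part: ||X||_2^2 + gam / sqrt(eps).
    L = np.linalg.norm(X, 2)**2 + gam / np.sqrt(eps)
    step = 1.0 / L
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y) + gam * smoothed_group_grad(W, eps)
        W = soft_threshold(W - step * grad, step * lam)
    return W
```

Since the step size respects the Lipschitz bound, each iteration is guaranteed not to increase the (smoothed) objective.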

  • 1
    Summation is smooth and multiplication by $\lambda$ is smooth, so you only need a smooth approximation for the $2$-norm $\|W_j\|_2 = \sqrt{\smash[b]{\langle W_j,W_j\rangle}}$. How about $\sqrt{\smash[b]{\langle W_j,W_j\rangle + \epsilon}}$ for some small number $\epsilon$?2017-01-08
  • 0
    What do you need a smooth approximation to $g$ for?2017-01-08
  • 0
    I guess the OP means a smooth function $\tilde{g}$ on same domain, for which the approximation error $g-\tilde{g}$ can be controlled well.2017-01-08
  • 0
    @dohmatob that's exactly what I mean.2017-01-08
  • 0
    This might be a tangential comment, but: there are other, arguably simpler ways to solve group lasso problems. For example, you could just use FISTA. Are you sure you want to use the approach in "smooth minimization of non-smooth functions"?2017-01-08
  • 0
    @littleO, I see. I updated my OP. Do you have any idea on it ?2017-01-08
  • 2
    What you really want to do, as I suspect you know, is to derive a single proximal operator for both penalties. I'm working on an answer for that in your other question.2017-01-08
  • 0
@MichaelGrant, thanks. Actually I tried to derive a single proximal operator for both penalties, but unfortunately I got stuck. I would appreciate your effort on it.2017-01-08
  • 0
    With that single operator, @littleO's idea of using FISTA or any other proximal gradient method becomes feasible. No need to reinvent the wheel.2017-01-08
  • 0
@MichaelGrant, I agree. Looking forward to seeing your derivation of the single prox operator :)2017-01-08
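The $\epsilon$-smoothing suggested in the first comment is easy to sketch (names here are hypothetical): replace $\|w\|_2$ by $\sqrt{\langle w,w\rangle + \epsilon}$, which is differentiable everywhere, with uniform approximation error at most $\sqrt{\epsilon}$, since $0 \le \sqrt{t+\epsilon}-\sqrt{t} \le \sqrt{\epsilon}$ for all $t \ge 0$.

```python
import numpy as np

def smooth_norm(w, eps):
    # eps-smoothed 2-norm: sqrt(<w, w> + eps), differentiable everywhere.
    return np.sqrt(w @ w + eps)

def smooth_norm_grad(w, eps):
    # Gradient w / sqrt(<w, w> + eps); well defined even at w = 0,
    # unlike the gradient of the exact 2-norm.
    return w / np.sqrt(w @ w + eps)
```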

2 Answers

3

For a compact convex subset $K$ of a finite-dimensional Hilbert space $X$ and a bounded linear operator $A: Z \rightarrow X$, consider the composite function $f =\sigma_K \circ A$, where $\sigma_K$ is the support function of $K$, defined by $$ \sigma_K(z) := \sup_{y \in K}\langle z, y\rangle, \quad \forall z \in X. $$

Now, Nesterov introduced a class of smooth approximations defined by

$$ f_\mu(x) := \sup_{y \in K}\left\{\langle Ax, y\rangle - \frac{1}{2}\mu\|y\|_2^2\right\}, $$ for a smoothing parameter $\mu > 0$.

It's not too hard to see that

  • $f_\mu$ is convex (it is a pointwise supremum of functions affine in $x$).
  • $|f_\mu - f| = \mathcal O(\mu) \rightarrow 0$ uniformly as $\mu \rightarrow 0^+$.
  • Using Danskin's theorem, one can show that $f_\mu$ is smooth with $(\|A\|^2/\mu)$-Lipschitz gradient $\nabla f_\mu(x) = A^T\hat{y}_\mu(x)$, where $$\hat{y}_\mu(x) := \operatorname*{argmax}_{y \in K}\left\{\langle Ax,y\rangle - \frac{1}{2}\mu\|y\|_2^2\right\} = \text{Proj}_K\left(\frac{1}{\mu}Ax\right), $$ and $\text{Proj}_K$ is the Euclidean projection operator onto $K$.

Thus, if $K$ is "simple" (in the sense that $\text{Proj}_{K}$ can be applied cheaply), you can imagine a very simple and intuitive algorithm for minimizing $f$: minimize $f_\mu$, reduce $\mu$, minimize $f_\mu$, reduce $\mu$, and so on, until some desired tolerance is reached.
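Here is a minimal sketch of that continuation loop on a toy instance of my own choosing: minimize $\frac{1}{2}\|x-b\|^2 + \|x\|_1$, where $\|x\|_1 = \sup_{\|y\|_\infty \le 1}\langle x,y\rangle$, so $K$ is the $\ell_\infty$ unit ball and $\text{Proj}_K$ is entrywise clipping. All names are hypothetical.

```python
import numpy as np

def grad_smoothed_obj(x, b, mu):
    # Gradient of 0.5||x - b||^2 + f_mu(x), where f_mu is the Nesterov
    # smoothing of ||x||_1: grad f_mu(x) = Proj_K(x/mu), K = l_inf unit ball.
    return (x - b) + np.clip(x / mu, -1.0, 1.0)

def continuation_minimize(b, mus=(1.0, 0.1, 0.01, 0.001), inner=2000):
    # The loop from the text: minimize f_mu by gradient descent with step
    # 1/L (L = 1 + 1/mu), then shrink mu and warm-start from the result.
    x = np.zeros_like(b)
    for mu in mus:
        step = 1.0 / (1.0 + 1.0 / mu)
        for _ in range(inner):
            x = x - step * grad_smoothed_obj(x, b, mu)
    return x
```

As $\mu \rightarrow 0^+$ the iterates approach the minimizer of the non-smooth objective, which for this toy problem is the soft-thresholding of $b$ at level $1$.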

This technique is called Nesterov's smoothing method for functions of "max type", and it is used in areas as diverse as game theory (computing Nash equilibria of incomplete-information sequential zero-sum games) and image processing (total-variation minimization, etc.)

Returning to the original question: Group Lasso

In the case of the group lasso, you can take $X = \mathbb R^{p \times g}$, $K := \{z \in X \mid \|z\|_{2,\infty} \le 1\}$ (the unit ball of the $\ell_{2,\infty}$ mixed norm, i.e., every row of $z$ has $\ell_2$-norm at most $1$; this is the dual ball of the $\ell_{2,1}$ norm), and $A = $ the identity map, so that $\sigma_K(W) = \|W\|_{2,1} = \sum_i \|W_i\|_2$.
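For this choice, $\text{Proj}_K$ just clips each row to the unit $\ell_2$ ball, so the smoothed gradient from the projection formula above is cheap. A minimal sketch (hypothetical names, $A$ taken as the identity):

```python
import numpy as np

def proj_rowwise_l2_ball(Z):
    # Project each row of Z onto the unit l2 ball: rows with norm > 1 are
    # rescaled to unit norm, shorter rows are left unchanged. This is
    # Proj_K for K = {z : every row of z has l2-norm <= 1}.
    norms = np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1.0)
    return Z / norms

def smoothed_group_lasso_grad(W, mu):
    # grad f_mu(W) = Proj_K(W / mu), per the projection formula above.
    return proj_rowwise_l2_ball(W / mu)
```

For small $\mu$, every nonzero row of the gradient is the unit vector $W_i/\|W_i\|_2$, i.e., a subgradient of $\sum_i \|W_i\|_2$, as expected.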

I hope this helps!

  • 0
thanks. Would you please help me adapt this to the OP?2017-01-08
  • 0
Done: I've adapted it to your question, which is a special instance of what my post was solving :)2017-01-08
  • 0
@dohmatob, thanks, that helps a lot. Actually, I would like to apply this to solving the following optimization problem: \begin{equation} \hat{W}=\operatorname*{arg\,min}_W \frac{1}{2}\|Y-XW\|^2 + \lambda \|W\|_1 + \gamma \sum_{i=1}^{k} \|W_i\|_2, \end{equation} where $W \in \mathbb R^{m \times T}$ and $W_i$ is the $i$th row of $W$. Do you think it can help? I edited my OP.2017-01-08
2

Any convex (and lower semicontinuous) function $f$ has a smooth convex approximation called the Moreau-Yosida regularization of $f$ (with parameter $\mu > 0$), defined by \begin{align} f^{(\mu)}(x) &= \min_u \, f(u) + \frac{1}{2\mu} \| u - x \|_2^2. \end{align} (This looks like the definition of the prox-operator of $f$, but with "min" instead of "argmin".)

Note that $$ f^{(\mu)}(x) = f(\operatorname{prox}_{\mu f}(x)) + \frac{1}{2\mu} \| \operatorname{prox}_{\mu f}(x) - x \|_2^2. $$ This formula allows us to evaluate $f^{(\mu)}$ by evaluating the prox-operator of $f$. The gradient of $f^{(\mu)}$ can also be expressed in terms of the prox-operator of $f$.
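For the group penalty $f(W) = \sum_i \|W_i\|_2$, the prox is row-wise (block) soft-thresholding, so the envelope and its gradient $\nabla f^{(\mu)}(x) = (x - \operatorname{prox}_{\mu f}(x))/\mu$ are both cheap. A minimal sketch with hypothetical names:

```python
import numpy as np

def prox_group(W, t):
    # Prox of t * sum_i ||W_i||_2: block (row-wise) soft-thresholding,
    # shrinking each row's norm by t (to zero if the norm is below t).
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-15), 0.0)
    return scale * W

def moreau_envelope(W, mu):
    # f^{(mu)}(W) = f(prox_{mu f}(W)) + (1/(2 mu)) ||prox_{mu f}(W) - W||^2,
    # evaluated via the prox, exactly as in the formula above.
    P = prox_group(W, mu)
    return np.sum(np.linalg.norm(P, axis=1)) + np.sum((P - W)**2) / (2 * mu)

def moreau_grad(W, mu):
    # Gradient of the envelope: (W - prox_{mu f}(W)) / mu.
    return (W - prox_group(W, mu)) / mu
```

For a row with $\|W_i\|_2 > \mu$ the envelope equals $\|W_i\|_2 - \mu/2$, so it underestimates $f$ by at most $\mu/2$ per row.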

However, the group-sparsity penalty will no longer promote group sparsity effectively once you replace it with a smooth approximation, so I'd think carefully about how the smoothed version is going to be used.

  • 0
    Why do you think this will not promote group sparsity anymore ?2017-01-08
  • 0
Because sparsity is introduced by kinks (points of non-differentiability), which are, by construction, non-smooth objects.2017-01-08
  • 0
    If you use this smoothing technique in conjunction with Nesterov's iterative smoothing strategy, then maybe there is no problem because you will be ultimately minimizing the non-smooth objective function.2017-01-08