
I have the function \begin{equation} f(W)=\gamma \|W\|_2 \end{equation} What is the proximal operator of this function?

  • Is this related to the other question you asked? Because if it is, it's irrelevant to this one, for two reasons. First, for the prox operator, you should be using the Frobenius norm, not the matrix 2-norm. Second, because you're using the elementwise 1-norm, you can compute the prox of each $w_j$ separately (i.e., $\|w_j\|_1+\|w_j\|_2$). (2017-01-05)
  • Of course, it's just the sum of the squares of the elements. (2017-01-05)
  • For the latter question: because that's the definition of that norm. For your first question, technically you can use whatever norm you want, which means you can pick the one that makes things easier to compute. (2017-01-05)
  • Slight correction: the *square* of the Frobenius norm is the sum of the squares of the elements. (2017-01-05)
  • Technically you can use any Bregman distance in place of the squared norm. (2017-01-05)
  • There is not just _one_ proximal operator. The exact form of the prox depends on the distance function you choose. The matrix 2-norm will give a different result than the Frobenius norm, for instance. The algorithms that use the prox don't really care which one, so it's best to choose the one that makes the prox the easiest to determine and compute. In this case, that's the Frobenius norm squared. (2017-01-05)
  • Same minimizer. A different path to get there, perhaps, and different performance characteristics, but the same minimizer. (2017-01-05)
  • I'm working on an answer for you. I just couldn't do it this week; I ran out of time, and my paid work was calling! (2017-01-08)

1 Answer


anything for a bounty ;-) (In fact, I was simply running out of time during the work week.)

From previous discussions we know that the author is actually referring to an elementwise 1-norm, not the induced 1-norm. That is, $\|W\|_1\triangleq \sum_{ij}|w_{ij}|$. This means that the objective is separable across columns:

$$f(W) = \sum_{j=1}^T \left( \lambda \|w_j\|_1 + \gamma \|w_j\|_2 + \tfrac{1}{2} \|w_j - u_j\|_2^2 \right)$$

So we can restrict our attention to the vector function
$$g(w) = \lambda \|w\|_1 + \gamma \|w\|_2 + \tfrac{1}{2} \|w - u\|_2^2$$

The optimality conditions are
$$\lambda v_1 + \gamma v_2 + w = u, \quad v_1\in\partial \|w\|_1, \quad v_2\in\partial \|w\|_2$$
where the subdifferentials are
$$\partial \|w\|_1 = \{v \mid \|v\|_\infty \leq 1, ~ \langle v, w \rangle = \|w\|_1\}$$
$$\partial \|w\|_2 = \{v \mid \|v\|_2 \leq 1, ~ \langle v, w \rangle = \|w\|_2\}$$

To solve, let's define the standard soft-thresholding operator
$$\mathop{\textrm{soft}}(x;\lambda)= \begin{cases} x - \lambda & x > \lambda \\ 0 & |x| \leq \lambda \\ x + \lambda & x < -\lambda \end{cases}$$
and extend it to apply elementwise to vectors. Then we choose
$$[v_1]_i = \mathop{\textrm{sign}}(u_i)\min\{|u_i|/\lambda,1\}, \quad i=1,2,\dots,m$$
$$\Longrightarrow \quad u - \lambda v_1 = \mathop{\textrm{soft}}(u;\lambda)$$
This reduces the optimality conditions to
$$\gamma v_2 + w = \mathop{\textrm{soft}}(u;\lambda)$$
Let's call that right-hand term $q$ and consider three cases:

  • $u=0$. $q=0$ as well, so we can choose $v_1=v_2=w=0$.
  • $\|q\|_2 \leq \gamma$. We can choose $v_2=\gamma^{-1} q$, so $w=0$.
  • $\|q\|_2 > \gamma$. We can choose $v_2=q/\|q\|_2$, so $w=(1-\gamma/\|q\|_2) q$.
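The last two cases can be verified numerically against the reduced optimality condition $\gamma v_2 + w = q$ (a minimal sketch; the values of $q$ and $\gamma$ are made up for illustration):

```python
import numpy as np

gamma = 1.5

# Case ||q||_2 <= gamma: w = 0 and v2 = q/gamma satisfy gamma*v2 + w = q,
# and v2 is a valid subgradient of ||.||_2 at 0 since ||v2||_2 <= 1.
q = np.array([0.6, -0.8])                      # ||q||_2 = 1.0 <= gamma
w, v2 = np.zeros_like(q), q / gamma
assert np.allclose(gamma * v2 + w, q)
assert np.linalg.norm(v2) <= 1.0

# Case ||q||_2 > gamma: w = (1 - gamma/||q||_2) q and v2 = q/||q||_2.
q = np.array([3.0, 4.0])                       # ||q||_2 = 5.0 > gamma
w = (1.0 - gamma / np.linalg.norm(q)) * q
v2 = q / np.linalg.norm(q)
assert np.allclose(gamma * v2 + w, q)          # optimality condition holds
```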

If you're familiar with the proximal operator for $\ell_2$ alone, you should recognize that we're doing the exact same operation here. Let's call it "shrink":
$$\mathop{\textrm{shrink}}(q; \gamma) = \begin{cases} 0 & \|q\|_2 \leq \gamma \\ (1-\gamma/\|q\|_2)\, q & \|q\|_2 > \gamma \end{cases}$$
Therefore, we have
$$\mathop{\textrm{arg min}}_w\, g(w) = \mathop{\textrm{shrink}}(\mathop{\textrm{soft}}(u;\lambda); \gamma).$$
That's right: to compute the prox for $\ell_1$ plus $\ell_2$, we simply apply the $\ell_1$ prox first, then the $\ell_2$ prox!
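Here is a minimal sketch of this composition in NumPy (the names `soft`, `shrink`, and `prox_g` are my own; the minimality check at the end is just a sanity test, not a proof):

```python
import numpy as np

def soft(x, lam):
    # Elementwise soft-thresholding: the prox of lam * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def shrink(q, gamma):
    # Block soft-thresholding: the prox of gamma * ||.||_2
    nq = np.linalg.norm(q)
    if nq <= gamma:
        return np.zeros_like(q)
    return (1.0 - gamma / nq) * q

def prox_g(u, lam, gamma):
    # Prox of lam*||.||_1 + gamma*||.||_2: apply the l1 prox, then the l2 prox
    return shrink(soft(u, lam), gamma)

# Sanity check: prox_g(u) should minimize g, so no random perturbation
# of the result should produce a smaller objective value.
lam, gamma = 0.5, 1.0
u = np.array([3.0, -0.5, 1.0])
w = prox_g(u, lam, gamma)

def g(v):
    return lam * np.abs(v).sum() + gamma * np.linalg.norm(v) \
        + 0.5 * np.linalg.norm(v - u) ** 2

rng = np.random.default_rng(1)
assert all(g(w) <= g(w + 0.05 * rng.normal(size=3)) for _ in range(200))
```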

So for the original problem, $$\mathop{\textrm{arg min}}_W\, f(W) = \bar{W} = \begin{bmatrix} \bar{w}_1 & \dots & \bar{w}_T \end{bmatrix}, \quad \bar{w}_j = \mathop{\textrm{shrink}}(\mathop{\textrm{soft}}(u_j;\lambda); \gamma).$$
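Since the objective is separable across columns, the matrix prox is just the vector prox applied to each column $u_j$ independently. A sketch (the function name `prox_columns` is my own):

```python
import numpy as np

def prox_columns(U, lam, gamma):
    # Apply soft-thresholding, then shrinkage, to each column u_j of U
    W = np.empty_like(U)
    for j in range(U.shape[1]):
        # soft(u_j; lam): elementwise l1 prox
        q = np.sign(U[:, j]) * np.maximum(np.abs(U[:, j]) - lam, 0.0)
        # shrink(q; gamma): l2 prox, zeroing the whole column if ||q|| <= gamma
        nq = np.linalg.norm(q)
        W[:, j] = 0.0 if nq <= gamma else (1.0 - gamma / nq) * q
    return W

# A column whose soft-thresholded norm is below gamma is zeroed entirely,
# which is what gives this regularizer its group-sparsity effect.
U = np.array([[3.0, 0.2], [-0.5, -0.1], [1.0, 0.1]])
W = prox_columns(U, lam=0.5, gamma=1.0)
```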