
I want to compute the gradient for the following function:

$$L(\beta) = \sum_{i=1}^n (y_i - \phi(x_i)^T \cdot \beta)^2 + \sum_{j = 1}^k l(\beta_j)$$

where $l(\beta_j) = \begin{cases} \beta_j - \varepsilon/2 & \textbf{if } |\beta_j| \geq \varepsilon\\ \beta_j^2 / (2\varepsilon) & \textbf{if } |\beta_j| < \varepsilon\\ \end{cases}$

Here $y$, $\beta$, and each $\phi(x_i)$ are vectors in $\mathbb{R}^n$, and $\varepsilon$ is just a small number, like $0.00001$.
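For concreteness, here is a minimal NumPy sketch of $l$ applied elementwise, exactly as written above (the function and parameter names are my own, not from the problem):

```python
import numpy as np

def penalty(beta, eps=1e-5):
    """Elementwise penalty l, following the question's definition verbatim.

    Note: the first branch has no absolute value around beta, exactly as
    written above (the answer below suggests |beta_j| may be intended).
    """
    beta = np.asarray(beta, dtype=float)
    return np.where(np.abs(beta) >= eps,
                    beta - eps / 2,        # |beta_j| >= eps branch
                    beta ** 2 / (2 * eps)) # |beta_j| <  eps branch
```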

Computing the gradient of the first term posed no problem:

$$ \frac{\partial}{\partial \beta} \sum_{i=1}^n (y_i - \phi(x_i)^T \cdot \beta)^2 = -2 \sum_{i=1}^n \phi(x_i) \cdot (y_i - \phi(x_i)^T \cdot \beta)$$
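As a sanity check, here is a minimal NumPy sketch (toy data and all names are my own) comparing this gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
Phi = rng.normal(size=(n, n))  # row i is phi(x_i)^T
y = rng.normal(size=n)
beta = rng.normal(size=n)

def loss_ls(b):
    """Least-squares term: sum_i (y_i - phi(x_i)^T b)^2."""
    r = y - Phi @ b
    return r @ r

# Analytic gradient: -2 * sum_i phi(x_i) * (y_i - phi(x_i)^T beta)
grad_analytic = -2 * Phi.T @ (y - Phi @ beta)

# Central finite differences as a check
h = 1e-6
grad_fd = np.array([(loss_ls(beta + h * e) - loss_ls(beta - h * e)) / (2 * h)
                    for e in np.eye(n)])
print(np.allclose(grad_analytic, grad_fd, atol=1e-4))  # True
```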

But I have no idea how to handle the second term, the sum of $l$ applied to each component of the vector $\beta$: it no longer involves the whole vector, only its individual components, and the extra function $l$ confuses me. How can I proceed here?


Or is it as easy as just computing the derivative like this:

$\frac{\partial}{\partial \beta_j} l(\beta_j) = \begin{cases} 1 & \textbf{if } |\beta_j| \geq \varepsilon\\ \beta_j / \varepsilon & \textbf{if } |\beta_j| < \varepsilon\\ \end{cases}$

But then I still see the problem that the gradient of the first term results in a vector, but the second does not.

1 Answer


This function doesn't have a gradient everywhere, since it's discontinuous (and thus not differentiable) whenever one of the $\beta_j$ is $-\epsilon$. Perhaps you omitted some absolute value signs in the definition of $l$?
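A quick numerical illustration of the jump, using the definition exactly as written (the variable names are mine):

```python
eps = 1e-5

# l exactly as defined in the question (no absolute value in the first branch)
def l(b):
    return b - eps / 2 if abs(b) >= eps else b ** 2 / (2 * eps)

print(l(-eps))          # -1.5e-05, i.e. -3*eps/2
print(l(-eps + 1e-12))  # ~5e-06,  i.e. ~eps/2: a jump of size 2*eps
```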

At all points with $\beta_m\ne-\epsilon$ for all $m\le k$, the function has a gradient, and you can indeed determine it using the calculation in the second part of the question. I don't understand your statement that the gradient of the second term doesn't result in a vector; the gradient is by definition a vector. You've calculated the derivatives $\frac{\partial}{\partial\beta_m} l(\beta_m)$; the derivatives $\frac{\partial}{\partial\beta_m} l(\beta_j)$ for $j\ne m$ are trivially zero. The sum in the second term contains a term $l(\beta_m)$ if and only if $m\le k$. Thus the components of the gradient of the second term are

$$ \frac{\partial}{\partial\beta_m}\sum_{j=1}^kl(\beta_j)= \begin{cases} 0&m\gt k\;,\\ 1&m\le k\;,|\beta_m|\ge\epsilon\;,\\ \beta_m/\epsilon&m\le k\;,|\beta_m|\lt\epsilon\;. \end{cases} $$

If this seems slightly confusing, I suggest you first treat the easier case $k=n$, where the sum contains one term for each component of $\beta$, and then think about what happens when $k\lt n$ and some of the terms are missing.
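Putting the pieces together, here is a minimal NumPy sketch of the gradient under this case analysis, valid away from the non-differentiable points (all names are my own):

```python
import numpy as np

def grad_penalty(beta, k, eps=1e-5):
    """Gradient of sum_{j=1}^k l(beta_j); components with m > k are zero."""
    beta = np.asarray(beta, dtype=float)
    g = np.zeros_like(beta)
    head = beta[:k]
    g[:k] = np.where(np.abs(head) >= eps, 1.0, head / eps)
    return g

def grad_L(beta, Phi, y, k, eps=1e-5):
    """Full gradient: least-squares part plus penalty part."""
    return -2 * Phi.T @ (y - Phi @ beta) + grad_penalty(beta, k, eps)
```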

  • Why is $-\frac{3}{2}\varepsilon$ at $\beta_j = -\varepsilon$ a discontinuity? As to the question of whether the second term is a vector: it is the sum over the single components of the vector $\beta$ (with the function $l$ applied to each), and isn't that a scalar in the end? When resolving the sum from $j = 1$ to $k$, the result is a scalar, isn't it? (2012-05-06)
  • @Mahoni: Unfortunately I understand neither of your two questions. In the first one, you're asking why a number is discontinuous, whereas continuity is a property of functions, not of numbers. In case you mean why the function value $-3\epsilon/2$ implies a discontinuity of $l$ at $\beta_j=-\epsilon$: note that the values of $l$ as $\beta_j$ approaches $-\epsilon$ from above tend to $(-\epsilon)^2/(2\epsilon)=\epsilon/2\ne-3\epsilon/2$. On the second question, how is a quantity that contains two indexed quantities ($\beta_m$ and $\beta_j$) a scalar when summed over only one of the two indices? (2012-05-06)
  • The result of $l(\beta_j)$ is $\in \mathbb{R}$, isn't it? If that is true, then the sum of the results is still $\in \mathbb{R}$. (2012-05-06)
  • @Mahoni: Please be more precise. The sum of which results is in $\mathbb R$? In case you mean the sum over $l(\beta_j)$ in the displayed equation in my answer: yes, the value of that sum is in $\mathbb R$. What makes it a component of a vector (namely the gradient) is that the partial derivative preceding it has an index $m$. The gradient is defined to be the vector whose $m$-th component is the derivative of the function with respect to the $m$-th variable, in this case $\beta_m$. (2012-05-06)
  • Thanks for your patience; I think I now understand where most of my confusion originated. Sorry for being so imprecise; I still often struggle to find the right phrasing. (2012-05-06)
  • @Mahoni: You're welcome! And no worries. (2012-05-06)