
Consider a binary classification problem with data $D = \{(x_i,y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0,1\}$, and the following definitions:

$f(x) = x^T \beta$

$p(x) = \sigma(f(x))$ with $\sigma(z) = 1/(1 + e^{-z})$

$L(\beta) = \sum_{i=1}^n \Bigl[ y_i \log p(x_i) + (1 - y_i) \log [1 - p(x_i)] \Bigr]$

where $\beta \in \mathbb{R}^d$ is a parameter vector and $p(x)$ is shorthand for $p(y = 1\ |\ x)$.

The task is to compute the derivative $\frac{\partial}{\partial \beta} L(\beta)$. A hint is to use the fact that $\frac{\partial}{\partial z} \sigma(z) = \sigma(z) (1 - \sigma(z))$.

So here is my approach so far:

\begin{align*} L(\beta) & = \sum_{i=1}^n \Bigl[ y_i \log p(x_i) + (1 - y_i) \log [1 - p(x_i)] \Bigr]\\ \frac{\partial}{\partial \beta} L(\beta) & = \sum_{i=1}^n \Bigl[ \Bigl( \frac{\partial}{\partial \beta} y_i \log p(x_i) \Bigr) + \Bigl( \frac{\partial}{\partial \beta} (1 - y_i) \log [1 - p(x_i)] \Bigr) \Bigr]\\ \end{align*}

\begin{align*} \frac{\partial}{\partial \beta} y_i \log p(x_i) &= (\frac{\partial}{\partial \beta} y_i) \cdot \log p(x_i) + y_i \cdot (\frac{\partial}{\partial \beta} p(x_i))\\ &= 0 \cdot \log p(x_i) + y_i \cdot (\frac{\partial}{\partial \beta} p(x_i))\\ &= y_i \cdot (p(x_i) \cdot (1 - p(x_i))) \end{align*}

\begin{align*} \frac{\partial}{\partial \beta} (1 - y_i) \log [1 - p(x_i)] &= (1 - y_i) \cdot (\frac{\partial}{\partial \beta} \log [1 - p(x_i)])\\ & = (1 - y_i) \cdot \frac{1}{1 - p(x_i)} \cdot p(x_i) \cdot (1 - p(x_i))\\ & = (1 - y_i) \cdot p(x_i) \end{align*}

$\frac{\partial}{\partial \beta} L(\beta) = \sum_{i=1}^n \Bigl[ y_i \cdot (p(x_i) \cdot (1 - p(x_i))) + (1 - y_i) \cdot p(x_i) \Bigr]$

So basically I used the product and chain rule to compute the derivative. I am afraid that my solution is wrong, because Hastie's The Elements of Statistical Learning says on page 120 that the gradient is:

$\sum_{i = 1}^N x_i(y_i - p(x_i;\beta))$

I don't know what could possibly have gone wrong; any advice on this?
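For what it's worth, a quick way to check which expression is correct is to compare against a finite-difference approximation of the gradient. Below is a minimal NumPy sketch on made-up toy data (all names and data here are hypothetical, purely for illustration); the finite-difference gradient agrees with the book's formula:

    import numpy as np

    # Hypothetical toy data, just to sanity-check the gradient numerically.
    rng = np.random.default_rng(0)
    n, d = 50, 3
    X = rng.normal(size=(n, d))                  # rows of X are the x_i
    y = rng.integers(0, 2, size=n).astype(float)
    beta = rng.normal(size=d)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def log_likelihood(b):
        p = sigmoid(X @ b)
        return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    # Gradient as given in Hastie et al.: sum_i x_i (y_i - p(x_i; beta)).
    p = sigmoid(X @ beta)
    grad_book = X.T @ (y - p)

    # Central finite differences, one coordinate at a time.
    eps = 1e-6
    grad_fd = np.array([
        (log_likelihood(beta + eps * e) - log_likelihood(beta - eps * e)) / (2 * eps)
        for e in np.eye(d)
    ])

    print(np.allclose(grad_book, grad_fd, atol=1e-6))   # expected: True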

  • Yes, double-checked it, the basis definition is now correct. (2012-04-28)

3 Answers


So, if $p(x)=\sigma(f(x))$ and $\frac{d}{dz}\sigma(z)=\sigma(z)(1-\sigma(z))$, then

$\frac{d}{dz}p(z) = p(z)(1-p(z)) f'(z) \; .$

This changes everything, and you should arrive at the correct result this time.

In particular,

$\frac{d}{dz}\log p(z) = (1-p(z)) f'(z)$

and

$\frac{d}{dz}\log (1-p(z)) = -p(z) f'(z) \; .$
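Spelling out the last ingredient for the model in the question: since $f(x) = x^T \beta$, the gradient of $f$ with respect to $\beta$ is just $x$,

$\nabla_\beta f(x_i) = \nabla_\beta \bigl( x_i^T \beta \bigr) = x_i \; ,$

so, for instance, $\nabla_\beta \log p(x_i) = (1 - p(x_i))\, x_i$.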

  • Also be careful: your $\beta$ is a vector, and so is $x$, so you should really compute a gradient when you write $\partial/\partial \beta$. (2012-04-28)

The classification problem data can be captured in one matrix and one vector, i.e. $\{X,y\}$.

Then the relevant quantities are the vectors $\eqalign{ f &= X^T\beta \cr p &= \sigma(f) \cr }$ and their differentials and logarithmic differentials $\eqalign{ df &= X^Td\beta \cr dp &= p\circ(1-p)\circ df \cr\cr d\log(p) &= \frac{dp}{p} \,=\, (1-p)\circ df \cr d\log(1-p) &= \frac{-dp}{1-p} \,=\, -p\circ df \cr }$ where $(g\circ h)$ and $\big(\frac{g}{h}\big)$ denote element-wise (aka Hadamard) multiplication and division.

The likelihood function is a scalar which can be written in terms of Frobenius products $\eqalign{ L &= y:\log(p) + (1-y):\log(1-p) \cr }$ whose differential is $\eqalign{ dL &= y:d\log(p) + (1-y):d\log(1-p) \cr &= y:(1-p)\circ df - (1-y):p\circ df \cr &= (y-p):df \cr &= \big(y-p\big):X^Td\beta \cr &= X\,\big(y-p\big):d\beta \cr }$ where the last step uses the Frobenius-product identity $a:Mb = M^Ta:b$. This yields the gradient $\eqalign{ \frac{\partial L}{\partial\beta} &= X\,(y-p) \cr }$
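As a concrete illustration of the final formula, here is a minimal NumPy sketch using this answer's convention that the columns of $X$ are the $x_i$ (the data below are made up, purely for illustration):

    import numpy as np

    # Hypothetical toy data; columns of X are the x_i, so f = X^T beta.
    rng = np.random.default_rng(1)
    d, n = 3, 50
    X = rng.normal(size=(d, n))
    y = rng.integers(0, 2, size=n).astype(float)
    beta = rng.normal(size=d)

    p = 1.0 / (1.0 + np.exp(-(X.T @ beta)))   # p = sigma(X^T beta), element-wise
    grad = X @ (y - p)                        # dL/dbeta = X (y - p), a vector in R^d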

  • Why did the transpose of $X$ become just $X$? How did you remove the transpose by moving $X$ to the front? (2018-01-29)

In your third line, while differentiating you left out the factor $1/p(x_i)$ that comes from the derivative of $\log(p(x_i))$.

\begin{eqnarray} \frac{d}{d\beta}\bigl(y_i \log p(x_i)\bigr) &=& \log p(x_i) \cdot 0 + y_i \cdot \frac{d}{d\beta} \log p(x_i)\\ &=& y_i \cdot \frac{1}{p(x_i)} \cdot \frac{d}{d\beta} p(x_i) \end{eqnarray}

Note that $\frac{d}{d\beta} p(x_i) = p(x_i) \cdot {\bf x_i} \cdot (1-p(x_i))$ and not just $p(x_i) \cdot (1-p(x_i))$.

Also, in your 7th line you missed the $-$ sign that comes from the derivative of $(1-p(x_i))$.
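For completeness, carrying both corrections through recovers the textbook gradient:

\begin{align*} \frac{\partial}{\partial \beta} L(\beta) &= \sum_{i=1}^n \Bigl[ y_i (1 - p(x_i))\, x_i - (1 - y_i)\, p(x_i)\, x_i \Bigr]\\ &= \sum_{i=1}^n \bigl[ y_i - y_i p(x_i) - p(x_i) + y_i p(x_i) \bigr]\, x_i\\ &= \sum_{i=1}^n x_i \bigl( y_i - p(x_i) \bigr) \end{align*}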