Consider a binary classification problem with data $D = \{(x_i,y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0,1\}$. We are given the following definitions:
$f(x) = x^T \beta$
$p(x) = \sigma(f(x))$ with $\sigma(z) = 1/(1 + e^{-z})$
$$L(\beta) = \sum_{i=1}^n \Bigl[ y_i \log p(x_i) + (1 - y_i) \log [1 - p(x_i)] \Bigr]$$
where $\beta \in \mathbb{R}^d$ is a vector and $p(x)$ is shorthand for $p(y = 1 \mid x)$.
The task is to compute the derivative $\frac{\partial}{\partial \beta} L(\beta)$. A hint is to use the fact that $\frac{\partial}{\partial z} \sigma(z) = \sigma(z) (1 - \sigma(z))$.
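The identity in the hint can be sanity-checked numerically. The following is only a minimal sketch in plain Python; the helper names (`sigmoid`, `sigmoid_derivative`, `central_difference`) and the sample points are assumptions made for illustration, not part of the exercise.

```python
import math

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}), as defined in the problem statement
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    # Closed form from the hint: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(func, z, h=1e-6):
    # Symmetric finite-difference approximation of func'(z)
    return (func(z + h) - func(z - h)) / (2.0 * h)

# Compare both derivatives at a few arbitrarily chosen points
max_abs_error = max(
    abs(central_difference(sigmoid, z) - sigmoid_derivative(z))
    for z in [-3.0, -1.0, 0.0, 0.5, 2.0]
)
print(max_abs_error)  # expected to be tiny, limited by floating-point error
```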
So here is my approach so far:
\begin{align*} L(\beta) & = \sum_{i=1}^n \Bigl[ y_i \log p(x_i) + (1 - y_i) \log [1 - p(x_i)] \Bigr]\\ \frac{\partial}{\partial \beta} L(\beta) & = \sum_{i=1}^n \Bigl[ \Bigl( \frac{\partial}{\partial \beta} y_i \log p(x_i) \Bigr) + \Bigl( \frac{\partial}{\partial \beta} (1 - y_i) \log [1 - p(x_i)] \Bigr) \Bigr] \end{align*}
\begin{align*} \frac{\partial}{\partial \beta} y_i \log p(x_i) &= (\frac{\partial}{\partial \beta} y_i) \cdot \log p(x_i) + y_i \cdot (\frac{\partial}{\partial \beta} p(x_i))\\ &= 0 \cdot \log p(x_i) + y_i \cdot (\frac{\partial}{\partial \beta} p(x_i))\\ &= y_i \cdot (p(x_i) \cdot (1 - p(x_i))) \end{align*}
\begin{align*} \frac{\partial}{\partial \beta} (1 - y_i) \log [1 - p(x_i)] &= (1 - y_i) \cdot (\frac{\partial}{\partial \beta} \log [1 - p(x_i)])\\ & = (1 - y_i) \cdot \frac{1}{1 - p(x_i)} \cdot p(x_i) \cdot (1 - p(x_i))\\ & = (1 - y_i) \cdot p(x_i) \end{align*}
$$\frac{\partial}{\partial \beta} L(\beta) = \sum_{i=1}^n \Bigl[ y_i \cdot (p(x_i) \cdot (1 - p(x_i))) + (1 - y_i) \cdot p(x_i) \Bigr]$$
So basically I used the product and chain rules to compute the derivative. I am afraid that my solution is wrong, because Hastie et al.'s The Elements of Statistical Learning (page 120) gives the gradient as:
$$\sum_{i = 1}^N x_i(y_i - p(x_i;\beta))$$
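One way to test a candidate gradient is to compare it against finite differences of $L(\beta)$. The sketch below does this for the expression above from the book, in plain Python on small random data. The data, the seed, and the function names (`log_likelihood`, `book_gradient`, `numerical_gradient`) are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, xs, ys):
    # L(beta) = sum_i [ y_i log p(x_i) + (1 - y_i) log(1 - p(x_i)) ]
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(xj * bj for xj, bj in zip(x, beta)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def book_gradient(beta, xs, ys):
    # Candidate gradient from the book: sum_i x_i (y_i - p(x_i; beta))
    grad = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(xj * bj for xj, bj in zip(x, beta)))
        for j, xj in enumerate(x):
            grad[j] += xj * (y - p)
    return grad

def numerical_gradient(beta, xs, ys, h=1e-6):
    # Central differences of L, one coordinate of beta at a time
    grad = []
    for j in range(len(beta)):
        up = list(beta)
        up[j] += h
        down = list(beta)
        down[j] -= h
        diff = log_likelihood(up, xs, ys) - log_likelihood(down, xs, ys)
        grad.append(diff / (2.0 * h))
    return grad

# Small synthetic data set (illustrative only)
rng = random.Random(0)
n, d = 20, 3
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
ys = [rng.randint(0, 1) for _ in range(n)]
beta = [rng.gauss(0.0, 1.0) for _ in range(d)]

analytic = book_gradient(beta, xs, ys)
numeric = numerical_gradient(beta, xs, ys)
max_abs_diff = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_abs_diff)  # a small value means the two gradients agree
```

Note that both gradients here are vectors of length $d$, one entry per component of $\beta$, so any candidate expression for $\frac{\partial}{\partial \beta} L(\beta)$ should have that shape as well.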
I don't know what could have gone wrong. Any advice on this?
