
I don't quite get how logistic loss works for binary classification:

$$\log\left(1+\exp(-y\cdot \mathbf{w}^T\mathbf{x})\right), \quad y\in\{-1,+1\}$$

Minimizing this function for $\mathbf{w}$ seems to me to simply make $\mathbf{w}^T\mathbf{x}$ as large as possible, meaning setting $w_i$ to infinity (negative or positive - depending on $x_i$).

What do I misunderstand?

  • Don't you have multiple observations though? – LinAlg
  • @LinAlg: Yes. Is it that they sort of balance each other out? I understand logistic regression as fitting $f(w)=1/(1+e^{-w^Tx})$, but $f$ isn't exactly a loss function. The loss function would be the sum of all $f(w|x_i)-y_i$. So how does the logistic loss come into play, and why doesn't the issue in the question arise?

1 Answer

$f(w; x_i) = 1/(1+\exp(-y_i w^Tx_i))$ is the probability of observing $y_i$ given $x_i$. Given a set of observations, assuming independence, the likelihood is the product of these functions. Applying a logarithmic transformation does not affect the location of the maximum (merely its value), and negating turns maximization into minimization. You are therefore interested in the $w$ that minimizes $-\log \prod_i f(w; x_i)$, or, equivalently, that minimizes $$\sum_i \log\left(1+\exp(-y_i w^Tx_i)\right).$$ Now you cannot simply let $y_i w^Tx_i$ go to $\infty$ for all $i$ simultaneously: unless the data are perfectly separable, driving the margin $y_i w^Tx_i$ up for some observations drives it down for others, and those terms' losses blow up. The minimizer is then finite.
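A minimal numerical sketch of this point, using a hypothetical non-separable 1-D dataset (the data values below are made up for illustration): because both labels occur among points of the same sign, the summed logistic loss grows without bound as $|w| \to \infty$, so its minimizer is finite.

```python
import numpy as np

# Hypothetical 1-D dataset: both labels occur among positive x values,
# so no single w classifies everything perfectly (not separable).
x = np.array([1.0, 2.0, -1.0, 1.5])
y = np.array([+1, -1, -1, +1])

def logistic_loss(w):
    """Sum over observations of log(1 + exp(-y_i * w * x_i))."""
    return np.sum(np.log1p(np.exp(-y * w * x)))

# Scan w over a grid: the loss is minimized at a finite w,
# and blows up toward both ends of the range.
ws = np.linspace(-10.0, 10.0, 2001)
losses = np.array([logistic_loss(w) for w in ws])
w_best = ws[np.argmin(losses)]

print(w_best)
print(logistic_loss(w_best) < logistic_loss(10.0))   # loss grows for large w
print(logistic_loss(w_best) < logistic_loss(-10.0))  # and for large negative w
```

Here the term for $(x=2, y=-1)$ penalizes large positive $w$ while the terms with $y=+1$ penalize large negative $w$, which is exactly the balancing the comments hint at.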