
This is the rule:
\begin{align*} \theta_j:=\theta_j+\alpha\sum_{i=1}^{m}(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)} \quad (\forall j) \end{align*} I can't get the correct thetas; they quickly go off to infinity or to some very large numbers. My question is: for $\theta_0$, $x_0$ is always 1, right? And so, for $\theta_0$ the $x_j$ is not applicable.

Thank you (Merry Christmas)


1 Answer


This looks like multivariate gradient descent.
In this case, indeed, for convenience we choose $x_0=1$, more explicitly $x_0^{(i)}=1 \space \forall i$.
By making this choice, the hypothesis function over the features $x_1,\dots,x_n$: \begin{align*} h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n \end{align*} can be conveniently written as \begin{align*} h_\theta(x)=\theta_0x_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n=\theta^Tx. \end{align*} When applying gradient descent to fit the $\theta$ parameters, the step that "descends downhill" is \begin{align*} \theta_j:=\theta_j+\alpha\sum_{i=1}^{m}(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)} \quad (\forall j), \end{align*} and for $j=0$, since $x_0=1$, this reduces to \begin{align*} \theta_0:=\theta_0+\alpha\sum_{i=1}^{m}(y^{(i)}-h_\theta(x^{(i)})). \end{align*} In that sense, one may indeed say that "for $\theta_0$ the $x_j$ is not applicable".
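For concreteness, here is a minimal NumPy sketch of that batch update (the function name, data shapes, and default values are my own illustrative choices, not from the question). Prepending a column of ones to the design matrix makes $\theta_0$ just another coordinate of $\theta$, so all parameters are updated by the same line of code:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Batch gradient descent for linear regression (illustrative sketch).

    X : (m, n) array of raw features, y : (m,) array of targets.
    """
    m = X.shape[0]
    # Prepend the x_0 = 1 column so theta_0 is handled like every other theta_j.
    Xb = np.hstack([np.ones((m, 1)), X])
    theta = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        residuals = y - Xb @ theta                   # y^(i) - h_theta(x^(i)) for all i
        theta = theta + alpha * (Xb.T @ residuals)   # simultaneous update of every theta_j
    return theta
```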

The reason you "quickly get to infinity or some very large numbers" is that your $\alpha$ parameter is too large, and in that case gradient descent does not converge. You must find a sufficiently small $\alpha$ for which gradient descent converges; theory says such an $\alpha$ exists. Notice that if $\alpha$ is too small, the algorithm may be very slow, so you want an $\alpha$ just small enough that gradient descent converges, and not smaller.
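A practical way to pick $\alpha$ is to run a few candidate step sizes and watch the squared-error cost $J(\theta)=\tfrac{1}{2}\sum_i(h_\theta(x^{(i)})-y^{(i)})^2$: if the cost grows (or blows up to infinity) over the iterations, $\alpha$ is too large. A rough sketch, reusing the update above; the candidate alphas and helper names are just illustrative:

```python
import numpy as np

def cost(Xb, y, theta):
    # Squared-error cost J(theta) = (1/2) * sum of residuals squared.
    r = y - Xb @ theta
    return 0.5 * np.dot(r, r)

def try_alphas(X, y, alphas=(1.0, 0.1, 0.01, 0.001), iterations=200):
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])
    for alpha in alphas:
        theta = np.zeros(Xb.shape[1])
        costs = []
        for _ in range(iterations):
            theta = theta + alpha * (Xb.T @ (y - Xb @ theta))
            costs.append(cost(Xb, y, theta))
        # A cost that grows (or overflows) signals that this alpha diverges.
        diverged = not np.isfinite(costs[-1]) or costs[-1] > costs[0]
        print(f"alpha={alpha}: final cost {costs[-1]:.3g} "
              f"({'diverging' if diverged else 'converging'})")
```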