This looks like gradient descent for linear regression with multiple features.
Indeed, in this case we choose $x_0=1$ for convenience; more explicitly, $x_0^{(i)}=1$ for every training example $i$.
With this choice, the hypothesis function over the features $x_1,\dots,x_n$, \begin{align*} h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n, \end{align*} can be conveniently written as \begin{align*} h_\theta(x)=\theta_0x_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n=\theta^Tx. \end{align*}

When applying gradient descent to fit the parameters $\theta$, the step that "descends downhill" is \begin{align*} \theta_j:=\theta_j+\alpha\sum_{i=1}^{m}\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)} \quad \text{(simultaneously for all } j\text{)}. \end{align*} For $j=0$, since $x_0^{(i)}=1$, this reduces to \begin{align*} \theta_0:=\theta_0+\alpha\sum_{i=1}^{m}\left(y^{(i)}-h_\theta(x^{(i)})\right). \end{align*} In that sense one may say that "for $\theta_0$ the $x_j$ is not applicable": the factor is simply $1$ and drops out of the formula.
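For concreteness, here is a minimal NumPy sketch of this update. The function name `gradient_descent`, the default values of `alpha` and `n_iters`, and the absence of a $1/m$ scaling factor simply mirror the formula above; they are illustrative choices, not part of your original setup:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.005, n_iters=500):
    """Batch gradient descent for linear regression, using the summed update above.

    X : (m, n) feature matrix WITHOUT the intercept column.
    y : (m,) target vector.
    """
    m = X.shape[0]
    # Prepend the x_0 = 1 column, so theta_0 plays the role of the intercept.
    Xb = np.hstack([np.ones((m, 1)), X])
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residuals = y - Xb @ theta                   # y^(i) - h_theta(x^(i))
        theta = theta + alpha * (Xb.T @ residuals)   # update all theta_j simultaneously
    return theta
```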
The reason you "quickly get to infinity or some very large numbers" is that your $\alpha$ parameter is too large, and in that case gradient descent does not converge. You must find a sufficiently small $\alpha$ for which gradient descent converges, and theory guarantees that such an $\alpha$ exists. Note, however, that if $\alpha$ is too small the algorithm becomes very slow, so in practice you want an $\alpha$ just small enough for gradient descent to converge, and not smaller.
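To see the effect of $\alpha$, you can run the sketch above on toy data (the data, seed, and learning rates below are made up for illustration): with a large $\alpha$ the entries of $\theta$ blow up, while with a smaller one they approach the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Too large: the iterates overshoot and grow without bound (overflow to inf/nan).
print(gradient_descent(X, y, alpha=0.05, n_iters=500))

# Small enough: converges to roughly [3, 2], i.e. theta_0 ≈ 3 and theta_1 ≈ 2.
print(gradient_descent(X, y, alpha=0.005, n_iters=500))
```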