I know that the basic difference between the two is that in the maximum likelihood approach the parameter vector (say $w$) is treated as a fixed constant, while in the Bayesian approach we make use of a prior probability, which also lets us express the uncertainty in the estimated value of $w$. But I don't understand the general difference between the two, as I am a beginner in machine learning and statistics. In the maximum likelihood approach we solve for $w$ and predict the result for a new input (say $x$) as $x^\top w$. What do we solve for in the Bayesian approach, and how are new values predicted? Also, how does Bayesian regression automatically choose the model complexity and avoid the problem of overfitting?
How does Bayesian regression differ from maximum likelihood regression?
1 Answer
I think when you said 'prior probability', you meant 'posterior probability'.
In a Bayesian approach, you infer a posterior probability distribution for the parameters $w$, so your prediction is an average $E(x^\top w)$, where $E$ is taken over the posterior distribution. In the maximum likelihood approach you infer a point estimate $\hat w$ and your prediction is $x^\top \hat w$.
(This distinction isn't as important when everything's linear like this... In the Bayesian case, since $E(x^\top w) = x^\top E(w)$, we might as well say that we have a point estimate $\hat w = E(w)$ and that our prediction is $x^\top \hat w$. For nonlinear models, the distinction is important.)
You're right that the posterior probability distribution gives a sense of the uncertainty in the estimated value of the parameters. However, in the maximum likelihood method you can also get a sense of uncertainty in the prediction via confidence intervals/regions.
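To make the contrast concrete, here is a minimal sketch (my own illustration, not from the answer) of Bayesian linear regression with a conjugate Gaussian prior and known noise variance, compared with the maximum likelihood (ordinary least squares) fit. The toy data, prior variance, and noise level are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2*x + noise, with a column of ones for the intercept.
n = 30
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])       # design matrix, shape (n, 2)
y = 2.0 * x + rng.normal(scale=0.3, size=n)

noise_var = 0.3 ** 2                       # noise variance, assumed known
prior_var = 1.0                            # prior: w ~ N(0, prior_var * I)

# Maximum likelihood: a single point estimate (here, ordinary least squares).
w_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Bayesian: a full Gaussian posterior N(w_mean, w_cov) over the parameters.
w_cov = np.linalg.inv(X.T @ X / noise_var + np.eye(2) / prior_var)
w_mean = w_cov @ X.T @ y / noise_var

# Prediction at a new input: ML gives x_new^T w_ml; the Bayesian predictive
# mean is x_new^T E(w) = x_new^T w_mean, and the posterior also yields a
# predictive variance that quantifies uncertainty in the prediction.
x_new = np.array([1.0, 0.5])               # intercept term plus x = 0.5
pred_ml = x_new @ w_ml
pred_bayes_mean = x_new @ w_mean
pred_bayes_var = noise_var + x_new @ w_cov @ x_new

print(pred_ml, pred_bayes_mean, pred_bayes_var)
```

Here the Bayesian predictive mean is just $x^\top E(w)$, as noted above, but the posterior also supplies a predictive variance that a point estimate alone does not.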
- Then how is Bayesian regression more useful than maximum likelihood? And why is there no problem of overfitting? I also read that we don't need to manually decide the model complexity. – 2017-01-17
- First of all, it's good to keep all the information in the posterior. Reducing it to a point estimate throws information away, and that's bad (though in this linear case it didn't matter for our *point* prediction). I wouldn't say there's no problem of overfitting, but it's true that in Bayesian models an informative prior can reduce overfitting, just like regularization/penalized regression (see the sketch after these comments). 'Not needing to manually define the model complexity' also works along these lines (but remember, it requires a strong prior). Details are best left to a separate question (you should probably ask over at stats). – 2017-01-17
- Here is a stats post on this subject: http://stats.stackexchange.com/questions/82664/bayesian-vs-mle-overfitting-problem. I'd say beware of strong claims like 'Bayesian methods automatically do model selection and never overfit'. – 2017-01-17
- Well, that is great. I am a machine learning enthusiast, but I have no prior knowledge of probability and the like. I know how to make simple neural nets, but I want to get a mathematical perspective, so I started reading "Pattern Recognition and Machine Learning" by Bishop, and I feel stuck in places. What do you recommend? How should I approach machine learning? – 2017-01-17
- I'm not really a machine learning expert, so I can't say for sure what the best approach is. As for textbooks, I always liked MacKay's 'Information Theory, Inference, and Learning Algorithms'. 'The Elements of Statistical Learning' by Hastie et al., or, for an intro, 'An Introduction to Statistical Learning' by James/Witten/Hastie, are the usual standards. There is also a lot of lecture video and course material out there, like http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/ (not sure it's mathematical enough for you). You might want to post a reference request at stats.stackexchange. – 2017-01-17
- Okay, thanks for your time. – 2017-01-17
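As a follow-up to the comment above on informative priors and regularization, here is a small sketch (my own illustration, with made-up data and hyperparameters) showing that, for a Gaussian prior $w \sim N(0, \tau^2 I)$ and Gaussian noise of variance $\sigma^2$, the posterior mean coincides with ridge regression with penalty $\lambda = \sigma^2/\tau^2$:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: minimise ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def posterior_mean(X, y, noise_var, prior_var):
    """Posterior mean under w ~ N(0, prior_var * I) and Gaussian noise."""
    d = X.shape[1]
    cov = np.linalg.inv(X.T @ X / noise_var + np.eye(d) / prior_var)
    return cov @ X.T @ y / noise_var

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.5, size=20)

noise_var, prior_var = 0.25, 2.0
# The two estimates agree when lam = noise_var / prior_var.
print(np.allclose(ridge(X, y, noise_var / prior_var),
                  posterior_mean(X, y, noise_var, prior_var)))   # -> True
```

This is one concrete sense in which a strong prior plays the same role as a regularization penalty.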