
I'm a little confused about the residuals vs. fitted values plot. I take data in a table and create a scatter plot. Then I run a linear regression computation using my calculator or another program, which gives me a linear regression line in the form $y=mx+b$. I then create a table that calculates the estimated or predicted value of $y$ for each input $x$. I compare this predicted value with the actual value to compute the difference, or "residual".

I then plot this on a graph showing the $x$ values on the horizontal axis and the residual values (the differences) on the vertical axis.

This is how my text shows it. However, when I look online at other stats sites I see the residuals plotted with the "Fitted Value" on the $x$-axis. This does NOT seem to be the same thing as the original $x$ values from the table. And to confound things further, I read that the most common residual plots show the fitted value. What is the fitted value? Why isn't it the same as the original $x$ values?

[Image: example of a residuals vs. fitted values plot]

Am I misunderstanding this?

  • 2
    Don't worry! Both are correct. For simple linear regression (only one predictor variable), it is common to plot the residuals vs. $X$. But when you get to multiple linear regression ($Y = a + bX_1 + cX_2$), which $X$ should you plot on the $x$-axis? This is why people often plot residuals vs. fitted values. 2017-02-13

1 Answer

Residuals vs. $x$. In simple linear regression (one predictor variable $x$), it is common to display a plot of residuals vs. $x.$ This is a 'diagnostic' plot to see if assumptions of the regression model are met. Ideally, you will see no patterns in this plot.
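As a minimal sketch of how the residuals in such a plot are computed, the following Python fits a least-squares line to made-up data (not the flight data discussed below) and forms each residual as the observed value minus the fitted value. The function name `fit_line` and the numbers are illustrative assumptions.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up illustrative data (an assumption, not from the answer).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
fits = [b0 + b1 * xi for xi in x]                 # fitted (predicted) values
residuals = [yi - fi for yi, fi in zip(y, fits)]  # observed minus fitted

# A residuals-vs-x plot would show the pairs (x[i], residuals[i]).
for xi, ri in zip(x, residuals):
    print(xi, round(ri, 3))
```

With an intercept in the model, least-squares residuals always sum to zero (up to rounding), so the points scatter around the horizontal line at 0.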

A common departure from assumptions is that the variability increases as $x$ increases. Here is an example. Scheduled flight times of non-stop flights were regressed on miles traveled. Notice that residuals 'fan out' (increase in variability) as distance increases.

[Plot: residuals vs. miles for the flight-time regression]

This means that one cannot get reliable prediction intervals for flight times based on miles alone. The difficulty here turned out to be that direction of travel is not taken into account. [Eastbound flights take less time because they often have tail winds; westbound flights often have headwinds. The greater the distance, the greater the extra variability due to the ignored variable, direction.]
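The 'fanning out' can also be checked numerically rather than by eye. The sketch below uses made-up data (an assumption, not the flight data) whose deviations from a line grow in proportion to $x$, then compares the typical size of the residuals for small and large $x$. The helper name `residuals_from_line_fit` is hypothetical.

```python
from statistics import mean

def residuals_from_line_fit(x, y):
    """Fit y = b0 + b1*x by least squares and return the residuals."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Made-up data: deviations from the line alternate in sign and
# grow in proportion to x, mimicking the 'fan' shape.
x = [float(i) for i in range(1, 21)]
y = [10 + 2 * xi + 0.1 * xi * (1 if i % 2 == 0 else -1)
     for i, xi in enumerate(x)]

res = residuals_from_line_fit(x, y)

# Compare typical residual size for the smaller and larger x values
# (x is already sorted here, so the halves can be taken by position).
half = len(x) // 2
spread_low = mean(abs(r) for r in res[:half])
spread_high = mean(abs(r) for r in res[half:])
print(spread_low, spread_high)
```

A clearly larger value for the upper half is the numerical counterpart of the fan shape in the residual plot.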

Residuals vs. Fits. If you plot residuals against fits for the same regression as above, the result will look essentially the same because fits are a linear function of 'Miles' ($x$). More generally, fits are $\hat Y = \hat \beta_0 + \hat \beta_1 x.$

[Plot: residuals vs. fits for the same regression]
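To spell out why the two plots agree, here is the step in symbols. The notation $e_i = Y_i - \hat Y_i$ for the $i$th residual is introduced here for illustration. Each point of the residuals vs. fits plot is

$$(\hat Y_i,\; e_i) = (\hat \beta_0 + \hat \beta_1 x_i,\; e_i),$$

so the vertical coordinates are exactly those of the residuals vs. $x$ plot, and the horizontal coordinates are only shifted by $\hat \beta_0$ and rescaled by $\hat \beta_1.$ If $\hat \beta_1 < 0,$ the picture is also mirrored left to right, but any pattern (or lack of one) is preserved.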

So why bother to plot residuals vs. fits at all? The answer lies in multiple regression, in which you have several predictor variables $x_1, x_2, \dots, x_k.$

For multiple regression, it is usually more helpful to look first at a single plot of residuals vs. fits (as per the Comment by @knrumsey). One could also look at residuals vs. $x_1$, residuals vs. $x_2,$ and so on; at some point that may be useful in discovering that one of the predictor variables is causing a problem with assumptions.

But as a start, it is best to look at residuals vs. fits. If all is well in that plot, move along to the next diagnostic procedure. For the same data as above, here is a plot of residuals vs. fits when travel direction is taken into account ($x_1 =$ distance, $x_2=$ direction). On the whole it is a much more satisfactory residual plot. [The two points at the far right are trans-oceanic flights, which should ideally be removed from the dataset because rules for flights over oceans are much different than for flights over land.]

[Plot: residuals vs. fits with distance and direction as predictors]
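For readers who want to see where the fits come from with two predictors, here is a rough sketch. It solves the least-squares normal equations directly so that it needs no external libraries; in practice a statistics package would do this step. The data (distance in hundreds of miles, a 0/1 direction indicator, time in minutes) and the function names are made up for illustration and are not the data behind the plot above.

```python
def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef

def fit_multiple(columns, y):
    """Least-squares coefficients (intercept first) and the design matrix."""
    design = [[1.0] + list(row) for row in zip(*columns)]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty), design

# Made-up data: x1 = distance (hundreds of miles), x2 = 1 if westbound.
x1 = [3.0, 5.0, 8.0, 10.0, 12.0, 15.0, 18.0, 20.0]
x2 = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
y = [68.0, 103.0, 124.0, 167.0, 176.0, 223.0, 248.0, 283.0]

coef, design = fit_multiple([x1, x2], y)
fits = [sum(c * v for c, v in zip(coef, row)) for row in design]
residuals = [yi - fi for yi, fi in zip(y, fits)]

# One plot summarizes both predictors: the pairs (fits[i], residuals[i]).
for fi, ri in zip(fits, residuals):
    print(round(fi, 2), round(ri, 2))
```

Each fit combines both predictors into one number, which is why a single residuals vs. fits plot can stand in for separate plots against $x_1$ and $x_2$ as a first check.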

Note: In a discussion of simple linear regression, you may find an author who (embarrassingly) goes on and on about fundamental differences between plots of residuals against $x$ and residuals against fits. I hope a comparison of the first two plots above will help you to understand that there is no fundamental difference in simple linear regression.

Addendum after Comments: If you have the data in the order collected, it is a good idea to plot residuals in the order of data collection. Here is an experiment in which a regression line fits nicely through the data (not shown), and the plot of residuals vs. fits looks fine, but the plot of residuals vs. order shows early residuals to be mainly negative and later ones to be mainly positive. A lab assistant re-calibrated the measurement device about halfway through the experiment.

[Two plots: residuals vs. fits, and residuals vs. order of data collection]
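A small sketch of the same idea, using made-up numbers rather than the experiment's data: a hypothetical instrument offset changes sign halfway through collection, while the $x$ values arrive in no particular order. Comparing the mean residual of the early and late readings exposes the shift.

```python
from statistics import mean

# Made-up data in order of collection; x is deliberately NOT sorted.
x = [5, 1, 8, 3, 10, 2, 7, 4, 9, 6, 6, 2, 9, 4, 1, 7, 3, 10, 5, 8]
# Hypothetical instrument offset: -0.3 for the first ten readings, +0.3 after.
offset = [-0.3] * 10 + [0.3] * 10
y = [2 + 0.5 * xi + oi for xi, oi in zip(x, offset)]

# Ordinary least-squares line through all twenty points.
x_bar, y_bar = mean(x), mean(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Residuals are already in collection order, so split by position.
early = mean(residuals[:10])
late = mean(residuals[10:])
print(early, late)
```

Because both halves cover the same range of $x$ in this example, the residuals show no trend against the fits; only the ordering by time of collection reveals the jump.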

  • 0
    Yes, what a fantastic and complete answer to this question! In a similar vein, I suppose it IS possible to plot the linear regression line and have it look very linear, yet see a curved pattern in the residual plot, which says a linear model may not be the best fit. 2017-02-13
  • 0
    There are several issues here: (1) linearity, (2) equal variance, (3) normality of errors, (4) independence of observations, etc. Even if nonlinearity is slight, you should be able to see it if you look carefully at a plot of the regression line through the data. However, the scale of the plot of residuals makes it much easier to spot nonlinearity, and (as we have seen) unequal variances. 2017-02-13
  • 0
    The Addendum to the Answer shows the importance of looking at residuals vs. order. This applies to real data in which observations are recorded in order of collection, not to textbook data where observations are sorted in $x$ or $y$ order. 2017-02-13