2

Let's suppose I do a regression between earnings and age (and suppose I do not know the distribution of earnings). Would it be possible for the residuals to be normally distributed?

I am thinking it would not be possible since earnings only takes on positive values and since the support of the normal is from $-\infty$ to $\infty$, it would not be normal. However, since residuals are errors, they can be both positive and negative, so I am starting to question my hypothesis here.

Any help would be great on whether or not it is possible for residuals to be normal for the scenario I described.

  • 0
    Sure - why not? Of course, such a distribution of residuals probably means you should do a different type of regression. But in many cases, we could certainly cook up a regression that would give a normal distribution on the residuals. 2012-01-25

3 Answers

2

If earnings are always positive then no, the residuals cannot be exactly normally distributed, even though many may be negative: a residual is the observed earnings minus the predicted earnings, so the magnitude of any negative residual is bounded by the highest predicted earnings on the regression line.
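A minimal sketch of that bound, using simulated data rather than real earnings. The log-normal earnings model, the age range, and all variable names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: ages 20-65 and strictly positive, right-skewed earnings.
age = rng.uniform(20, 65, size=500)
earnings = np.exp(9.5 + 0.02 * age + rng.normal(0, 0.6, size=500))

# Ordinary least squares fit of earnings on age (a degree-1 polynomial).
slope, intercept = np.polyfit(age, earnings, deg=1)
fitted = intercept + slope * age
residuals = earnings - fitted

# Because earnings > 0, each residual exceeds -fitted, so the most negative
# residual can never fall below minus the largest fitted value.
lower_bound = -fitted.max()
print("most negative residual:", residuals.min())
print("lower bound:           ", lower_bound)
```

A normal distribution has no such lower bound, which is why the residuals here can be at best approximately normal.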

That may not be the major issue: more important might be issues such as the skewness of earnings distributions at any age, or a non-linear relationship between earnings and age.

  • 0
    It is very unusual for statistics drawn from real life to be normally distributed, and ones that are bounded by definition cannot be. But for many practical purposes this may not matter much. As for the skewness of earnings, there are many people at or near a minimum wage but very few earning millions and far fewer earning billions. A UK household [income distribution graph](http://www.taxresearch.org.uk/Documents/houseincome.jpg) illustrates the point: this is not only earnings and misses out many with high incomes, but the right-skewness is obvious. 2012-01-26
1

The normality assumption may be good (or not), even though you should expect some skewness due to the non-negativity of wages. The only way to find out whether your assumption of normally distributed error terms is reasonable is to test it. To do that, I have some suggestions (as far as my very weak statistics knowledge reaches).

As a first stage, plot your data in several complementary ways, such as box plots and histograms. If your assumption of normally distributed data is bad, it will probably show up at this stage. To complement these diagrams you can also do a normal Q-Q plot. (See for example Wikipedia http://en.wikipedia.org/wiki/Normal_probability_plot).

As a second stage, if you are familiar with hypothesis testing, you can run a formal normality test on the residuals, for example the Shapiro-Wilk test.
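The first two stages could be sketched roughly as follows. The data are simulated (log-normal earnings are an assumption, not something taken from the question), and the Q-Q coordinates are computed without drawing the plot so the example stays free of plotting dependencies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data standing in for the real age/earnings sample.
age = rng.uniform(20, 65, size=300)
earnings = np.exp(9.5 + 0.02 * age + rng.normal(0, 0.6, size=300))

# Fit earnings on age by least squares and collect the residuals.
slope, intercept = np.polyfit(age, earnings, deg=1)
residuals = earnings - (intercept + slope * age)

# Normal Q-Q data: theoretical quantiles vs ordered residuals, plus the
# correlation of the fitted Q-Q line (values near 1 suggest normality).
(theoretical_q, ordered_resid), (qq_slope, qq_intercept, qq_r) = stats.probplot(
    residuals, dist="norm"
)

# Shapiro-Wilk test: a small p-value is evidence against normality.
w_stat, p_value = stats.shapiro(residuals)

# Sample skewness: positive values indicate a long right tail.
skew = stats.skew(residuals)

print(f"Q-Q correlation: {qq_r:.3f}")
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3g}")
print(f"skewness = {skew:.2f}")
```

Plotting `theoretical_q` against `ordered_resid` gives the usual Q-Q picture; points bending away from a straight line in the upper tail are the typical signature of right-skewed residuals.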

Thirdly, if your hypothesis of normally distributed error terms seems bad, try to identify if there is an obvious source causing this, for example outliers. (See http://en.wikipedia.org/wiki/Outlier#Identifying_outliers).
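One common way to flag candidate outliers is Tukey's interquartile-range rule. The sketch below is only one option among several; the helper name `tukey_outliers`, the multiplier `k=1.5`, and the toy residuals are assumptions chosen for illustration.

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Return a boolean mask marking points outside Tukey's fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Points further than k * IQR beyond the quartiles are flagged.
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical residuals with one obviously extreme value.
residuals = np.array([-1.2, -0.4, 0.1, 0.3, 0.5, -0.7, 0.9, 12.0])
mask = tukey_outliers(residuals)
print("flagged residuals:", residuals[mask])
```

A flagged point is a prompt to look at the underlying observation, not an automatic reason to delete it.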

Lastly, there is a lot more that can be done to test your normality assumption (surprise). There are certainly other aspects of the model above that you also need to consider. If you are really interested in these matters, I would recommend buying an introductory book on econometric analysis.

Note also: The normality assumption on your error terms is only important if you want to perform hypothesis tests on the parameters of your model. Look at the Wikipedia page on regression analysis to see which underlying assumptions are made when performing a regression.

I am in no way claiming that the above procedure is the best/only way to proceed.

0

You can always skew-zero transform the y-variable (earnings) if transforming skewed x-variables does not result in normally distributed residuals. Van der Waerden scores would do a good job here, so to begin:

  1. Determine percentile values, $pct_i$, of each y-value based on rank position, $R(y_i)$, after an ascending sort.
  2. Obtain the van der Waerden scores by plugging in the percentile values into the inverse CDF, i.e., $Z_i=\Phi^{-1}(pct_i)$
  3. Then regress $Z$ on age, providing age is not skewed too much.

By construction, van der Waerden scores are approximately standard normal, $\mathcal{N}(0,1)$, so the residuals should now be much closer to normally distributed.
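The three steps above can be sketched as follows. The percentile convention $pct_i = R(y_i)/(n+1)$ is an assumption (it is the usual choice for van der Waerden scores, but the answer does not spell it out), and the data and the helper name `van_der_waerden_scores` are hypothetical.

```python
import numpy as np
from scipy import stats

def van_der_waerden_scores(y):
    """Map values to normal scores via their ranks."""
    y = np.asarray(y, dtype=float)
    ranks = stats.rankdata(y)      # step 1: ranks (ties get average ranks)
    pct = ranks / (len(y) + 1)     # assumed convention: R / (n + 1)
    return stats.norm.ppf(pct)     # step 2: inverse normal CDF

rng = np.random.default_rng(2)

# Hypothetical right-skewed earnings that increase with age.
age = rng.uniform(20, 65, size=400)
earnings = np.exp(9.5 + 0.02 * age + rng.normal(0, 0.6, size=400))

# Step 3: regress the scores Z on age.
z = van_der_waerden_scores(earnings)
slope, intercept = np.polyfit(age, z, deg=1)
residuals = z - (intercept + slope * age)

print(f"mean of Z = {z.mean():.2e}, std of Z = {z.std():.3f}")
print(f"slope on age (in normal-score units) = {slope:.4f}")
```

Dividing by $n+1$ rather than $n$ keeps every percentile strictly inside $(0,1)$, so the inverse CDF never returns an infinite score. The fitted slope is expressed in normal-score units, not in units of earnings.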

To interpret the coefficient on age, just deconvolve.