13

I'm really new to linear regression and am trying to teach myself.

In my textbook there's a problem that asks why $R^{2}$ in the regression of $Y$ on $X$ equals the square of the sample correlation between $X$ and $Y$.

I've been banging my head against this for a while, and I keep getting stuck: the correlation coefficient contains $X$ and $\bar{X}$ terms, whereas the expression for $R^{2}$ seems to contain no such thing.

Can anyone provide a derivation as to why $R^{2}$ is the correlation coefficient squared?

Thanks!

  • 1
    If by $R^2$ you mean the "explained variance", then stats.SE might be a more suitable site for this question. See, for example, [this question](http://stats.stackexchange.com/q/28139/6633) or [this one](http://stats.stackexchange.com/q/20553/6633) for some ideas related to this. – 2013-01-04

4 Answers

13

Suppose that we have $n$ observations $(x_1,y_1),\ldots,(x_n,y_n)$ from a simple linear regression
$$Y_i=\alpha+\beta x_i+\varepsilon_i,\qquad i=1,\ldots,n.$$
Let us denote $\hat y_i=\hat\alpha+\hat\beta x_i$ for $i=1,\ldots,n$, where $\hat\alpha$ and $\hat\beta$ are the ordinary least squares estimators of the parameters $\alpha$ and $\beta$. The coefficient of determination $r^2$ is defined by
$$r^2=\frac{\sum_{i=1}^n(\hat y_i-\bar y)^2}{\sum_{i=1}^n(y_i-\bar y)^2}.$$
Using the facts that
$$\hat\beta=\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^n(x_i-\bar x)^2}$$
and $\hat\alpha=\bar y-\hat\beta\bar x$, we obtain
\begin{align*}
\sum_{i=1}^n(\hat y_i-\bar y)^2
&=\sum_{i=1}^n(\hat\alpha+\hat\beta x_i-\bar y)^2\\
&=\sum_{i=1}^n(\bar y-\hat\beta\bar x+\hat\beta x_i-\bar y)^2\\
&=\hat\beta^2\sum_{i=1}^n(x_i-\bar x)^2\\
&=\frac{[\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)]^2\sum_{i=1}^n(x_i-\bar x)^2}{[\sum_{i=1}^n(x_i-\bar x)^2]^2}\\
&=\frac{[\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)]^2}{\sum_{i=1}^n(x_i-\bar x)^2}.
\end{align*}
Hence,
\begin{align*}
r^2
&=\frac{\sum_{i=1}^n(\hat y_i-\bar y)^2}{\sum_{i=1}^n(y_i-\bar y)^2}\\
&=\frac{[\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)]^2}{\sum_{i=1}^n(x_i-\bar x)^2\sum_{i=1}^n(y_i-\bar y)^2}\\
&=\biggl(\frac{\sum_{i=1}^n(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n(x_i-\bar x)^2\sum_{i=1}^n(y_i-\bar y)^2}}\biggr)^2.
\end{align*}
This shows that the coefficient of determination of a simple linear regression is the square of the sample correlation coefficient of $(x_1,y_1),\ldots,(x_n,y_n)$.
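For readers who like a numerical sanity check, here is a minimal Python sketch (my own, using NumPy with arbitrarily chosen synthetic data) that applies the closed-form OLS expressions above and confirms that the coefficient of determination matches the squared sample correlation:

```python
import numpy as np

# Synthetic data; the coefficients 2.0 and 1.5 are arbitrary choices.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

# OLS estimates, computed from the closed-form expressions above
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
y_hat = alpha_hat + beta_hat * x

# Coefficient of determination: explained variation over total variation
r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# Squared sample correlation between x and y
corr_xy = np.corrcoef(x, y)[0, 1]

assert np.isclose(r2, corr_xy ** 2)  # the two quantities agree
print(r2, corr_xy ** 2)
```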

  • 3
    Could anyone explain the reason for the downvote? – 2017-09-29
8

The complete proof of how to derive the coefficient of determination $R^{2}$ from the squared Pearson correlation coefficient between the observed values $y_{i}$ and the fitted values $\hat{y}_{i}$ can be found at the following link:

http://economictheoryblog.wordpress.com/2014/11/05/proof/

In my eyes it should be pretty easy to understand; just follow the individual steps.
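As a quick numerical illustration of the statement proved at the link (this sketch is mine, not taken from the linked post, and the data are synthetic), the following Python snippet checks that $R^2$ equals the squared Pearson correlation between the observed $y_i$ and the fitted $\hat{y}_i$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=40)
y = 3.0 - 0.8 * x + rng.normal(size=40)  # synthetic data

# Least squares fit; np.polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# R^2 as explained variation over total variation
r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# Squared Pearson correlation between observed and fitted values
corr_y_yhat = np.corrcoef(y, y_hat)[0, 1]

assert np.isclose(r2, corr_y_yhat ** 2)
```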

1

There are many forms of the computation available online (such as the Wikipedia page on the correlation coefficient, http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Pearson.27s_correlation_and_least_squares_regression_analysis ), but note that this is a magical algebraic property of least squares linear regression, not of linear regression in general; the sketch below illustrates the distinction.
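Here is a minimal Python sketch of that distinction (my own construction, with synthetic data and an arbitrarily chosen alternative line; `r_squared` is an illustrative helper, not a library function). It uses the general definition $R^2 = 1 - \mathrm{SSR}/\mathrm{SST}$, which coincides with the explained-variance form for a least squares fit:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=60)
y = 1.0 + 2.0 * x + rng.normal(size=60)  # synthetic data

def r_squared(y, y_hat):
    """General R^2: one minus residual sum of squares over total sum of squares."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

corr2 = np.corrcoef(x, y)[0, 1] ** 2  # squared sample correlation

# For the least squares line, R^2 equals the squared correlation...
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(r_squared(y, intercept + slope * x), corr2)

# ...but for an arbitrarily chosen line it does not.
assert not np.isclose(r_squared(y, 0.5 + 1.0 * x), corr2)
```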

  • 0
    Which equation does not have an $X$ term? – 2012-04-10
0

There are different ways to express $R^2$: some expressions have $\hat{\beta}^{2}\sum_{i}(x_{i}-\bar{x})^{2}$ in the numerator, while others write it using the centered predicted values, $\sum_{i}(\hat{y}_{i}-\bar{y})^{2}$. All of these forms are equivalent, as the check below suggests.
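A short Python check (a sketch with synthetic data of my own choosing) confirming that the two numerators coincide for a least squares fit:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 0.5 + 0.9 * x + rng.normal(size=30)  # synthetic data

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# Two equivalent numerators for R^2 in simple linear regression:
num1 = slope ** 2 * np.sum((x - x.mean()) ** 2)  # beta_hat^2 * sum (x_i - xbar)^2
num2 = np.sum((y_hat - y.mean()) ** 2)           # sum (yhat_i - ybar)^2

assert np.isclose(num1, num2)
```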

References: Dougherty; Gujarati; Wooldridge