
It's been a while since I took a statistics course, but this question came to mind the other day.

Let's suppose that I am looking at Salary data, but the only data provided is the quartiles. For example:

Q1 = 25th percentile = 40 000

Q2 = 50th percentile = 70 000

Q3 = 75th percentile = 100 000

Assuming that we have a normal distribution and the above information, is it possible to calculate any given percentile? If so, how?

Any help would be appreciated. Thanks!

  • 0
    Normal distributions are symmetric around their mean, hence they satisfy the relation $2Q_2=Q_1+Q_3$, which your data does not. One should first try to remedy this; the rest is easy. 2011-09-26
  • 0
    Good point. I've updated the question to make the data symmetric. It'll be good to get an answer to this question, but it looks like I'm asking the wrong question :) 2011-09-26
  • 1
    Plus: salary data also cannot be normal, since it is not allowed to be negative. 2011-09-26
  • 0
    @GEdgar, however, believe it or not, people DO model such data with Gaussian distributions. 2011-09-26

4 Answers

4

The Gaussian random variable must be centered at $Q_2$, and its first and third quartiles must be at $Q_1$ and $Q_3$ respectively. Since the first and third quartiles of the Gaussian random variable with mean $m$ and variance $\sigma^2$ are at $m-0.6745\sigma$ and $m+0.6745\sigma$ respectively (here $0.6745\approx\Phi^{-1}(0.75)$, with $\Phi$ the standard normal CDF), one gets $m=Q_2$ and $\sigma=(Q_2-Q_1)/0.6745=(Q_3-Q_2)/0.6745$. With the quartiles in the question, $\sigma\approx 30\,000/0.6745\approx 44\,478$.

Edit: About $5.8\%$ of this distribution falls on the negative part of the real axis. This is usually considered an acceptable trade-off between plausibility (since all the data should be nonnegative) and practicability (since Gaussian models are so convenient).
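As a rough illustration of the recipe above, here is a minimal R sketch. It assumes the symmetric quartiles from the question and uses `qnorm(0.75)` for the quartile constant; the helper name `percentile` is made up for this example.

```r
# Quartiles from the question (symmetric version)
q1 <- 40000; q2 <- 70000; q3 <- 100000

# Mean equals the median; sigma follows from the quartile spacing
m <- q2
sigma <- (q3 - q2) / qnorm(0.75)

# Any percentile via the normal quantile function (hypothetical helper)
percentile <- function(p) qnorm(p, mean = m, sd = sigma)
percentile(c(0.10, 0.25, 0.50, 0.75, 0.90))

# Share of the fitted distribution that lies below zero
pnorm(0, mean = m, sd = sigma)
```

Passing 0.25 or 0.75 to `percentile()` should reproduce Q1 and Q3, which is a quick way to confirm the fit.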

  • 0
    From what I recall, σ is the standard deviation. Supposing that this is true, how can I find a given percentile? Thank you so far :) 2011-09-26
  • 0
    Yes, $\sigma$ is the [standard deviation](http://en.wikipedia.org/wiki/Standard_deviation#Definition_of_population_values) and $\sigma^2$ is called the [variance](http://en.wikipedia.org/wiki/Variance). Not sure I understand your new question (one can compute every percentile of a standard Gaussian random variable using [standard tools](http://www.wolframalpha.com/) available on the web since the [CDF](http://en.wikipedia.org/wiki/Cumulative_distribution_function) is known). 2011-09-26
  • 0
    Again, it's been a while since I've done statistics. But that clears it up :) 2011-09-26
  • 0
    No problem. It is all the more satisfying to be understood. 2011-09-26
1

Your data can't come from a normal distribution, because then the distance between Q1 and Q2 would be the same as the distance between Q2 and Q3.

1

What Henning Makholm says is right. But assuming you can correct this error, I think you need to solve the following equation for $\sigma$: $$ 0.75=\int_{-\infty}^{Q_3}\frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-Q_2)^2}{2\sigma^2}}dx $$ You can solve it numerically.

After you get the variance you can easily standardize to get any quantile you want.
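One way to do that numerical solve is with R's `uniroot()`; the sketch below is only an illustration under the question's symmetric quartiles. The search interval and the helper name `quantile_at` are assumptions made for the example.

```r
q2 <- 70000; q3 <- 100000

# Root of F(Q3; mean = Q2, sd = sigma) - 0.75, viewed as a function of sigma
f <- function(sigma) pnorm(q3, mean = q2, sd = sigma) - 0.75
sigma <- uniroot(f, interval = c(1, 1e6), tol = 1e-9)$root

# Standardize: x = Q2 + sigma * (standard normal quantile at alpha)
quantile_at <- function(alpha) q2 + sigma * qnorm(alpha)
quantile_at(c(0.10, 0.90))
```

The root should agree with the closed form `(q3 - q2) / qnorm(0.75)`, so the root finder is mainly useful when the same idea is applied to a distribution without a convenient inverse.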

  • 0
    Once I have the variance, how can I find the percentile? It has been some time since I reviewed my statistics texts :) 2011-09-26
  • 0
    To get the quantile at level $\alpha$ (with $0<\alpha<1$), remember that if $X\sim N(Q_2, \sigma^2)$ then $\frac{X-Q_2}{\sigma} \sim N(0,1)$, so you can use the CDF $\Phi$ of this standardized normal variable, i.e. $$ \Phi\left(\frac{x-Q_2}{\sigma}\right)=\alpha \;\Rightarrow\; x=Q_2+\sigma\,\Phi^{-1}(\alpha) $$ 2011-09-28
  • 0
    If you're trying to solve an applied problem, I'd recommend using [R](http://www.r-project.com), where you don't even have to "understand" the formulae since you "speak" and "write" in R about means, standard deviations, shapes, etc. Also, the vector handling in R is absolutely fantastic for doing statistics. 2011-09-28
1

If you fit the quantiles to a known distribution, you can calculate any percentile with the distribution's quantile function, which is the inverse of the CDF. However, with only 3 quantiles, many different 3-parameter distributions can be made to fit, so you need to choose the distribution beforehand. If possible, you should get some raw data or more quantiles. See this link, which also has some handy R code for fitting quantiles to a distribution using optim() and the distribution's quantile function.
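The general idea of fitting with `optim()` and a quantile function might look like the sketch below. This is not the code from the link; it is a minimal least-squares illustration that assumes a normal model, works in thousands so the optimizer is reasonably scaled, and uses made-up names (`loss`, `fitted_mean`, `fitted_sd`).

```r
# Quartiles from the question, expressed in thousands
p <- c(0.25, 0.50, 0.75)
q <- c(40, 70, 100)

# Squared error between model quantiles and the observed quantiles.
# par[1] = mean, par[2] = log(sd); the log keeps sd positive.
loss <- function(par) sum((qnorm(p, mean = par[1], sd = exp(par[2])) - q)^2)

# Nelder-Mead is optim()'s default method
fit <- optim(par = c(mean(q), log(sd(q))), fn = loss)
fitted_mean <- fit$par[1]
fitted_sd <- exp(fit$par[2])

# Any other percentile from the fitted model, e.g. the 90th (in thousands)
qnorm(0.90, mean = fitted_mean, sd = fitted_sd)
```

Swapping `qnorm` for another distribution's quantile function (and adjusting the parameter vector) gives the same pattern for other models.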

I've found that income/salary data are best fit by a generalized (aka shifted, aka 3-parameter) log-logistic distribution. The log-logistic also has the advantage of having a closed-form quantile function which is easy to calculate and easy for the optimization library to use. I ended up having to write my own shifted log-logistic quantile function after not finding exactly what I wanted in available R packages:

# Shifted/generalized log-logistic quantile function
# http://en.wikipedia.org/wiki/Shifted_log-logistic_distribution
# The qllog3() function from package FAdist appears to be a different parameterization
# location = mu, scale = sigma, shape = xi; p is a probability in (0, 1)
qshllogis <- function(p, location = 0, scale = 1, shape = 0) {
    if (shape == 0) {
        # Limit as shape -> 0: revert to the logistic distribution
        return(qlogis(p, location, scale))
    }
    # Closed-form inverse of the shifted log-logistic CDF
    scale * ((1/p - 1)^(-shape) - 1) / shape + location
}
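A quick way to sanity-check a quantile function like this is a round trip through the matching CDF. The sketch below assumes the parameterization on the Wikipedia page cited in the comments; `pshllogis` is a hypothetical helper written for this check, and the parameter values are illustrative only, not fitted to any data. The quantile function is repeated in compact form so the block runs on its own.

```r
# Quantile function, repeated from the answer so this block is self-contained
qshllogis <- function(p, location = 0, scale = 1, shape = 0) {
  if (shape == 0) return(qlogis(p, location, scale))
  scale * ((1/p - 1)^(-shape) - 1) / shape + location
}

# CDF in the same parameterization (hypothetical helper name)
pshllogis <- function(x, location = 0, scale = 1, shape = 0) {
  if (shape == 0) return(plogis(x, location, scale))
  1 / (1 + (1 + shape * (x - location) / scale)^(-1 / shape))
}

p <- c(0.05, 0.25, 0.50, 0.75, 0.95)
# Illustrative parameter values only
x <- qshllogis(p, location = 70000, scale = 25000, shape = 0.2)
round_trip <- pshllogis(x, location = 70000, scale = 25000, shape = 0.2)
all.equal(round_trip, p)
```

In this parameterization the median equals `location`, which gives another easy check on the implementation.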