
(To be honest, I don't know if this is the right SE site for my question, but it is the closest I can think of!)

I'm having an argument with my friend on whether we can infer an association between the two variables from the following scatter plot:

[Image: scatter plot of fertility and child mortality]

I think there is a positive association between fertility and child mortality, while my friend thinks the graph is too ugly for that to be inferred. My argument is that if we take the average of all y-values at each x-coordinate, those averages clearly rise as x increases. Which of us is right?

UPDATE: Well, I computed the correlation coefficient and it turned out to be 0.7, which shows there exists a strong association.

  • 0
    It seems the larger circles represent larger populations. Did you weight the points by size when you got $r = 0.7$? Also (especially without weighting), it would be possible to have a significantly positive correlation for _regions_ and not have such a correlation for individual _families_. I don't suppose that is the case here, but it is possible. Bottom line: I suspect a significantly positive correlation. 2017-02-23

2 Answers

2

The difficult thing about statistics is that rarely is anyone "right." Sometimes your most educated guess about something could be completely wrong, simply due to chance variation in the data you happened to gather. That said, there are a few things you can do to settle the debate.

Do you have the original data, or are you relying solely on the graphic you provided to make inferences? No doubt the graphic shows a positive correlation, but the correlation also appears very weak. If you have the data, you will want to run a hypothesis test. In other words: assuming there is no correlation, what is the probability that you would have observed a correlation at least as strong as the one you did? By convention (an admittedly arbitrary one), a probability below 0.05 is treated as statistically significant.
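One way to make the "assuming there is no correlation" question concrete is a permutation test: shuffle one variable many times and count how often the shuffled data give a correlation at least as strong as the real one. The Python sketch below uses only the standard library. The function names and the ten data points are made up for illustration; they are not the data behind the plot, and a permutation test is a stand-in here for the classical t-test, not the same procedure.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for the null hypothesis of no association."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical numbers for illustration only, NOT the data from the plot.
fertility = [1.5, 1.8, 2.1, 2.4, 3.0, 3.6, 4.2, 4.9, 5.5, 6.1]
child_mortality = [6, 9, 5, 14, 22, 18, 45, 60, 52, 95]

r = pearson_r(fertility, child_mortality)
p = permutation_p_value(fertility, child_mortality)
print(f"r = {r:.2f}, permutation p-value = {p:.4f}")
```

A small p-value here means that random re-pairings of the two variables almost never produce a correlation as strong as the observed one. Note that this unweighted version treats every point equally, regardless of circle size.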

If you have a statistical package (like R, which is free, by the way), then you can test this really easily. Load your data with read.table("filename.txt") or read.csv("filename.csv"), assigning the result to a name such as nameOfDataset. Then run this code:

    lmResults <- lm(responseVariable ~ predictorVariable, data = nameOfDataset)
    summary(lmResults)

In the output you'll find a p-value for the coefficient on the predictor variable. If it's less than .05, you win! If not, you might have to hand it to your friend.
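For a regression with a single predictor, the p-value on the slope is the same one you would get from the usual t-test on the correlation coefficient $r$. As a sketch, assuming $n$ independent observations and roughly normal errors (the question does not state $n$, so it is left as a symbol):

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad t \sim t_{n-2} \ \text{ under } H_0\colon \rho = 0.$$

With the $r = 0.7$ reported in the question, $r/\sqrt{1-r^{2}} = 0.7/\sqrt{0.51} \approx 0.98$, so $t \approx 0.98\sqrt{n-2}$. The statistic grows with the number of points, which is why the same correlation can be inconclusive for a handful of points and highly significant for many.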

  • 0
    Thanks! Do you know of a way to do it in Python? 2017-02-23
1

To me, the most interesting observations are (1) there are two distinct clusters and (2) at least one of the East Asia Pacific countries is a distinct outlier. This is a great starting point for an analysis of root causes and of how other variables, e.g., GNP or education, might contribute. It would be great to highlight high ranges of GNP or education and see how this bivariate relationship is conditioned on these other variables. For example, are the clusters colored differently along these dimensions? This is a great example for showing visualization students how seeing patterns in the data can elicit questions that drive the exploration process.