
I have a question in a survey, X, that can be rated from 1 to 10 (ordinal). The answers can be split into group A and group B.

I want to know whether the mean of group A's answers differs significantly from that of group B. Which test is best for this, and how can I do it in SPSS?

Thank you very much for your help!

2 Answers


This seems to be a two-sample test with Groups 1 (of size $n_1$) and 2 (of size $n_2$). Your data are scores from 1 to 10 on the question.

Welch t test. If $n_1$ and $n_2$ are large enough (perhaps both above 20), you might be able to get a reliable answer using a Welch 2-sample t-test.

Wilcoxon test. You are almost sure to have many ties (repeated scores) even if both sample sizes are relatively small. Thus you will get messages about ties when trying to do a Wilcoxon rank-sum test, along with an approximate P-value or a statement that a P-value is not available (depending on the software you use).

Permutation test. Perhaps it is best to do a permutation test. Under the null hypothesis that the two groups tend to give the same responses to the question, the argument is that the scores could be permuted between Groups A and B without effect. So if we choose some measure of difference such as the difference $D = \bar X_1 - \bar X_2$ between the two sample means, we can use either combinatorics or simulation to get the null permutation distribution of $D$, and judge whether your observed value of $D$ is consistent with the null distribution.

Example. I will illustrate each kind of test using fake data with 25 subjects in each group (although none of the tests require sample sizes to be equal).

Here are listings and summaries of some fake data to use for testing.

x1; summary(x1)
##  9  6  4 10  5  5  8  8  8  8  8  9  8  6  4  7  8  9  8  6  8  8  5  8  9
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  4.00    6.00    8.00    7.28    8.00   10.00 

x2; summary(x2)
## 10  9 10  7  7  8  8 10  8  5  7  7  7  5  8 10 10 10  9  7  9 10 10 10  9
##  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.0     7.0     9.0     8.4    10.0    10.0 

A quick look shows the mean to be greater in Group 2 than in Group 1. Is this difference statistically significant?

t test: A Welch 2-sample t test in R statistical software finds a significant difference. (P-value $\approx$ 2%.) The only doubt is whether data are sufficiently nearly normally distributed for the t test to give accurate results. (Data for both groups spectacularly fail a Shapiro-Wilk test with P-values < .01. But sample sizes may be large enough for the t test to be useful anyhow.)

t.test(x1, x2)

##        Welch Two Sample t-test

## data:  x1 and x2 
## t = -2.434, df = 47.853, p-value = 0.01872
## alternative hypothesis: true difference in means is not equal to 0 
## sample estimates:
## mean of x mean of y 
##      7.28      8.40 
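The Shapiro-Wilk check mentioned above can be reproduced directly. This is a sketch: the fake data are rebuilt from the seed given at the end of this answer.

```r
# Rebuild the fake data from the seed given at the end of this answer
set.seed(1234)
x1 = ceiling(10*rbeta(25, 3, 2))
x2 = ceiling(10*rbeta(25, 5, 1))

# Shapiro-Wilk normality tests; both P-values fall below .01,
# so normality is firmly rejected for both groups
shapiro.test(x1)
shapiro.test(x2)
```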

Wilcoxon test: The Wilcoxon test, for a difference in location, gives a (tentative) P-value of about 2.7%, but warns that it may not be accurate. There are only seven uniquely different values among the 50 subjects, so the number of ties is 'massive'; the Wilcoxon test is based on a comparison of ranks, which can be problematic when there are many ties. I would not want to trust the result of the Wilcoxon test here.

wilcox.test(x1, x2)

##     Wilcoxon rank sum test with continuity correction

## data:  x1 and x2 
## W = 200.5, p-value = 0.02702
## alternative hypothesis: true location shift is not equal to 0 

## Warning message:
## In wilcox.test.default(x1, x2) : cannot compute exact p-value with ties

Permutation test. It would be tedious to derive the exact permutation distribution of $D$ for this example. The usual cure is to simulate a large number of permutations and to approximate the P-value from the simulation results. Here is a brief program in R to find the approximate P-value (2.1%) of the permutation test. (You may get a slightly different P-value at each run of the program, but not different enough to matter in the interpretation; for this program, subsequent runs all gave values rounding to 2%.)
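For a sense of scale, an exact test would have to account for every way of splitting the 50 pooled scores into two groups of 25; R's built-in choose function shows why simulation is the practical route:

```r
# Number of distinct ways to split the 50 pooled scores
# into two groups of 25 each
choose(50, 25)
## 1.264106e+14  (about 126 trillion splits)
```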

m = 10^4;  d.perm = numeric(m)
all = c(x1, x2);  d.obs = mean(x1) - mean(x2)
n1 = n2 = 25
for (i in 1:m) {
  perm = sample(all, n1+n2)
  d.perm[i] = mean(perm[1:n1]) - mean(perm[(n1+1):(n1+n2)])
  }
mean(abs(d.perm) >= abs(d.obs))
## 0.0215

Here is a histogram of the approximate permutation distribution. The solid red line at the left is the observed value of $D$ for the data above. The dotted red line at the right is just as extreme (far from 0) as the observed value of $D.$ The P-value of this 2-sided permutation test is the percentage of values in the permutation distribution outside these red lines, in this case, 2.1%.

[Histogram of the approximate permutation distribution; the solid red line marks the observed $D$, the dashed red line marks $-D$]

Conclusion: The two groups differ significantly. The t test is probably OK, because, for samples this large, the Central Limit Theorem tends to make the sample means very nearly normal even if the data are not normal. For groups as small as ten, I would certainly insist on seeing permutation test results before drawing a conclusion.

You can read more about permutation tests in this paper by Eudey. The two-sample test above is discussed, with additional examples, in Section 4.

Almost certainly, your data will look different from my fake data. Please let me know if you have trouble relating my answer to your specific data.

Note: The fake data above were generated from populations with respective means about 3/5 and 5/6 using the R code below. (So it is appropriate that the tests found a significant difference.) By using the same seed I used, you should get exactly the same data.

set.seed(1234)
x1 = ceiling(10*rbeta(25, 3, 2))
x2 = ceiling(10*rbeta(25, 5, 1))

Addendum (Your Data from Comment). Your result in the Comment seems OK. The P-value of 9.3% is below 10%, so the result is significant at the 10% level; that is sometimes optimistically called "suggestive" of significance.

If you honestly expected (before seeing data) Gp2 scores to be higher, then maybe this should be a left-sided test of $H_0: \mu_1 \ge \mu_2$ vs. $H_a: \mu_1 < \mu_2.$ If so, the P-value would be 3.8%, which is significant at the 5% level.

x1 <- c(0,7,10,0,9,5,10,6,8,7,8,2,2,8,10,7,10) 
x2 <- c(7,4,10,10,9,10,10,9,10,7,5,10,10,10,10,5,10,2)  
all = c(x1, x2);  gp = rep(0:1, times = c(17,18))
stripchart(all ~ gp, method="stack", pch=19, col=c("blue", "green3"))

[Stacked stripchart of the 35 scores, Group 1 in blue and Group 2 in green]

The Welch t test gives P-value 0.09024. Here is a repeat of the permutation test with m = 10^6 iterations, to reduce the possibility of simulation error.

x1 <- c(0,7,10,0,9,5,10,6,8,7,8,2,2,8,10,7,10) 
x2 <- c(7,4,10,10,9,10,10,9,10,7,5,10,10,10,10,5,10,2) 
m = 10^6;  d.perm = numeric(m) 
all = c(x1, x2);  d.obs = mean(x1) - mean(x2) 
n1 = length(x1);  n2 = length(x2) 
for (i in 1:m) {
  perm = sample(all, n1+n2)
  d.perm[i] = mean(perm[1:n1]) - mean(perm[(n1+1):(n1+n2)])
}
mean(abs(d.perm) >= abs(d.obs)) 
## 0.093149
## 0.093183  # 2nd run with m=10^6

mean(d.perm < d.obs)
## 0.038349  # P-value of LEFT SIDED test


length(unique(d.perm))
## 75        # uniquely different sim. values of D (enough)

hist(d.perm, prob=T, col="skyblue2", main="Simulated Permutation Distribution")
abline(v=d.obs, col="red", lwd=2)
abline(v=-d.obs, col="red", lwd=2, lty="dashed")

[Histogram of the simulated permutation distribution for these data, with the observed $D$ (solid red line) and $-D$ (dashed red line) marked]

Note: If this is for a reviewed paper, you might get criticism (as noted by @Nameless) that the permutation test involves taking sample means of ordinal data. Possible nonparametric, ordinal-oriented alternatives:

(a) Use the median instead of the mean in the permutation test when finding d.obs and (within the loop) when finding d.perm, but not at the end when finding the P-value. (In R, the mean of a logical vector is the proportion of its TRUEs.) The trouble is that I got only about 20 uniquely different values of d.perm that way; not quite enough for my taste. One-sided P-value 0.047.
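Here is a sketch of alternative (a). The set.seed call is my addition for reproducibility; your simulated P-value will vary slightly from run to run.

```r
# Permutation test on the difference of MEDIANS (x1, x2 from the Addendum)
x1 <- c(0,7,10,0,9,5,10,6,8,7,8,2,2,8,10,7,10)
x2 <- c(7,4,10,10,9,10,10,9,10,7,5,10,10,10,10,5,10,2)
set.seed(1234)                 # my addition, for reproducibility
m = 10^5;  d.perm = numeric(m)
all = c(x1, x2);  d.obs = median(x1) - median(x2)   # medians, not means
n1 = length(x1);  n2 = length(x2)
for (i in 1:m) {
  perm = sample(all, n1+n2)
  d.perm[i] = median(perm[1:n1]) - median(perm[(n1+1):(n1+n2)])
}
mean(d.perm < d.obs)   # left-sided P-value; mean of logicals = proportion of TRUEs
```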

(b) Do a Welch t test on rank-transformed data. (Ranks are appropriate for ordinal data, and their means are likely not far from normal with sample sizes above 15.) From t.test(rank(all) ~ gp, alternative="less"), I get a (Welch, one-sided) P-value of 0.03457.
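Alternative (b), spelled out as self-contained code (the grouping vector gp is rebuilt here, as in the stripchart step above):

```r
# Welch t test on rank-transformed scores
x1 <- c(0,7,10,0,9,5,10,6,8,7,8,2,2,8,10,7,10)
x2 <- c(7,4,10,10,9,10,10,9,10,7,5,10,10,10,10,5,10,2)
all = c(x1, x2)
gp  = rep(0:1, times = c(length(x1), length(x2)))  # 0 = Group 1, 1 = Group 2
t.test(rank(all) ~ gp, alternative = "less")       # one-sided; P = 0.03457 per the answer
```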

  • Wow! Thank you very much for this detailed explanation. I have read about the permutation test and have chosen it, as it sounds solid to me. I will give it a try tomorrow with SPSS and will ask you again if I have trouble adapting it to my data. My sample sizes are n1 = 17 and n2 = 16. (2017-01-31)
  • Sample sizes of 16 and 17 _might_ be big enough that the t test is OK. You can get R for free from `www.r-project.org` (Windows, Mac, or Linux). You could use my code (above the pictures). Start with `x1 = c(5,7,8, ...)`, similarly for `x2`, and set `n1 = 17; n2 = 16`. It should run. Or give me your `x1` and `x2` and I'll try it. In any case, I'd be happy to hear how it works out. (2017-01-31)
  • I have started with R and used my data vectors x1 and x2. What is the threshold P-value to use? 0.05? How did you plot this nice histogram? I would love to use such a histogram in my master's thesis :-) Thank you again so much for your extraordinary help! (2017-01-31)
  • `x1 <- c(0,7,10,0,9,5,10,6,8,7,8,2,2,8,10,7,10) x2 <- c(7,4,10,10,9,10,10,9,10,7,5,10,10,10,10,5,10,2) m = 10^4; d.perm = numeric(m) all = c(x1, x2); d.obs = mean(x1) - mean(x2) n1 = 17 n2 = 18 for (i in 1:m) { perm = sample(all, n1+n2) d.perm[i] = mean(perm[1:n1]) - mean(perm[(n1+1):(n1+n2)]) } mean(abs(d.perm) >= abs(d.obs)) ## 0.0946` I hope that is correct? (2017-01-31)
  • Really nice and helpful! "If you honestly expected (before seeing data) Gp2 scores to be higher, then maybe this should be a left-sided test." Yes, I expected this, or rather hoped for this result ;-) So for the left-sided test, the null hypothesis would be "the means of group 1 and group 2 are equal" and the alternative hypothesis would be "the mean of group 2 is higher than group 1's"? (2017-01-31)
  • OK. See several further edits in the Addendum to the Answer. One more idea several hours later: consider a Success to be a score of 9 or 10. Then Gp 1 has 5 Successes in 17; Gp 2 has 12 in 18. A one-sided 'Fisher Exact Test' for proportions has P-value 0.03. It has an _ad hoc_ flavor, but might be worth mentioning. (2017-02-01)
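The Fisher exact test suggested in the last comment can be sketched as follows. The success counts are recomputed from the x1 and x2 vectors given in the Addendum, taking 'Success' to mean a score of 9 or 10, as in the comment:

```r
# Fisher exact test: 'Success' = score of 9 or 10
x1 <- c(0,7,10,0,9,5,10,6,8,7,8,2,2,8,10,7,10)
x2 <- c(7,4,10,10,9,10,10,9,10,7,5,10,10,10,10,5,10,2)
s1 = sum(x1 >= 9);  s2 = sum(x2 >= 9)        # 5 and 12 Successes
tab = rbind(Gp1 = c(s1, length(x1) - s1),    # Gp 1: 5 successes, 12 failures
            Gp2 = c(s2, length(x2) - s2))    # Gp 2: 12 successes, 6 failures
fisher.test(tab, alternative = "less")       # one-sided; P about 0.03 per the comment
```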

Since your response scale is ordinal, the appropriate test is the Wilcoxon rank-sum test. It tests whether the ordinal values tend to be higher in one group than in the other.

A significance test comparing the means, as a t test would do, is not appropriate in your case, since computing a mean requires a cardinal scale, which you do not have.

I do not use SPSS, but I am positive it has the Wilcoxon test since it is a very standard one. Good luck!

  • Thank you very much for your response. I read something about scales and now I am wondering whether my response scale really is ordinal. The participants of the survey were asked to rate how easy it was to use a piece of software, from 1 to 10. Is this ordinal? (2017-01-30)
  • Wilcoxon rank-based tests do not react well to ties in the data, which will almost certainly occur when data are integers between 1 and 10. Reliable software will show a warning message. So I am not sure a two-sample Wilcoxon test is appropriate here. (2017-01-31)
  • Yes, it's ordinal. Cardinal would require, for example, that the difference between 6 and 5 is the same as the difference between 4 and 3. That is pretty much never fulfilled with such subjective scales. (2017-01-31)