
Let's say there are 2 towns, each with different public transport networks. I want to know which modes of transport people over the age of 30 tend to use more than the average for the population, and I want to test whether it's the same in both towns or different. I go and collect some data and it looks like this:

    Variable | Area1 Total | Area1, 30+ | Area2 Total | Area2, 30+
    Car      | 1098        | 100        | 1024        | 50
    Train    | 1024        | 9          | 326         | 5
    Bike     | 900         | 56         | 134         | 90
    Unicycle | 10          | 2          | 51          | 10
    Llama    | 2           | 2          | 1           | 0
    Carpet   | 50          | 2          | 100         | 10

I would like to know, for each mode of transport, whether the over-30s use it more or less than would be expected by chance. Say 50% of the people in town A are over 30, and yet 70% of the people who take the train there are over 30. If age had nothing to do with transport preference, we would expect about 50% of train users to be over 30, so this looks like a deviation from expectation (whether it is statistically significant depends on the sample size).

In my line of work, we use 95% confidence as an (admittedly arbitrary) cutoff for quoting statistics. So, what is the difference between expectation and observation, and what are the confidence intervals on that difference?
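
One way to put numbers on this, sketched in Python with only the standard library: take the share of over-30s among each mode's users, put a 95% Wilson score interval on it, and divide by the town-wide share of over-30s to get an observed/expected ratio. The counts are the Area1 columns of the table above. The function names are made up for illustration, and treating the town-wide share as a fixed constant is a simplifying assumption (it ignores sampling error in the baseline, and the fact that each mode's users are themselves part of the total).

```python
import math

# Area1 counts from the table: mode -> (all users, users aged 30+)
AREA1 = {
    "Car": (1098, 100),
    "Train": (1024, 9),
    "Bike": (900, 56),
    "Unicycle": (10, 2),
    "Llama": (2, 2),
    "Carpet": (50, 2),
}

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def wilson_interval(successes, n, z=Z_95):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def enrichment(counts):
    """Observed/expected ratio (with interval) of the 30+ share for each mode."""
    total = sum(n for n, _ in counts.values())
    total_30 = sum(k for _, k in counts.values())
    expected = total_30 / total  # assumption: baseline share treated as fixed
    result = {}
    for mode, (n, k) in counts.items():
        low, high = wilson_interval(k, n)
        result[mode] = (k / n / expected, low / expected, high / expected)
    return result


for mode, (ratio, low, high) in enrichment(AREA1).items():
    print(f"{mode:9s} obs/exp = {ratio:5.2f}  (95% interval {low:.2f} to {high:.2f})")
```

An interval that lies entirely above 1 suggests the over-30s are over-represented for that mode, and one entirely below 1 suggests they are under-represented. For tiny counts such as Llama the interval is very wide and should be read with caution.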

Next up, what is the difference between the two towns? Are any modes of transport preferred by the over 30s in one town more than they are in the other, and again, what is the difference, and what are the confidence intervals on that difference?

I'm fairly sure it's quite simple, but I'm stuck, and any help would be greatly appreciated! The data can't be re-gathered, but I'm open to any ideas on how to run the stats on it.

EDIT:

I think I've found the answer to the overall problem. Fisher's Exact Test can test, for example:

    Train: | Town A | Town B
    30+    | 100    | 200
    30-    | 200    | 400

This tests whether, among train users, the proportion who are over 30 differs between town A and town B (equivalently, whether age group and town are associated among train users).
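
For reference, a minimal standard-library sketch of a two-sided Fisher's exact test on a 2x2 table like the one above. The function name `fisher_exact_two_sided` is hypothetical, and the two-sided rule used here (sum the probabilities of all tables that are no more likely than the observed one) is one common convention, not the only one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the probabilities
    of all tables with the same margins that are no more likely than the
    observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def weight(x):
        # Number of ways to get x in the top-left cell with fixed margins.
        # Exact integers, so the comparisons below have no rounding issues.
        return comb(col1, x) * comb(n - col1, row1 - x)

    lowest = max(0, row1 + col1 - n)
    highest = min(row1, col1)
    observed = weight(a)
    extreme = sum(w for w in map(weight, range(lowest, highest + 1)) if w <= observed)
    p_value = extreme / comb(n, row1)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value


# Train users from the example: rows are 30+ / under 30, columns are Town A / Town B.
odds_ratio, p_value = fisher_exact_two_sided(100, 200, 200, 400)
print(odds_ratio, p_value)
```

With the made-up counts in the example the odds ratio is exactly 1, so the test finds no evidence of a difference. Real counts would be substituted for each mode of transport in turn.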

I'll write it up as an answer when I'm vaguely coherent. I need sleep now.

(For anyone who's interested, this is actually part of a study into genome evolution in plants. The people are individual genes, the towns are two different genomes, and the "over 30s" is whether they appear to be part of a set of duplicate genes. Modes of transport are in fact chemical reactions which the gene affects. I'm interested in whether plants gain and lose copies of genes randomly, or because they are involved in specific processes, and whether plants which have differing ecologies gain or lose duplicates of genes in response to this. I'm a bit lost myself, so I thought I'd work out an analogy which didn't require detailed knowledge of plant genomics to answer the question.)

  • Could you explain how the question "if mode of transport is more influenced by age in area 1 or area 2" is addressed by those pairwise comparisons? In this part "difference between observed and expected (108% < X < 120%)" what does that inequality represent? What is $X$? It can't be a difference in percentages like the text suggests. When you talk about 'expected' - expected under what model? Further, I would think answering the question as phrased in your first sentence would involve a different analysis, since it seems to relate to an interaction. (& that's a lot of comparisons -- why 95%?) (2012-12-19)
  • Thanks for having a look - in this case, 95% is just an arbitrary confidence level which is used as convention in my field. (2012-12-19)
  • It occurs to me that I might be better off looking at it flipped 90 degrees. Not sure yet. (2012-12-19)
  • So what parameter is this an interval for: "108% < X < 120%"? What is X there? Why is it expressed as a percentage? Why is that percentage greater than 100? It seems like you're asking about a ratio, not a difference. Could you explain what you intend by "whether the two groups differ in a significant way"? Are you asking for general difference in distribution of proportions in each transport category across a variable? Which comparison is intended (Age (below 30 vs 30+)? Area1 vs area 2? Both? Or the interaction suggested by your original 1st sentence?) (2012-12-19)
  • So if 30% of people in town A are over 30, I would expect 30% of a random selection of people from town A to be over 30. If 60% of train users from town A are over 30, this would be a significant deviation from that expectation. I would like to be able to quantify that deviation, with confidence intervals. (2012-12-19)
  • I thought the best way might be to estimate number observed/expected for each town and each mode of transport, and put confidence intervals on that. So X, the parameter, would be, for example, observed/expected for train users over the age of 30 in town A. (2012-12-19)
  • Then, it would be useful to compare town A to town B - so let's say 30% of all people in town A and 40% of over 30s use the train, and 10% of all people in town B and 40% of over 30s use the train, that's quite a significant difference. But how significant is it? (2012-12-19)
  • Anyway, thanks for the patience - I basically want to know whether the transport choices of over 30s in town A are the same as those of the over 30s in town B, which ones differ the most, and how statistically significant those differences are. So, for example, town A's council could say "our over 30s use the bus less than average, but in town B, they use the bus more than average. Therefore this may be because of differences in our bus systems". (2012-12-19)
  • So if what I'm doing seems ridiculously convoluted, and there's an obviously better way to do it, please tell me! (2012-12-19)
  • Thanks for those; in particular: "whether the transport choices of over 30s in town A are the same as those of the over 30s in town B, which ones differ the most, and how statistically significant those differences are" is close to what's needed to start constructing a reasonable answer. Take care using significant: when you say "that's quite a significant difference", that's the everyday sense of 'an important difference'. When you say "But how significant is it?" you're talking about statistical significance. It's important not to conflate the two, because they mean quite different things. (2012-12-20)
  • The problem with the Fisher test is it's normally for 2x2 tables. You probably don't want to analyze a hundred 2x2's at the 5% level; your type I error rate would blow up. Most of the notions involved can be extended to a larger table, but you may neither need nor want that. (2012-12-20)
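
To put a rough number on the multiple-comparisons point in the last comment, here is a small sketch. It assumes the tests are independent, which separate 2x2 tables cut from the same data generally are not, so treat it as an order-of-magnitude illustration. Both function names are made up for this example.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive in m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_level(alpha, m):
    """Per-test level that keeps the family-wise rate at or below alpha (Bonferroni)."""
    return alpha / m


print(familywise_error_rate(0.05, 100))  # chance of at least one false alarm in 100 tests
print(bonferroni_level(0.05, 100))       # stricter per-test cutoff for 100 tests
```

Under that independence assumption, a hundred tests at the 5% level are almost certain to throw up at least one spurious "significant" result, which is why some correction (Bonferroni is the simplest, and a conservative one) is usually applied.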

1 Answer

There are a variety of ways of tackling a problem of this kind and size. The problem with looking for differences on one variable (say, area) is that there are other important variables (like age). If you ignore such variables you invite a host of problems: failing to pick up important differences, or finding differences that are illusory or even reversed from the true direction of difference. This is familiar from regression and ANCOVA, where an effect can change direction relative to a univariate analysis once you take another important covariate into account, so the simple analysis may be misleading.

For count data, there are generalized linear models. For contingency tables, an analysis using loglinear models (a subset of GLMs) would be fairly common.

The advantage of using GLMs is that you can model count data much as you would use ANOVA, ANCOVA and regression for continuous variables (with assumptions of normality): you can build suitable, interpretable models that let you draw conclusions about (say) the relative odds of taking one mode of transport rather than another in the two towns, for a given age group.

[It's probably better to work with your actual problem, avoiding jargon where convenient, explaining jargon where necessary, though. Incidentally, if you're going to explain the actual problem you have, stats.stackexchange.com is likely to have a greater concentration of people used to working on exactly this kind of problem; I'm a statistician myself (and there are statisticians here, certainly), but over there, that's pretty much everyone, and some of them will very likely have familiarity with problems like your actual one.]

If you'd like (here or there), I could try to help come up with models that would answer the kind of questions you're talking about.

Incidentally, in regard to your 'age' variable, it would be more typical to give the data as "Age <30" vs "Age 30+" rather than as "total" vs "subgroup". You can't compare the total with the subgroup - you'd just have to split the 'total' group up to make the comparison anyway.
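
As a rough sketch of what that split looks like with the counts from the question, and of one quantity a model could then estimate: for each mode, the odds ratio of using it (versus any other mode) for 30+ compared with under 30 in each town, and the ratio of the two towns' odds ratios with a 95% Wald interval. That ratio is the kind of thing a three-way interaction term in a loglinear model for the age x mode x town table would capture. The function names are hypothetical, the 0.5 added to every cell is an assumption made here to cope with zero counts, and the Wald interval is only an approximation that is poor for very small cells.

```python
import math

# Counts from the question's table: mode -> (total, aged 30+) for each area.
AREA1 = {"Car": (1098, 100), "Train": (1024, 9), "Bike": (900, 56),
         "Unicycle": (10, 2), "Llama": (2, 2), "Carpet": (50, 2)}
AREA2 = {"Car": (1024, 50), "Train": (326, 5), "Bike": (134, 90),
         "Unicycle": (51, 10), "Llama": (1, 0), "Carpet": (100, 10)}

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def split_by_age(counts):
    """Turn (total, 30+) pairs into (under 30, 30+) pairs."""
    return {mode: (total - older, older) for mode, (total, older) in counts.items()}


def log_odds_ratio(counts, mode):
    """Log odds ratio (and its variance) of using `mode` for 30+ versus under 30.

    Builds the 2x2 table age group x (this mode / any other mode) and adds 0.5
    to every cell (an assumption) so that zero counts do not break the logarithm.
    """
    split = split_by_age(counts)
    young_mode, old_mode = split[mode]
    young_rest = sum(y for m, (y, _) in split.items() if m != mode)
    old_rest = sum(o for m, (_, o) in split.items() if m != mode)
    cells = [c + 0.5 for c in (old_mode, old_rest, young_mode, young_rest)]
    log_or = math.log(cells[0] * cells[3] / (cells[1] * cells[2]))
    variance = sum(1 / c for c in cells)
    return log_or, variance


def compare_towns(mode):
    """Ratio of the two towns' odds ratios, with a 95% Wald interval."""
    lor1, var1 = log_odds_ratio(AREA1, mode)
    lor2, var2 = log_odds_ratio(AREA2, mode)
    diff = lor1 - lor2
    half = Z_95 * math.sqrt(var1 + var2)
    return math.exp(diff), math.exp(diff - half), math.exp(diff + half)


for mode in AREA1:
    ratio, low, high = compare_towns(mode)
    print(f"{mode:9s} OR(Area1)/OR(Area2) = {ratio:7.3f}  ({low:.3f} to {high:.3f})")
```

An interval that excludes 1 suggests the age effect for that mode differs between the two towns. Running this for every mode is itself a set of multiple comparisons, which a single fitted model handles more gracefully.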

Do you have suitable software - something that will fit GLMs / loglinear models?

Is this something you're looking to publish? Is it for a thesis? Some coursework?

  • I've made a new question over at [stats](http://stats.stackexchange.com/questions/46349/enrichment-analysis-by-gene-duplication-level), which hopefully answers your questions and is a bit clearer. (2012-12-20)
  • Thanks - I will look for it but you'll likely have a bunch of answers in short order. But likely more (seemingly) silly questions to begin with. (2012-12-20)