Let's say there are 2 towns, each with different public transport networks. I want to know which modes of transport people over the age of 30 tend to use more than the average for the population, and I want to test whether it's the same in both towns or different. I go and collect some data and it looks like this:

    Mode     | Area1 Total | Area1, 30+ | Area2 Total | Area2, 30+
    Car      | 1098        | 100        | 1024        | 50
    Train    | 1024        | 9          | 326         | 5
    Bike     | 900         | 56         | 134         | 90
    Unicycle | 10          | 2          | 51          | 10
    Llama    | 2           | 2          | 1           | 0
    Carpet   | 50          | 2          | 100         | 10

I would like to know, for each mode of transport, whether more or fewer over-30s use it in a given town than would be expected by chance. Let's say 50% of town A are over 30, and yet 70% of people who take the train are over 30. If age had nothing to do with transport preference, we would expect about 50% of train users to be over 30, so this deviates from our expectations; the question is whether the deviation is statistically significant.

In my line of work, we use 95% confidence as an (admittedly arbitrary) cutoff for quoting statistics. So, what is the difference between expectation and observation, and what are the confidence intervals on that difference?
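One rough way to put numbers on this is sketched below. It is an illustration under stated assumptions, not a recommended final analysis: it assumes each person is counted under exactly one mode, it compares the 30+ share among users of a mode with the 30+ share among users of all other modes (so the two groups do not overlap), and it uses a plain Wald interval, which is unreliable for the tiny counts (Llama, Unicycle). The function name `diff_vs_rest` is made up for this example.

```python
import math

# Area 1 counts copied from the table in the question: mode -> (all users, users aged 30+).
area1 = {
    "Car": (1098, 100),
    "Train": (1024, 9),
    "Bike": (900, 56),
    "Unicycle": (10, 2),
    "Llama": (2, 2),
    "Carpet": (50, 2),
}

Z95 = 1.96  # approximate standard normal quantile for a two-sided 95% interval


def diff_vs_rest(counts, mode, z=Z95):
    """Difference in the 30+ share between users of `mode` and users of every other mode.

    Returns (difference, lower, upper), using a Wald interval for the
    difference of two independent proportions.
    """
    n_mode, x_mode = counts[mode]
    n_rest = sum(n for m, (n, _) in counts.items() if m != mode)
    x_rest = sum(x for m, (_, x) in counts.items() if m != mode)
    p_mode = x_mode / n_mode
    p_rest = x_rest / n_rest
    se = math.sqrt(p_mode * (1 - p_mode) / n_mode + p_rest * (1 - p_rest) / n_rest)
    diff = p_mode - p_rest
    return diff, diff - z * se, diff + z * se


for mode in area1:
    diff, lo, hi = diff_vs_rest(area1, mode)
    print(f"{mode:9s} diff={diff:+.3f}  95% CI=({lo:+.3f}, {hi:+.3f})")
```

An interval that excludes zero suggests the mode's age mix differs from that of the other modes at roughly the 5% level, before any correction for looking at several modes at once.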

Next up, what is the difference between the two towns? Are any modes of transport preferred by the over 30s in one town more than they are in the other, and again, what is the difference, and what are the confidence intervals on that difference?

I'm fairly sure it's quite simple, but I'm stuck, and any help would be greatly appreciated! The data can't be re-gathered, but I'm open to any ideas on how to run the stats on it.

EDIT:

I think I've found the answer to the overall problem. Fisher's Exact Test can test, for example:

    Train: | Town A | Town B
    30+    | 100    | 200
    30-    | 200    | 400

This will determine whether the age split among train users differs significantly between town A and town B, that is, whether age group and town are associated for that mode of transport.
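For reference, here is a minimal sketch of what a two-sided Fisher exact test computes for a 2x2 table like the one above, written with only the standard library so the arithmetic is visible. The function name `fisher_exact_two_sided` is made up for this example; in practice a statistics package would normally be used instead.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    With all margins held fixed, sums the hypergeometric probabilities of
    every table that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def weight(k):
        # Number of ways to put k in the top-left cell with the margins fixed.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    extreme = sum(
        w for w in (weight(k) for k in range(k_min, k_max + 1)) if w <= observed
    )
    # Exact integer arithmetic until the final conversion to float.
    return float(Fraction(extreme, total))


# The illustrative table from the question: rows are 30+ / 30-, columns are Town A / Town B.
print(fisher_exact_two_sided(100, 200, 200, 400))  # prints 1.0
```

The example table has the same 1:2 age ratio in both towns, so the observed table is the most probable one given its margins and the p-value is 1: no evidence at all of a difference.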

I'll write it up as an answer when I'm vaguely coherent. I need sleep now.

(For anyone who's interested, this is actually part of a study into genome evolution in plants. The people are individual genes, the towns are two different genomes, and the "over 30s" is whether they appear to be part of a set of duplicate genes. Modes of transport are in fact chemical reactions which the gene affects. I'm interested in whether plants gain and lose copies of genes randomly, or because they are involved in specific processes, and whether plants which have differing ecologies gain or lose duplicates of genes in response to this. I'm a bit lost myself, so I thought I'd work out an analogy which didn't require detailed knowledge of plant genomics to answer the question.)

  • The problem with the Fisher test is that it's normally used for 2x2 tables. You probably don't want to analyze a hundred 2x2 tables at the 5% level; your type I error rate would blow up. Most of the notions involved can be extended to a larger table, but you may neither need nor want that. (2012-12-20)
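To make the type I error point concrete, here is a small back-of-the-envelope calculation. It assumes the hundred tests are independent and that every null hypothesis is true, which is a simplification; the Bonferroni threshold is shown only as one common, conservative remedy.

```python
alpha = 0.05
n_tests = 100  # "a hundred 2x2's", as in the comment

# Chance of at least one false positive across all tests,
# assuming independence and that every null hypothesis holds.
familywise = 1 - (1 - alpha) ** n_tests

# Bonferroni: test each table at alpha / n_tests so the
# familywise error rate stays at or below alpha.
bonferroni_threshold = alpha / n_tests

print(f"familywise error rate: {familywise:.3f}")
print(f"per-test threshold:    {bonferroni_threshold:.4f}")
```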

1 Answer


There are a variety of ways of tackling a problem of this kind and size. The problem with looking for differences on one variable (say, area) is that there are other important variables (like age). If you ignore such variables you invite a host of problems: you may fail to pick up important differences, or find differences that are illusory or even reversed from the true direction. This is familiar from regression and ANCOVA, where an effect can change direction from a univariate analysis once you take another important covariate into account; the simple analysis may be misleading.

For count data, there are generalized linear models. For contingency tables, an analysis using loglinear models (a subset of GLMs) would be fairly common.

The advantage of using GLMs is that you can model count data in much the way you would use ANOVA, ANCOVA and regression for continuous variables (with assumptions of normality): you can build suitable, interpretable models that let you draw conclusions about (say) the relative odds of taking one mode of transport rather than another in the two towns, for a given age group.
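As a hand-computed illustration of the kind of quantity such a model estimates (a sketch, not a fitted GLM): for a 2x2x2 slice of the table, the age-by-mode-by-town interaction in a saturated loglinear model with treatment coding is the log of the ratio of the two towns' odds ratios, with the usual Wald variance equal to the sum of reciprocal cell counts. The sketch uses only Car and Train from the question's table, with the under-30 counts obtained as total minus 30+; it would break on zero cells such as Llama. The function name `log_odds_ratio` is made up for this example.

```python
import math

# Cell counts as (aged 30+, under 30); under-30 is total minus 30+ from the question's table.
cells = {
    "Area1": {"Train": (9, 1024 - 9), "Car": (100, 1098 - 100)},
    "Area2": {"Train": (5, 326 - 5), "Car": (50, 1024 - 50)},
}


def log_odds_ratio(town):
    """Log odds ratio of Train vs Car for 30+ relative to under 30, plus its Wald variance."""
    t_old, t_young = cells[town]["Train"]
    c_old, c_young = cells[town]["Car"]
    log_or = math.log((t_old * c_young) / (t_young * c_old))
    var = 1 / t_old + 1 / t_young + 1 / c_old + 1 / c_young
    return log_or, var


lor1, var1 = log_odds_ratio("Area1")
lor2, var2 = log_odds_ratio("Area2")

# Ratio of odds ratios (Area1 relative to Area2) with an approximate 95% Wald interval.
log_ratio = lor1 - lor2
se = math.sqrt(var1 + var2)
ratio = math.exp(log_ratio)
ci_low = math.exp(log_ratio - 1.96 * se)
ci_high = math.exp(log_ratio + 1.96 * se)

print(f"odds ratio Area1: {math.exp(lor1):.3f}")
print(f"odds ratio Area2: {math.exp(lor2):.3f}")
print(f"ratio of odds ratios: {ratio:.3f}  95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

A ratio near 1 would mean the association between age and choosing Train over Car is about the same in both towns. A full loglinear model extends the same idea to all modes at once.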

[It's probably better to work with your actual problem, though, avoiding jargon where convenient and explaining it where necessary. Incidentally, if you're going to explain the actual problem you have, stats.stackexchange.com is likely to have a greater concentration of people used to working on exactly this kind of problem. I'm a statistician myself (and there are certainly statisticians here), but over there that's pretty much everyone, and some of them will very likely be familiar with problems like your actual one.]

If you'd like (here or there), I could try to help come up with models that would answer the kind of questions you're talking about.

Incidentally, in regard to your 'age' variable, it would be more typical to give the data as "Age <30" vs "Age 30+" rather than as "total" vs "subgroup". You can't directly compare the total with the subgroup, because the subgroup is part of the total; you'd have to split the 'total' group up to make the comparison anyway.

Do you have suitable software, something that will fit GLMs / loglinear models?
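Splitting the totals is simple subtraction. A minimal sketch, using the counts from the question's table (the function name `split_by_age` and the dictionary keys are made up for this example):

```python
# Counts copied from the question's table: mode -> (total, aged 30+).
area1 = {"Car": (1098, 100), "Train": (1024, 9), "Bike": (900, 56),
         "Unicycle": (10, 2), "Llama": (2, 2), "Carpet": (50, 2)}
area2 = {"Car": (1024, 50), "Train": (326, 5), "Bike": (134, 90),
         "Unicycle": (51, 10), "Llama": (1, 0), "Carpet": (100, 10)}


def split_by_age(counts):
    """Turn (total, 30+) pairs into non-overlapping under-30 and 30+ counts."""
    return {
        mode: {"under_30": total - over_30, "30_plus": over_30}
        for mode, (total, over_30) in counts.items()
    }


for name, counts in (("Area1", area1), ("Area2", area2)):
    for mode, cell in split_by_age(counts).items():
        print(f"{name} {mode:9s} <30: {cell['under_30']:5d}   30+: {cell['30_plus']:4d}")
```

Note that the split produces some zero cells (for example Llama in both areas), which many methods handle poorly.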

Is this something you're looking to publish? Is it for a thesis? Some coursework?

  • Thanks - I will look for it but you'll likely have a bunch of answers in short order. But likely more (seemingly) silly questions to begin with. (2012-12-20)