Let's say there are 2 towns, each with different public transport networks. I want to know which modes of transport people over the age of 30 tend to use more than the average for the population, and I want to test whether it's the same in both towns or different. I go and collect some data and it looks like this:
Variable | Area1 Total | Area1, 30+ | Area2 Total | Area2, 30+
Car | 1098 | 100 | 1024 | 50
Train | 1024 | 9 | 326 | 5
Bike | 900 | 56 | 134 | 90
Unicycle | 10 | 2 | 51 | 10
Llama | 2 | 2 | 1 | 0
Carpet | 50 | 2 | 100 | 10
I would like to know, for each mode of transport, whether more or fewer over-30s use it (for example, the train in town A, which is Area1 in the table) than might be expected by chance. Let's say 50% of town A are over 30, and yet 70% of people who take the train are over 30. If age had nothing to do with transport preference, we would expect 50%, so this deviates from our expectations; what I need to test is whether the deviation is statistically significant.
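As a rough sketch of what "more or fewer than expected by chance" could mean in practice, one option is an exact binomial test: take the over-30 share of the whole town as the expected proportion and ask how surprising the over-30 count among train users is. The function names below are made up for illustration, and the sketch assumes each person is counted once and that the town-wide share can be treated as a fixed, known value (it actually includes the train users themselves).

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def binom_test_two_sided(k, n, p):
    """Two-sided exact binomial p-value: total probability of all outcomes
    that are no more likely than the observed one."""
    observed = binom_pmf(k, n, p)
    total = sum(pm for pm in (binom_pmf(i, n, p) for i in range(n + 1))
                if pm <= observed * (1 + 1e-7))  # small tolerance for float ties
    return min(1.0, total)

# Area1 counts taken from the table above
area1_total = 1098 + 1024 + 900 + 10 + 2 + 50   # everyone in Area1
area1_over30 = 100 + 9 + 56 + 2 + 2 + 2         # all over-30s in Area1
expected_share = area1_over30 / area1_total     # assumption: treated as fixed

train_total, train_over30 = 1024, 9
p_value = binom_test_two_sided(train_over30, train_total, expected_share)
print(f"expected share {expected_share:.4f}, "
      f"observed share {train_over30 / train_total:.4f}, p = {p_value:.3g}")
```

Running this once per mode of transport means several tests per town, so some multiple-testing correction is probably worth considering.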
In my line of work, we use 95% confidence as an (admittedly arbitrary) cutoff for quoting statistics. So: what is the difference between expectation and observation, and what are the confidence intervals on that difference?
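One possible way to put a 95% interval on "observed minus expected" is to compute a Wilson score interval for the observed proportion and subtract the expected share from both ends. This is only a sketch: `wilson_interval` is a helper written here for illustration, and treating the expected share as a fixed constant (rather than an estimate with its own uncertainty) is an assumption.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Over-30 share of all Area1 travellers, from the table (assumed fixed)
expected_share = 171 / 3084

# Over-30s among Area1 train users, from the table
low, high = wilson_interval(9, 1024)

# Interval for the difference between observation and expectation
diff_low, diff_high = low - expected_share, high - expected_share
print(f"observed - expected: [{diff_low:.4f}, {diff_high:.4f}]")
```

If the resulting interval excludes zero, the observed share differs from expectation at roughly the 95% level.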
Next up, what is the difference between the two towns? Are any modes of transport preferred by the over 30s in one town more than they are in the other, and again, what is the difference, and what are the confidence intervals on that difference?
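For the between-town comparison, one candidate is a confidence interval on the difference between two independent proportions. The sketch below uses the Newcombe hybrid score method, which combines the Wilson intervals of the two proportions; the helper names are hypothetical. Note the assumption built in: it compares the raw over-30 shares among train users, so it does not adjust for the two towns having different overall over-30 shares.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def newcombe_diff_interval(x1, n1, x2, n2, z=1.96):
    """Newcombe hybrid score interval for p1 - p2 (independent samples)."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, z)
    l2, u2 = wilson_interval(x2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper

# Over-30s among train users: Area1 (9 of 1024) vs Area2 (5 of 326), from the table
d, low, high = newcombe_diff_interval(9, 1024, 5, 326)
print(f"difference {d:.4f}, 95% interval [{low:.4f}, {high:.4f}]")
```

An interval that contains zero would mean the data do not show a clear difference between the towns for that mode of transport.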
I'm fairly sure it's quite simple, but I'm stuck, and any help would be greatly appreciated! The data can't be re-gathered, but I'm open to any ideas on how to run the stats on it.
EDIT:
I think I've found the answer to the overall problem. Fisher's Exact Test can test, for example:
Train: | Town A | Town B
30+ | 100 | 200
30- | 200 | 400
This will test whether the proportion of train users who are over 30 differs significantly between town A and town B. (In this made-up example the proportion is one third in both towns, so it would show no difference.)
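To make the idea concrete, here is a minimal standard-library sketch of a two-sided Fisher's exact test on the made-up 2x2 table above. The function name is hypothetical; if SciPy is available, `scipy.stats.fisher_exact` performs the same test and is probably the better choice for real analysis.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    p = sum(pr for pr in (prob(x) for x in range(lo, hi + 1))
            if pr <= observed * (1 + 1e-7))
    return min(1.0, p)

# Made-up train counts: rows are 30+ / 30-, columns are Town A / Town B
a, b, c, d = 100, 200, 200, 400
odds_ratio = (a * d) / (b * c)
p_value = fisher_exact_two_sided(a, b, c, d)
print(f"odds ratio {odds_ratio:.2f}, p = {p_value:.3g}")
```

The odds ratio summarises the direction and size of the difference, while the p-value says whether it is distinguishable from chance.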
I'll write it up as an answer when I'm vaguely coherent. I need sleep now.
(For anyone who's interested, this is actually part of a study into genome evolution in plants. The people are individual genes, the towns are two different genomes, and the "over 30s" is whether they appear to be part of a set of duplicate genes. Modes of transport are in fact chemical reactions which the gene affects. I'm interested in whether plants gain and lose copies of genes randomly, or because they are involved in specific processes, and whether plants which have differing ecologies gain or lose duplicates of genes in response to this. I'm a bit lost myself, so I thought I'd work out an analogy which didn't require detailed knowledge of plant genomics to answer the question.)