
I'm trying to wrap my head around how to interpret probabilities. Specifically, I work in sociolinguistics where language items (e.g. presence or absence of R at the end of words in English) are treated as variables whose realizations are associated with extra-linguistic variables. One takes a corpus of speech, finds all the tokens of a particular variable, determines the relative frequencies for the realizations of that variable and analyzes it according to various social variables (e.g. age of speakers, sex of speakers, etc.). The relative frequencies are typically converted to probabilities which are thought by some to represent the grammatical rules of the speakers in the corpus.

The problem I have with this is at the point where relative frequencies are converted to probabilities. Language is thought to be constantly changing, but to me, probabilities seem to suggest that there's a static underlying system that generates the data that's being observed. A simple example: if one measures how many nights in a row the moon comes out, one can calculate a relative frequency from that and then go further and calculate a probability that it will come out on any given night. That probability will likely be 100% because what's actually being observed is a static system where the moon is orbiting the Earth according to physical laws. A probability in this case seems to be simply a calculation of an observation of the results produced by a law in action that we have not yet identified.

If this is an accurate description of what probabilities actually represent, then does it make sense to calculate them when working with a system that is constantly changing, e.g. language and its grammatical rules? A grammatical rule that says whether speakers of a particular dialect of English should pronounce an R at the end of words or not is not static; it's expected to change. If a group of people produce an R for 65% of all tokens, that may very well mean that they used to not produce an R but are starting to produce it more and more. I'm not sure there's utility in converting that relative frequency into a probability if the relative frequency simply represents the state the system was in at the time, rather than a static system in which every single speaker makes sure to produce R 65% of the time and not the rest of the time.

If my question points to an ongoing debate in probability theory as opposed to something that's been settled among mathematicians, what are the major arguments for how to interpret probabilities? Are there important articles in the literature that I could read to understand how mathematicians approach this topic?

  • Maybe you can find some insight here: https://en.m.wikipedia.org/wiki/Heteroscedasticity (2017-01-30)
  • As a general matter, I'd say that you need to specify an underlying model. Then, yes, it is possible that observed statistics will allow you to calibrate the parameters of that model. A priori, it certainly isn't clear to me what underlying factors drive your $65\%$. My sense (rather vague) was that there were identifiable laws which govern (or at least influence) patterns in linguistic change. If so, it is certainly possible to imagine a model in which such laws are at least loosely quantified and keyed to (possibly) probabilistic factors. (2017-01-30)

1 Answer


My view is that a probability is anything that satisfies the probability axioms: if the theorems of probability apply to it, it's a probability.

There are two main interpretations of probability: the Bayesian and frequency interpretations. Under the Bayesian interpretation, a probability represents a degree of confidence. Under the frequency interpretation, a probability represents the frequency of an event in an infinite number of trials (or, if you prefer to avoid talking about infinities, what the frequency approaches as you do more and more trials).
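As a small illustration of the frequency interpretation, here is a hypothetical simulation (not data from any corpus): an event with a fixed underlying probability, whose observed frequency settles toward that probability as the number of trials grows.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# An event with a fixed, known underlying probability (0.65 is an
# illustrative value, echoing the 65% figure in the question).
p = 0.65

def observed_frequency(n_trials):
    """Simulate n_trials independent trials and return the frequency."""
    hits = sum(random.random() < p for _ in range(n_trials))
    return hits / n_trials

# Under the frequency interpretation, these should approach p = 0.65.
for n in (10, 1000, 100000):
    print(n, observed_frequency(n))
```

With few trials the frequency wanders; with many, it clings to 0.65. That is the sense in which a frequency "is" a probability under this interpretation.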

There used to be fierce debate in statistics over which interpretation was appropriate for statistical analysis. For example, if you read the introduction to Feller's 1950 probability textbook, he carefully says that probability theory is for statistical probability, which he defines according to the frequency interpretation, and that if you want to learn about degrees of confidence you should go get a textbook on inductive logic. The 1939 textbook by Jeffreys, on the other hand, takes the Bayesian degree-of-confidence interpretation. As an example of the debate you could read "Confidence Intervals vs Bayesian Intervals", a talk by the Bayesian E. T. Jaynes, with responses by frequentist statisticians and counter-responses by Jaynes. The tone is often unfriendly.

However, I think statisticians are by now more or less over the debate and use both interpretations fluidly. That makes sense, because to me it's a non-issue: frequencies obviously obey the probability laws, and degrees of confidence do under certain conditions, so both are equally valid interpretations.

There's a great article on interpretations of probability in the Stanford Encyclopedia of Philosophy.

So, back to your scenario. There are actually many interpretations of this probability you could take. For example, you could consider the Bayesian degree of confidence that a speaker you meet on the street will pronounce the R the first time you hear it from them. Out of this abundance of choices, I'm going to focus on one interpretation: the fraction of tokens for which a particular individual (say, the next speaker you meet) pronounces the R. I'll explain in a bit why I think I can pick an interpretation arbitrarily like this.

This number has only a theoretical existence: you can't actually ask the speaker to pronounce every token, so what we're really talking about is, counterfactually, what they would say if they were asked to. Let's call this number $p_R$.

So, what are you doing with your corpus? You say you are "converting" to a probability, but I think the more appropriate term is "estimating." "Estimation" is the standard term in statistics for guessing an unknown number based on random data that has some connection to that number. So, what you are doing is estimating $p_R$ from the frequencies within the corpus. This will be a good estimate if the corpus comes from people similar to the next speaker you meet. But if there's a lot of variation (the speakers in the corpus are from a variety of times, or a variety of places, making them unlike that speaker), it will be a worse estimate.
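To make the estimation framing concrete, here is a minimal sketch with made-up corpus counts (the numbers are illustrative assumptions, not real data): the relative frequency is the point estimate of $p_R$, and it comes with sampling uncertainty.

```python
import math

# Hypothetical corpus counts (illustrative, not from a real corpus):
# tokens where a final R could appear, and how many had it pronounced.
n_tokens = 400
n_with_r = 260

# Point estimate of p_R: the relative frequency in the corpus.
p_hat = n_with_r / n_tokens  # 0.65

# A rough 95% interval via the normal approximation, as a reminder
# that an estimate from a finite corpus carries sampling uncertainty.
se = math.sqrt(p_hat * (1 - p_hat) / n_tokens)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"p_R estimate {p_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

The interval quantifies only sampling noise; it says nothing about whether the corpus resembles the speaker you care about, which is the separate issue discussed next.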

This is why I think the choice of interpretation is a bit arbitrary: the frequency in the corpus is a good estimate of more than one thing. It's a good estimate of $p_R$, but it's also a good estimate of, for example, the frequency in a second, unobserved corpus, or the frequency you would observe if you could hear all conversations happening right now. Once you let go of the idea that you are doing a strictly logically justified conversion, it can serve as an estimate of many things.

However, you may notice that I've been relegating the issue of variation within the corpus to some realm outside of probability theory. I'm saying that if there's a lot of temporal variation it's a bad estimate of the language as currently spoken, but I'm treating that as a qualitative matter, not subject to calculation. In fact, there are mathematical tools within probability theory for dealing with exactly this kind of thing.

Suppose that the language changes every year, and consider some probability of interest, say the probability of pronouncing the R, an event we'll name $R$.

Let $P(R|Y=y)$ be the probability of pronouncing the R in year $y$. This is a conditional probability.

Then, suppose you have a corpus, which is from a distribution of years. Let $P(Y=y)$ be the probability of a conversation from year $y$ ending up in your corpus.

Now, let's define the marginal probability $P(R)$ to be the probability of a random pronunciation that makes it into your corpus having the R pronounced.

This marginal probability is related to the conditional probability by the law of total probability:

$$P(R) = \sum_y P(R|Y=y)P(Y=y)$$

The frequency you observe will likely be close to $P(R)$.

You can view this as a weighted average. (Consider that if each of the last ten years has equal probability of inclusion, then $P(Y=y)=1/10$, there are ten conditional probabilities, and $P(R)$ is the average of them.) If what you're really interested in is $P(R|Y=2016)$, then $P(R)$ will only be close if $P(Y=y)$ is high for nearby years, since those will have similar conditional probabilities.
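The weighted average can be computed directly. In this sketch the ten conditional probabilities $P(R|Y=y)$ are invented values representing gradual change over the years 2007–2016 (assumptions for illustration, not estimates), and the inclusion weights are uniform, so the marginal $P(R)$ works out to the plain average of the conditionals, $0.625$:

```python
# Conditional probabilities P(R | Y=y) for ten hypothetical years of
# gradual change (invented values: 0.40 in 2007 rising to 0.85 in 2016).
years = list(range(2007, 2017))
p_r_given_y = {y: 0.40 + 0.05 * (y - 2007) for y in years}

# Uniform inclusion weights: each year equally likely in the corpus.
p_y = {y: 1 / len(years) for y in years}

# Law of total probability: P(R) = sum_y P(R|Y=y) P(Y=y)
p_r = sum(p_r_given_y[y] * p_y[y] for y in years)
print(f"P(R) = {p_r:.3f}")
```

Note how far this marginal (0.625) sits from the invented current-year value $P(R|Y=2016) = 0.85$: a corpus spread evenly over a decade of change estimates the average of the decade, not the present.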

One keyword of interest might be hierarchical model: this is a simple hierarchical model in which there's a distribution over years, and then, below that in the hierarchy, a distribution over whether the R is pronounced.
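A hierarchical model like this is also easy to simulate: draw a year first, then draw the pronunciation given that year. With invented conditional probabilities as assumptions, the simulated frequency lands near the marginal $P(R) = \sum_y P(R|Y=y)P(Y=y)$:

```python
import random

random.seed(1)  # fixed seed for reproducibility

# Two-level (hierarchical) sampling with invented probabilities:
# level 1 draws a year, level 2 draws the pronunciation given that year.
years = list(range(2007, 2017))
p_r_given_y = {y: 0.40 + 0.05 * (y - 2007) for y in years}  # assumptions

def sample_token():
    y = random.choice(years)              # level 1: P(Y=y) = 1/10
    return random.random() < p_r_given_y[y]  # level 2: P(R | Y=y)

n = 100000
freq = sum(sample_token() for _ in range(n)) / n
print(freq)  # close to the marginal given by the law of total probability
```

The simulation and the weighted-average formula are two views of the same model: averaging over which year a token comes from is exactly what the marginal $P(R)$ computes.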

So, you asked whether this conversion from a frequency to a probability assumes a static system. My answer is: if what you are interested in is the language as spoken at the present moment, then a corpus of more recent speech provides a better estimate than a corpus going back several decades. As I type that out, I feel a bit stupid for saying something so obvious-sounding. To go back to being philosophical and fancy, I'd say you should reframe the question from "what do we have to assume in order for this conversion to be logically justified?" to "under what conditions do we get a good or bad estimate of the thing we're really interested in?"