

Problem

I'd appreciate some ideas on how to define a formula to estimate the value of a future data point for a continuously sampled event, based on past measurements and their tendency.

At any given time, I have exactly 15 past measurements of the event.

Let's assume that what I'm trying to predict is the free-throw (FT) accuracy (%) of a basketball player in Game 18.
My 15 past measurements are his FT% in Games 3, 4, 5, ..., 16, and 17.


Approach

I'm able to specify three concepts that my formula should account for:

1. Consistency
In Game 18, a player who has been scoring consistently close to his average accuracy (low standard deviation) is likely to do so again, much more so than a player with the exact same average but a much higher deviation.

2. Tendency of accuracy score
E.g.: let's assume that a player had a FT% of 80 in Games 3 to 13. Games 14 to 17 were a disaster, with his FT% dropping to values around 50.
For Game 18, although his average FT% is 72, the data indicate a recent drop in form, and the player is likely to score well below his average.
I'd say 60 would be an acceptable estimate in this scenario.
Giving more weight to recent events should be enough to add this notion to the formula, I think.

3. Tendency of deviation
A player who used to be consistent but has recently shown high deviation is likely to score far from his average again - more so than the overall deviation (as in 1) would suggest.
(E.g., a player with an average FT% of 60 who never scored outside the 50-70 range until Game 11, but whose recent history (Games 12 to 17) has been wildly up and down, with some values in the 90s as well as in the 30s.)

I suppose that this kind of formula will rely heavily on deviation values, adequately weighted to give more importance to the most recent data points, perhaps with some offset based on positive/negative tendencies.
Should probability also play a part here?
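
To make the weighting idea concrete, here's a rough sketch of what I have in mind (Python, using the toy FT% numbers from point 2 and an arbitrary decay factor of 0.8). This is only meant to illustrate the exponential-weighting notion, not a proposed solution:

```python
import numpy as np

# A sketch of "weight recent games more": an exponentially weighted mean and
# standard deviation of the 15 past FT% values (decay factor 0.8 is arbitrary).
ft_pct = np.array([80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 52, 48, 50, 51],
                  dtype=float)   # Games 3..17, oldest first (example from point 2)

decay = 0.8
weights = decay ** np.arange(len(ft_pct) - 1, -1, -1)   # most recent game gets weight 1
weights /= weights.sum()

weighted_mean = np.sum(weights * ft_pct)
weighted_std = np.sqrt(np.sum(weights * (ft_pct - weighted_mean) ** 2))

print(f"weighted mean FT%: {weighted_mean:.1f}")   # pulled toward the recent slump
print(f"weighted std dev:  {weighted_std:.1f}")    # spread around the weighted mean
```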


Remarks

I strongly feel that there must be a standard/known method to perform this kind of analysis, hence this question before trying to craft some intricate approach.
Standard or not, it should encompass the three concepts that I've mentioned above - I'm open to suggestions for additional metrics that could help the estimation!

An important clarification: a formula that outputs an estimated range (confidence region) instead of a single value is also acceptable. For my particular application I'll end up computing a single value from that range, but that's irrelevant to the issue in question.

Thanks!

1 Answer


I would look at this as a time series of FT% over games. What you left out of your discussion was shots taken. A player's playing time will vary from game to game, and the game situation will dictate a lot about team scoring as well as individual scoring. Getting to the foul line also depends, of course, on getting fouled, so there could be games where the player gets few or no opportunities. The variability of the proportion estimate depends greatly on shots taken: with just a few shots it will be highly variable, whether or not the player is consistent in his shooting. This is strictly a statistical sample-size issue. If you have the data in terms of success or failure on individual shots, you could use logistic regression (perhaps including total shots in the game as a covariate, as well as past success rate). But although total shots taken could be useful in fitting a model, you could not use it for prediction, since you will not know how many foul shots the player will take in the next game.
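
As a rough illustration of the logistic-regression route, here is a sketch in Python with statsmodels, using invented shot-level data; the covariates (game number and the player's running FT% before each shot) are just one plausible choice, not a recommendation for your exact application:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical shot-level data: each row is one foul shot (made = 1, missed = 0),
# 6 shots per game for Games 3..17, with a drop in form after Game 13.
rng = np.random.default_rng(0)
game = np.repeat(np.arange(3, 18), 6)
made = rng.binomial(1, np.where(game <= 13, 0.8, 0.5))

# Running FT% *before* each shot, used as a "past success rate" covariate.
running_pct = np.cumsum(made) / np.arange(1, len(made) + 1)
past_rate = np.concatenate(([0.5], running_pct[:-1]))   # neutral value before shot 1

X = sm.add_constant(np.column_stack([game, past_rate]))
fit = sm.Logit(made, X).fit(disp=0)
print(fit.params)   # coefficients for intercept, game number (recency), past rate
```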

A time series approach would look at successive FT percentages over games in their time order. You could use an autoregressive model to predict future FT percentage results as a function of past results; the coefficients of the autoregressive terms determine how much weight is put on the recent past relative to the distant past. For these models, confidence intervals and prediction intervals can be computed (assuming Gaussian errors).
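
For instance, here is a minimal sketch (Python with statsmodels, on an invented 15-game FT% series) of fitting an AR(1) model and producing a one-game-ahead forecast with a prediction interval; the model order is an assumption for illustration only:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Invented FT% series for Games 3..17: steady around 80, then a slump.
ft_pct = np.array([80, 78, 82, 79, 81, 80, 83, 77, 80, 79, 81, 52, 48, 50, 51],
                  dtype=float)

# AR(1) model, i.e. ARIMA(1, 0, 0); the prediction interval below
# assumes Gaussian errors.
res = ARIMA(ft_pct, order=(1, 0, 0)).fit()
forecast = res.get_forecast(steps=1)

print("Point forecast for Game 18:", forecast.predicted_mean[0])
print("95% prediction interval:   ", forecast.conf_int(alpha=0.05)[0])
```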

Of course probability plays a role in the outcomes, and these are probabilistic models. Estimation is generally done using conditional least squares (or conditional maximum likelihood in the case of Gaussian error terms).

A binomial distribution, with the success probability possibly changing from game to game, might be a reasonable assumption for the number of FT successes in a game. If the player takes 20 or more foul shots in the game, the Gaussian approximation to the binomial may be appropriate, which means that Gaussian noise for the time series could be a reasonable assumption. This would not work well for games with, say, just 1-4 foul shots.
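
To see why the number of attempts matters, here is a small check (Python with scipy) comparing the binomial CDF with its normal approximation for a hypothetical 80% shooter with few versus many attempts; the numbers are purely illustrative:

```python
from scipy.stats import binom, norm

# Compare P(successes <= k) under the binomial and its normal approximation
# (with continuity correction) for a hypothetical 80% free-throw shooter.
p = 0.8
for n in (3, 20):                       # very few attempts vs. many attempts
    k = int(n * p)
    exact = binom.cdf(k, n, p)
    approx = norm.cdf(k + 0.5, loc=n * p, scale=(n * p * (1 - p)) ** 0.5)
    print(f"n={n:2d}: binomial CDF = {exact:.3f}, normal approx = {approx:.3f}")
```

The discrepancy shrinks as the number of attempts grows, which is the point made above.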

After giving a well-reasoned answer to the basketball problem, I learn that the problem actually has to do with likeness scores, and basketball was an analogy. Anyway, given that you expect the likeness scores to be related to the recent past scores, I would suggest that an ARIMA model might be appropriate. You might want to include covariates as well, if you have some. I think the time series model could be selected to meet your requirements and predict future scores. The Gaussian error assumption may or may not be good; I am not sure. But if a Gaussian approximation to the residuals works reasonably well, the prediction intervals that can be generated from the model could be useful to you.
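
One way to gauge whether that Gaussian residual assumption is defensible is to fit the model and test the residuals for normality. A rough sketch (Python with statsmodels and scipy, on an invented score series and an assumed ARIMA order):

```python
import numpy as np
from scipy.stats import shapiro
from statsmodels.tsa.arima.model import ARIMA

# Invented likeness-score series; the ARIMA(1, 1, 0) order is only an example.
scores = np.array([80, 78, 82, 79, 81, 80, 83, 77, 80, 79, 81, 52, 48, 50, 51],
                  dtype=float)

res = ARIMA(scores, order=(1, 1, 0)).fit()
stat, p_value = shapiro(res.resid[1:])   # skip the first residual (lost to differencing)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")  # very small p-value casts doubt on Gaussian errors
```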

  • If the sample ACF is positive and slowly declining, it indicates nonstationarity. When the correlations go to 0 at something like a geometric rate, the time series can be considered stationary. This is a practical rule of thumb; there is no formal way to actually test for stationarity without some assumed model form. (2012-09-07)
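
For completeness, the sample ACF referred to in the comment can be computed directly; a short sketch (Python with statsmodels, on the same invented series used above):

```python
import numpy as np
from statsmodels.tsa.stattools import acf

# Sample autocorrelations of the invented series; slowly declining positive
# values would hint at nonstationarity, per the comment's rule of thumb.
scores = np.array([80, 78, 82, 79, 81, 80, 83, 77, 80, 79, 81, 52, 48, 50, 51],
                  dtype=float)
print(acf(scores, nlags=5))
```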