3

First-time poster in the math section (I have a few posts in the stats section), and I am looking for clarification on a question about variables. I enjoy sports and like to put a mathematical answer to a problem where I can (I don't have a great background in the subject), so that I can form an informed opinion on an event.

My current interest is to try and apply linear weights through regression analysis to goals scored in the English Premiership (and to then be able to apply the weights to making projections based on current data using Monte Carlo style analysis).

I have pulled three complete years of data from ESPN (to keep the data source consistent) and broken them down. I am treating goals scored as the dependent variable and all the other data I collected as independent (e.g. shots, shots on goal, etc.). After breaking the data down, I found that I get my best $R^2$ value (in excess of 0.9) when I add a shooting percentage variable (goals scored / shots on goal × 100, which gives, say, 36 instead of 0.36).

My question is: can I treat shooting% as an independent variable, given that it is essentially a function of goals scored and shots on goal? Ideally I would like to treat it as independent, as it gives me a cleaner result from the data I can obtain and I feel it reflects the accuracy of the person shooting, but my gut feeling is that it depends on the other two variables. I would be grateful for an opinion on this so that I can go away and try to obtain more data if my logic isn't sound.

Many thanks,

  • 0
    Yes, that would be correct, and I didn't think of that. For example, with Wayne Rooney of Manchester United over the last couple of seasons in which he has played in excess of 30 games, only 40-50% of his total shots have been on target, and of the ones on target only 38-44% have resulted in goals. Of the type of data that was easily available, I couldn't see any other way of trying to derive linear weights for goals scored (but I had concerns about the variables being independent). I take it the variables do have to be independent of the dependent variable to derive linear weights? 2012-07-29

2 Answers

1

What you are calling dependent and independent variables are better referred to as your target and predictor variables. This makes clear what the relationship is supposed to be between them - you use the predictors to predict the target. The words dependent and independent have particular meanings in mathematics (as alluded to in Henning's comment) and it introduces unnecessary confusion to overload them too much.

Now, if you want to make forecasts then clearly you can't use the percentage of shots on target as an indicator, because you don't know the percentage of shots on target until the game is over! You might consider including a particular team's past history of shots on target as a regressor (e.g. their percentage of shots on target in the last twenty games they played) but you can't use the value from the match whose score you are trying to predict.
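A minimal sketch of such a history-based regressor, assuming the matches for one team are available in chronological order as pairs of counts. The function name `lagged_pct`, the window size, and the numbers are made up for illustration; the same pattern works for shots on target out of total shots or for goals out of shots on goal.

```python
from collections import deque


def lagged_pct(matches, window=20):
    """For each match, return the percentage successes/attempts over the
    previous `window` matches only (None when there is no usable history)."""
    history = deque(maxlen=window)  # (successes, attempts) of earlier matches
    features = []
    for successes, attempts in matches:
        past_successes = sum(s for s, _ in history)
        past_attempts = sum(a for _, a in history)
        # Computed BEFORE the current match is added, so the feature never
        # uses information from the match being predicted.
        if past_attempts:
            features.append(100.0 * past_successes / past_attempts)
        else:
            features.append(None)
        history.append((successes, attempts))
    return features


# Made-up (shots_on_target, total_shots) pairs for one team, oldest first.
matches = [(1, 4), (0, 3), (2, 5), (3, 8)]
features = lagged_pct(matches, window=2)
```

The first match has no history, so its feature is `None`; in practice such rows would be dropped or filled with a league-wide average before fitting.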

If you just want an explanatory model rather than a predictive one, then you could use the percentage of shots on target. However, this is a bit dubious, because of the exact relationship that exists between three of the variables:

$\textrm{Goals Scored} = \textrm{Attempts on Goal} \times \textrm{Percentage of Shots on Target}$

In some sense, you already understand why there is a relationship between Goals Scored and Percentage of Shots on Target - it's given by this formula! Indeed, if you included an interaction term between Attempts on Goal and Percentage of Shots on Target, then your regression would pick this out as the only significant predictor.
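To see this numerically, here is a small sketch using the question's definition of shooting % (goals scored / shots on goal × 100). The season totals are invented, and `simple_ols` is a hypothetical helper for a one-predictor least-squares fit, not part of any library.

```python
def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x.
    Returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


# Made-up season totals for five hypothetical players.
goals = [12, 5, 20, 8, 15]
shots_on_goal = [30, 25, 45, 16, 50]

# Shooting % as defined in the question: goals / shots on goal * 100.
shooting_pct = [100.0 * g / s for g, s in zip(goals, shots_on_goal)]

# The interaction term reproduces goals by construction.
interaction = [s * p / 100.0 for s, p in zip(shots_on_goal, shooting_pct)]

intercept, slope, r2 = simple_ols(interaction, goals)

# For contrast: shooting % on its own as the single predictor.
_, _, r2_pct_only = simple_ols(shooting_pct, goals)
```

Up to floating-point rounding, the fit on the interaction term has intercept 0, slope 1 and $R^2 = 1$, because the predictor is just goals scored written another way. That is why a very high $R^2$ from a model containing this variable says little about the game itself.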

For these reasons I recommend that you don't include the percentage of shots on target in your model.

  • 0
    I managed to get hold of a larger data set with more variables, so I asked a slight variation of this question here: http://stats.stackexchange.com/questions/34905/attempting-to-define-linear-weights-for-goals-scored-in-the-english-premiership - which may give you a better idea of what I am trying to do. For what it's worth, though, I would still be keen to determine (in the context of what I am trying to achieve) whether I can use shooting% as a variable. 2012-08-23
0

Why don't you make a scatterplot of two different variables at a time and see if there's any correlation? You have several variables:

  • goals scored
  • shooting percentage
  • shots
  • shots on goal

and you might consider these variables conditional on other events in the game or characteristics of players.

Your scatter plots will look something like this and then it will be your job to decide if there's a relationship (linear, quadratic, etc.)
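As a numeric companion to the scatter plots, the Pearson correlation of each variable with goals scored can be computed directly. This is only a sketch: the per-match numbers below are invented, `pearson` is a hand-written helper, and a single coefficient only measures linear association, so it does not replace looking at the plots.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.
    Assumes neither sequence is constant (non-zero variance)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Made-up per-match figures for one hypothetical team.
shots = [10, 14, 8, 17, 12, 9]
shots_on_goal = [4, 6, 3, 9, 5, 2]
goals = [1, 2, 0, 3, 2, 1]

# Shooting % = goals / shots on goal * 100 (every match here has shots on goal).
shooting_pct = [100.0 * g / s for g, s in zip(goals, shots_on_goal)]

predictors = {
    "shots": shots,
    "shots on goal": shots_on_goal,
    "shooting percentage": shooting_pct,
}

# Correlation of each predictor with goals scored.
correlations = {name: pearson(values, goals) for name, values in predictors.items()}

for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```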

  • 0
    I did something similar in Excel with the data and worked out the correlation coefficients of the variables in relation to the number of goals scored. Shots returned 0.68 and Shots on Goal returned 0.79, which you would expect to be more significant than overall total shots, whereas Shooting% only returned 0.22, which suggests it is less significant than the other two variables. 2012-08-05