Thanks for showing interest and wanting to help out.
My aim is to develop a model that predicts, as accurately as possible, whether entities in a population will cooperate or defect, expressed as a percentage of the total population. For this purpose I have 70 predictor variables, though not all of them may be significant (some certainly are). There may be a degree of multicollinearity among these variables. Other variables could potentially affect the outcome, but they are currently unknown. I have approximately 300 data points.
So far, I have used the glmfit function in MATLAB to fit a binary logistic regression model on all predictor variables.
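Concretely, this is roughly the call I used (variable names are placeholders; X is my 300-by-70 predictor matrix and y the 0/1 outcome vector):

```matlab
% Sketch of my fit: X is the 300-by-70 predictor matrix, y the 0/1 outcomes
[b, dev, stats] = glmfit(X, y, 'binomial', 'link', 'logit');
phat = glmval(b, X, 'logit');   % in-sample predicted probabilities
% dev is the model deviance; stats.p holds the per-coefficient p-values
```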
Now, my statistics expertise is limited at best (I'm sorry about that), and I struggle to choose how to proceed at this point. I would very much appreciate it if you could help me with the following questions in MATLAB:
- How do I best assess the accuracy of the model?
- Would it be better to reduce the number of predictor variables to improve the accuracy of the model? If so, how should I best do this?
- How do I check whether multicollinearity is a problem? If it is, what actions should I take to improve the model?
- What outputs/plots should I produce to demonstrate the above?
- Finally, is there a better way of doing things?
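For the multicollinearity question, I wondered whether computing variance inflation factors (VIFs) would be an appropriate check. This is a sketch of what I had in mind, with X again being my predictor matrix (though I'm not sure it is the right approach):

```matlab
% Hypothetical check: VIFs from the inverse of the predictor correlation matrix
R = corrcoef(X);          % correlation matrix of the columns of X
vif = diag(inv(R));       % VIF for each predictor
highVIF = find(vif > 10)  % rule of thumb: VIF > 10 suggests strong collinearity
```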
I would very much appreciate your help. Sorry if some of this seems basic - I assure you I have read up on this, but I find myself unable to make an informed decision as to how I should proceed to obtain optimum results.
Thank you very much for your time.
EDIT: For example, would it be a good idea to look at the individual p-values for all the predictors, eliminate all those that fail a chosen significance level (say 0.05), reconstruct the model with the predictors that pass the test, and then see whether a lower deviance (D) is obtained? How would I be able to judge whether the model is suitable, even if the deviance is lower? Is there a better way of doing this? I just don't understand the maths behind these statistics well enough to choose an effective strategy.
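To make the idea concrete, this is a sketch of the pruning I have in mind (X and y as before; I am not sure the likelihood-ratio comparison at the end is the right way to judge the reduced model):

```matlab
% Sketch of the pruning idea
[b, devFull, stats] = glmfit(X, y, 'binomial');
keep = stats.p(2:end) < 0.05;               % stats.p(1) is the intercept
[bRed, devRed] = glmfit(X(:, keep), y, 'binomial');
% Likelihood-ratio test: is the reduced model significantly worse?
pLR = 1 - chi2cdf(devRed - devFull, sum(~keep));
```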
EDIT 2: Thanks to Zhiyong Wang, I have managed to run a LASSO on my data to select predictor variables; I'm now down to 14. However, some of the p-values are still very high, and I'm not quite sure how I should continue refining my model. Please find my model diagnostics below:
Estimated Coefficients:

                 Estimate      SE           tStat      pValue
(Intercept)     -9.3957       0.45246     -20.766     8.7732e-96
x2               0.0032055    0.0043646     0.73443   0.46269
x3              -0.0095759    0.0022003    -4.352     1.349e-05
x4               0.0023242    0.00090184    2.5772    0.0099602
x5               0.0033171    0.001955      1.6968    0.089743
x7               0.0017115    0.00090373    1.8938    0.058253
x9               0.0031377    0.0013612     2.3051    0.021161
x11              0.00024809   0.0013823     0.17947   0.85757
x16              0.0014808    0.0021081     0.70244   0.48241
x22             -0.0017803    0.0014742    -1.2077    0.22718
x26             -0.0025935    0.0045821    -0.56601   0.586
x35             -0.0077807    0.014286     -0.54464   0.586
x37             -0.0086488    0.0079046    -1.0942    0.27389
x45             -0.0038264    0.0019328    -1.9797    0.047742
x52              0.00032738   0.0043498     0.075264  0.94

126 observations, 111 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 319, p-value = 1.51e-59
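For reference, this is roughly how I did the lasso step and the refit (a sketch; variable names are placeholders and I used 10-fold cross-validation):

```matlab
% Cross-validated lasso for logistic regression, then refit on the survivors
[B, FitInfo] = lassoglm(X, y, 'binomial', 'CV', 10);
idx = FitInfo.Index1SE;                  % lambda within one SE of minimum deviance
selected = find(B(:, idx) ~= 0);         % the predictors that survive (14 for me)
mdl = fitglm(X(:, selected), y, 'Distribution', 'binomial')  % refit and inspect
```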
How do I best proceed from there? Thank you very much.