
Thanks for showing interest and wanting to help out.

My aim is to develop a model that predicts, as accurately as possible, how entities in a population will either cooperate or defect, as a % of the total population. For this purpose I have 70 predictor variables; not all of them may be significant (though some are). There may be some multicollinearity among these variables. Other variables could also affect the outcome, but they are currently unknown. I have approximately 300 data points.

So far, I have used the glmfit function in Matlab to create a binary logistic regression model for all predictor variables.

Now, my statistics expertise is limited at best (I'm sorry about that), and I am struggling to decide how to proceed from here. I would very much appreciate it if you could help me with the following questions in Matlab:

  1. How do I best assess the accuracy of the model?
  2. Would it be better to reduce the number of predictor variables to improve the accuracy of the model? If so, how should I best do this?
  3. How do I check whether multicollinearity is significant? If so, what actions should I take to improve the model?
  4. What outputs/plots should I produce to demonstrate the above?
  5. Finally, is there a better way of doing things?

I would very much appreciate your help. Sorry if some of this seems basic - I assure you I have read up on this, but I find myself unable to make an informed decision as to how I should proceed to obtain optimum results.

Thank you very much for your time.

EDIT: For example, would it be a good idea to look at the individual p-values for all the predictors, eliminate those that fail a chosen significance level (say 0.05), refit the model with the predictors that pass the test, and then see whether a better deviance (D) is obtained? How would I be able to judge whether the model is suitable, even if the deviance is better? Is there a better way of doing this? I just don't understand the maths behind these statistics well enough to choose an effective strategy.

EDIT 2: Thanks to Zhiyong Wang, I have managed to do a LASSO on my data to discriminate predictor variables... I'm now down to 14. However, some of the p-values are still very high, and I'm not quite sure how I should continue to process my model. Please find below my diagnosis:

Estimated Coefficients:
                   Estimate      SE            tStat       pValue
    (Intercept)       -9.3957       0.45246     -20.766    8.7732e-96
    x2              0.0032055     0.0043646     0.73443       0.46269
    x3             -0.0095759     0.0022003      -4.352     1.349e-05
    x4              0.0023242    0.00090184      2.5772     0.0099602
    x5              0.0033171      0.001955      1.6968      0.089743
    x7              0.0017115    0.00090373      1.8938      0.058253
    x9              0.0031377     0.0013612      2.3051      0.021161
    x11            0.00024809     0.0013823     0.17947       0.85757
    x16             0.0014808     0.0021081     0.70244       0.48241
    x22            -0.0017803     0.0014742     -1.2077       0.22718
    x26            -0.0025935     0.0045821    -0.56601       0.57139
    x35            -0.0077807      0.014286    -0.54464         0.586
    x37            -0.0086488     0.0079046     -1.0942       0.27389
    x45            -0.0038264     0.0019328     -1.9797      0.047742
    x52            0.00032738     0.0043498    0.075264          0.94


126 observations, 111 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 319, p-value = 1.51e-59

How do I best proceed from there? Thank you very much.

2 Answers

2
1. How do I best assess the accuracy of the model?

Besides hypothesis-test quantities such as p-values, you can compute the precision and recall (see Wikipedia) if your response variable is categorical.
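As a minimal sketch of the precision/recall computation (in Python rather than Matlab; the labels and predicted probabilities are made up, and `precision_recall` is an illustrative helper, not a library function):

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical data: 1 = cooperate, 0 = defect.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# Made-up fitted probabilities from a logistic model.
probs = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]
# Classify as 1 when the fitted probability is at least 0.5 (an assumed cut-off).
y_pred = [1 if p >= 0.5 else 0 for p in probs]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Both numbers depend on the chosen cut-off, so they are best reported together with it.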

2. Would it be better to reduce the number of predictor variables to improve the accuracy of the model? If so, how should I best do this?

You can improve the accuracy as well as reduce the number of predictors by adding the L1 norm of the regression weights to the objective function. This method is called LASSO. It introduces an extra parameter that you need to tune to balance the sparsity of the model (the number of predictor variables it keeps) against its accuracy.
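To illustrate how the L1 penalty trades sparsity against fit, here is a rough Python sketch of L1-penalised logistic regression solved by proximal gradient descent. It is not the Matlab routine used in the question; the data, the step size, and the two values of the tuning parameter `lam` are assumptions chosen for illustration only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    """Proximal operator of t*|w|: shrink w towards zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def lasso_logistic(X, y, lam, step=0.1, iters=2000):
    """Minimise mean logistic loss + lam * sum(|w_j|); the intercept b is not penalised."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(iters):
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        grad_b = sum(resid) / n
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b -= step * grad_b
        # Gradient step on the smooth loss, then shrinkage from the L1 penalty.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad_w)]
    return b, w


# Made-up data: column 0 is related to y, column 1 is pure noise.
X = [[-2.0, 1.0], [-2.0, -1.0], [-1.0, 1.0], [-1.0, -1.0], [0.0, 1.0],
     [0.0, -1.0], [1.0, 1.0], [1.0, -1.0], [2.0, 1.0], [2.0, -1.0]]
y = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1]

b_light, w_light = lasso_logistic(X, y, lam=0.05)  # light penalty
b_heavy, w_heavy = lasso_logistic(X, y, lam=1.0)   # heavy penalty
print(w_light, w_heavy)
```

With the light penalty only the informative column keeps a nonzero weight; with the heavy penalty every weight is shrunk to zero. In practice `lam` would be chosen by cross-validation rather than fixed by hand.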

3. How do I check whether multicollinearity is significant? If so, what actions should I take to improve the model?

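Adding interaction terms such as $x_1x_2$ to the set of predictor variables, where $x_1, x_2$ are your original predictor variables, lets the model capture joint effects, but it does not by itself tell you whether multicollinearity is present. To check for multicollinearity, look at the pairwise correlations between predictors or at variance inflation factors.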

4. What outputs/plots should I produce to demonstrate the above?

You can plot an ROC curve; see Wikipedia for details.
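As a small illustration of what goes into an ROC curve, the Python sketch below computes the curve's points and the area under it from made-up labels and scores. The helper names `roc_points` and `auc` are hypothetical, and a plotting call would be added separately.

```python
def roc_points(y_true, scores):
    """Return ROC points (false positive rate, true positive rate)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), reverse=True)  # highest score first
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = cooperate) and fitted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]

points = roc_points(y_true, scores)
area = auc(points)
print(points, area)
```

An area of 0.5 corresponds to random guessing and 1.0 to perfect ranking of cooperators above defectors.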

5. Finally, is there a better way of doing things?

I think it depends on your specific problem.

  • 0
    Thank you very much Zhiyong Wang, this really helps me get a better idea of how to proceed. I am working on the LASSO right now. I'm wondering which method I should use to estimate the deviance. What do you think? Cross-validation? If so, how many folds? Or should I just use x and y to fit the model and then estimate the deviance from there? 2012-09-22
  • 0
    You can use cross-validation, out-of-bag estimation, or leave-one-out. I usually use 5-fold cross-validation. 2012-09-23
  • 0
    Thanks, I'll keep that in mind. One last question: what ratio of data points to predictor variables do I realistically need? What's the justification for this? 2012-09-23
  • 0
    It is hard to say. It depends on two factors: the correlation between data points and the correlation between predictor variables. There may be no simple ratio you can follow. 2012-09-23
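For reference, the 5-fold cross-validation mentioned in the comments can be sketched as follows in Python. The generator `k_fold_indices` is a hypothetical helper, and the sample size of 300 is taken from the question; the model fitting inside the loop is left as a comment because it depends on the chosen method.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


splits = list(k_fold_indices(300, k=5, seed=0))
for train, test in splits:
    # Fit the model on the rows in `train`, then compute the deviance
    # (or another error measure) on the held-out rows in `test`.
    print(len(train), len(test))
```

Averaging the held-out deviance over the five folds gives an estimate that is less optimistic than the deviance computed on the data used for fitting.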
0

There are several issues you need to consider before getting technical (trying to model and solve your problem).

Firstly, you don't have to use logistic regression; you could use linear regression first, where your y-values of (0,1) are recoded into (-1,+1). If this model works well then your problem is linearly separable. Alternatively, treat your two y-values as two classes (1,2) and use classification analysis. Under this approach, you could perform feature selection by applying Mann-Whitney tests, t-tests, or information-based metrics such as the Gini index or entropy to determine which predictors best predict class membership of each of the 300 objects. Knowing which features (predictors) best predict class membership, you could then use classification analysis for your 2-class problem using k-nearest neighbors, Naive Bayes, learning vector quantization, kernel regression, or more complex classifiers such as random forests (which will provide feature importance plots of your 70 variables), support vector machines, artificial neural networks (simple back-propagation with logistic activation functions on the input side, and a linear or softmax function on the output side), and CART (decision trees). Metaheuristics (genetic algorithms, covariance-matrix self-adaptation, particle swarm optimization, ant colony optimization) could also be used to solve your 2-class problem.

For logistic regression, model-building strategies commonly involve running univariate regression models for each of your predictors and filtering out predictors whose $P$-values are not less than, say, 0.25. Then build multivariable models from the predictors with univariate $P<0.25$.
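A rough Python sketch of this univariate screening step is given below. It fits a one-predictor logistic model by Newton-Raphson and uses a Wald test for the slope; the data, the predictor names, and the helper `fit_univariate_logistic` are invented for illustration, and the sketch assumes the classes are not perfectly separated by the predictor.

```python
import math


def fit_univariate_logistic(x, y, iters=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson.

    Returns (b1, two-sided Wald p-value for b1).
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error of b1 from the inverse information matrix
    # (evaluated at the last iterate before the final, negligible update).
    se_b1 = math.sqrt(h00 / det)
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return b1, p_value


# Made-up data: 30 records, one informative and one uninformative predictor.
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1] * 3
predictors = {
    "x_inf": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 3,
    "x_noise": [1, 2, 3, 1, 0, 2, 4, 3, 1, 3] * 3,
}

p_values = {name: fit_univariate_logistic(x, y)[1] for name, x in predictors.items()}
keep = [name for name, p in p_values.items() if p < 0.25]
print(p_values, keep)
```

Only the predictors in `keep` would then be carried forward into the multivariable model.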

There is also a concern that you are using 70 variables for 300 records (objects): a common rule of thumb in regression is that you should have at least ten times as many records as variables, so in your case you would nominally require $70 \times 10 = 700$ records. When the number of records is low compared with the number of predictors, you approach an overparametrized model. As the number of variables approaches the number of records $(p\rightarrow n)$, the number of near-zero eigenvalues of the information matrix (the negative Hessian of second partial derivatives of the log-likelihood w.r.t. the coefficients), whose inverse is the variance-covariance matrix, increases.

Another alternative would be to reduce the dimensions of the $p=70$ predictors down to say $p=10$ using principal components analysis. In this case, the 10 dimensions would be orthogonal (zero-correlation), so multicollinearity wouldn't be an issue.
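The idea can be sketched in plain Python with a toy data set of three predictors reduced to two components. This is only an illustration under stated assumptions: the data are made up, the helper names are hypothetical, and power iteration with deflation stands in for the eigen-decomposition or SVD routine a numerical library would normally provide.

```python
import math


def covariance_matrix(rows):
    """Return (sample covariance matrix, centered rows)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    return cov, centered


def top_component(cov, iters=500):
    """Leading eigenvector and eigenvalue by power iteration.

    Assumes the matrix has a nonzero leading eigenvalue.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    return v, eigval


def principal_components(rows, k):
    """Return (k component directions, scores of each row on them)."""
    cov, centered = covariance_matrix(rows)
    p = len(cov)
    comps = []
    for _ in range(k):
        v, lam = top_component(cov)
        comps.append(v)
        # Deflate: remove the component just found before looking for the next.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(p)] for a in range(p)]
    scores = [[sum(c[j] * v[j] for j in range(p)) for v in comps] for c in centered]
    return comps, scores


# Made-up data: the second predictor is exactly twice the first (collinear).
rows = [[1.0, 2.0, 0.0], [2.0, 4.0, 1.0], [3.0, 6.0, 0.0], [4.0, 8.0, 1.0]]
comps, scores = principal_components(rows, k=2)
cross = sum(s[0] * s[1] for s in scores)  # proportional to the score covariance
print(comps, cross)
```

The cross-product of the two score columns is numerically zero, which is the sense in which the reduced dimensions are uncorrelated; the component scores would then replace the original predictors in the regression.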

If you are going to stick with logistic regression, look into the effects of overly influential observations via Pearson and deviance residuals, leverages (diagonal elements of the hat matrix), DFFITS, and DFBETAS, and then into the effects of multicollinearity on the coefficients via variance inflation factors (VIFs), covariance ratios, and condition numbers.
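Of these diagnostics, the VIFs are the simplest to compute by hand: they are the diagonal elements of the inverse of the predictors' correlation matrix, which is equivalent to $1/(1-R_j^2)$ from regressing predictor $j$ on the others. The Python sketch below uses three made-up predictors; the helper names are hypothetical and the small Gauss-Jordan inversion assumes the correlation matrix is not singular.

```python
import math


def correlation_matrix(cols):
    """Pearson correlation matrix of a list of predictor columns."""
    n = len(cols[0])
    z = []
    for c in cols:
        m = sum(c) / n
        dev = [v - m for v in c]
        norm = math.sqrt(sum(d * d for d in dev))
        z.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in z] for zi in z]


def invert(mat):
    """Gauss-Jordan inversion with partial pivoting (non-singular input assumed)."""
    n = len(mat)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(cols):
    """Variance inflation factor of each predictor column."""
    inv = invert(correlation_matrix(cols))
    return [inv[j][j] for j in range(len(cols))]


# Made-up predictors: x1 and x2 are correlated (r = 0.8), x3 is unrelated to both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
x3 = [2.0, 0.0, 0.0, 2.0, 1.0]
vif_values = vifs([x1, x2, x3])
print(vif_values)
```

A VIF of 1 means a predictor is uncorrelated with the others; how large a value counts as a problem is a matter of convention, with cut-offs around 5 to 10 often quoted.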

Overall, the issues are:

  1. Do you have to use logistic regression?

  2. You have a large number of predictors compared with records.

  3. Can you reduce the 70 dimensions down to 10?

If not, use regression diagnostics to identify (a) overly influential records and (b) variables with large multicollinearity effects.