
I have the following problem: I'm performing a multivariate logistic regression on several variables, each of which has a nominal scale. I want to avoid multicollinearity in my regression. If the variables were continuous, I could compute the variance inflation factor (VIF) and look for variables with a high VIF. If the variables were ordinally scaled, I could compute Spearman's rank correlation coefficients for pairs of variables and compare the computed values with a certain threshold. But what do I do if the variables are just nominally scaled? One idea would be to perform a pairwise chi-square test for independence, but the variables don't all have the same codomains, so that would be another problem. Is there a way to solve this?

Thanks in advance!

2 Answers


Strictly speaking, multicollinearity is a concept unique to real-valued covariates, and a specialized indicator of dependence between them. You have the right idea in that it's important to avoid covariates that are too "mutually dependent", even when they have categorical domains. In fact, you can use the idea of multicollinearity to motivate a statistic that measures such dependence among categorical covariates. I'll try to do this through an example.

Suppose you have two covariates, $X_1$ and $X_2$, which are both continuous. The VIF is equal to $1/(1-R^2)$, where $R^2$ is the squared correlation between $X_1$ and $X_2$. Thus, multicollinearity simply depends on how well $X_1$ predicts $X_2$, as assessed by $R^2$.
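To make this concrete, here is a small illustration (in Python, with simulated data of my own; nothing here is specific to the original question) of computing the VIF from the squared correlation between two continuous covariates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated continuous covariates (simulated).
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=500)

# Squared correlation (R^2) from regressing x2 on x1.
r2 = np.corrcoef(x1, x2)[0, 1] ** 2

# Variance inflation factor: VIF = 1 / (1 - R^2).
vif = 1.0 / (1.0 - r2)
print(r2, vif)
```

With these coefficients the population $R^2$ is $0.64$, so the VIF comes out near $1/(1-0.64) \approx 2.8$; as $R^2 \to 1$ the VIF blows up, which is exactly the multicollinearity diagnostic.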

Now, suppose $X_1$ and $X_2$ are categorical. You can still use $X_1$ to predict $X_2$ (e.g. through a logistic regression), but you wouldn't assess the quality of the prediction using the usual $R^2$ in this case. There are many different pseudo-$R^2$ formulas you could consider; I think one useful approach is to base it on entropy reduction. The marginal entropy $H(X_2)$ of $X_2$ can easily be estimated, and the conditional entropy $H(X_2|X_1)$ nearly as easily. You can then define a pseudo-$R^2$ with the formula $$R^2_{pseudo} = 1 - \dfrac {\hat{H}(X_2|X_1)} {\hat{H}(X_2)}$$ (the denominator is the marginal entropy of $X_2$, the variable being predicted). Large values of $R^2_{pseudo}$ imply strong dependence among the covariates, and small values indicate otherwise.
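A minimal sketch of this estimator in Python (the function names and simulated data are mine, not part of the answer): estimate the marginal and conditional entropies from empirical frequencies, then plug them into the formula. Note the denominator is the marginal entropy of the predicted variable, $X_2$.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical (plug-in) entropy of a categorical sample, in nats."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropy(x2, x1):
    """Estimate H(X2 | X1) = sum_a P(X1 = a) * H(X2 | X1 = a)."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    h = 0.0
    for a in np.unique(x1):
        mask = x1 == a
        h += mask.mean() * entropy(x2[mask])
    return h

def pseudo_r2(x2, x1):
    """Entropy-based pseudo-R^2: 1 - H(X2 | X1) / H(X2)."""
    return 1.0 - conditional_entropy(x2, x1) / entropy(x2)

# Simulated example: X2 copies X1's label 70% of the time,
# otherwise takes a uniformly random label.
rng = np.random.default_rng(1)
x1 = rng.choice(["a", "b", "c"], size=1000)
noise = rng.choice(["a", "b", "c"], size=1000)
x2 = np.where(rng.random(1000) < 0.7, x1, noise)

print(pseudo_r2(x2, x1))  # well above 0: clear dependence
```

If $X_1$ determines $X_2$ exactly, the conditional entropy is zero and the statistic is 1; for independent covariates it is near 0 (slightly above, due to plug-in bias).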


The vif function in the car package for R implements generalized variance inflation factors; the documentation references [1]. The function can be used with a GLM: I've just tried it on a logistic regression with multiple categorical independent variables, and it spits out numbers.

> library(car)
> m = glm(am~cyl+vs+carb+hp, data=mtcars, family=binomial)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
> car::vif(m)
         cyl           vs         carb           hp 
2.017467e+07 2.017465e+07 5.926166e+00 1.639876e+01

[1] Fox, J. and Monette, G. (1992) Generalized collinearity diagnostics. JASA, 87, 178–183.