Understanding Multicollinearity and Confounding Variables in Regression

Multicollinearity

When two or more of the predictors are strongly correlated with each other, the phenomenon is called multicollinearity. It makes the estimated coefficients unstable and masks the underlying individual weights of the correlated variables. This is one reason why model weights should not be read directly as feature importance.
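To make the instability concrete, here is a small simulation sketch. It is not from a real dataset: the data is synthetic, the true weights are both set to 1.0 by assumption, and the helper name coef_spread is made up for illustration. It refits a linear model on many random samples and compares how much the fitted weight on the first predictor moves around when the two predictors are uncorrelated versus strongly correlated.

```python
import numpy as np

def coef_spread(rho, n=200, trials=500, seed=0):
    """Std. dev. of the fitted weight on x1 across repeated samples,
    when x1 and x2 have correlation rho. True weights are both 1.0."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    weights = []
    for _ in range(trials):
        # draw two predictors with the requested correlation
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = X[:, 0] + X[:, 1] + rng.normal(0.0, 1.0, size=n)
        # ordinary least squares with an intercept column
        A = np.column_stack([np.ones(n), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        weights.append(beta[1])  # fitted weight on x1
    return float(np.std(weights))

spread_uncorrelated = coef_spread(rho=0.0)
spread_correlated = coef_spread(rho=0.95)
print(f"std of x1 weight, rho=0.00: {spread_uncorrelated:.3f}")
print(f"std of x1 weight, rho=0.95: {spread_correlated:.3f}")
```

The spread should come out noticeably larger in the correlated case, even though the true weight is the same in both. Any single fit can then land far from that true weight, which is why the individual weights are hard to interpret.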

Ways to deal with multicollinearity

Looking at the Variance Inflation Factor (VIF), which measures how much the variance of an estimated coefficient is inflated when multicollinearity exists

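import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# the independent variables from the dataframe (df is assumed to exist),
# plus an intercept column; without it the VIF values can be misleading
X = add_constant(df[['col1', 'col2', 'col3']])

# VIF dataframe, one row per column of X
vif_df = pd.DataFrame()
vif_df["feature"] = X.columns

# calculating VIF for each column (the 'const' row can be ignored)
vif_df["vif"] = [variance_inflation_factor(X.values, i)
                 for i in range(len(X.columns))]

print(vif_df)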
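To see what the number means, the VIF of predictor j can be written as 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on all the other predictors. The sketch below computes this by hand with NumPy on synthetic data. The function name vif and the columns a, b, c are illustrative assumptions, not part of statsmodels.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (2-D array, one column per predictor)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # regress column j on the remaining columns, with an intercept
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)  # nearly a copy of a
c = rng.normal(size=500)            # unrelated to a and b
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

Here a and b should get large values and c a value close to 1. A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, though the exact cutoff is a judgment call.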

Removing correlated variables
Using PCA
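Both options can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data. The 0.9 threshold and the helper drop_correlated are assumptions chosen for the example, and in practice the features are usually standardized before PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + 0.1 * rng.normal(size=300)  # nearly duplicates a
c = rng.normal(size=300)
X = np.column_stack([a, b, c])

def drop_correlated(X, threshold=0.9):
    """Return indices of columns to keep: a column is dropped if its
    absolute correlation with an already kept column reaches threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

# Option 1: remove one variable from each highly correlated pair
kept = drop_correlated(X)
print("columns kept:", kept)

# Option 2: PCA via SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                    # principal component scores
explained = S ** 2 / np.sum(S ** 2)   # share of variance per component
score_corr = np.corrcoef(scores, rowvar=False)
print("explained variance ratio:", explained)
```

The principal component scores are uncorrelated with each other by construction, so they can be used as predictors without multicollinearity. The trade-off is that each component is a mix of the original variables, so its weight is harder to interpret.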

Confounding Variables

A confounding variable is a variable that affects both the dependent variable and an independent variable. Like multicollinearity, it involves correlated variables, but here the shared cause can create a spurious correlation between the two. For example:

Higher consumption of ice cream -> Higher likelihood of sunburn

The conclusion above is incorrect: eating ice cream does not cause sunburn. Higher temperatures affect both variables, leading to higher ice cream consumption and, separately, to a higher likelihood of sunburn.
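The ice cream example can be reproduced with a small simulation. Everything here is made up for illustration: the variable names, the effect sizes (0.5 and 0.3) and the sample size are assumptions. By construction, ice cream consumption has no effect on sunburn, and temperature drives both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
temperature = rng.normal(25.0, 5.0, size=n)                  # the confounder
ice_cream = 0.5 * temperature + rng.normal(0.0, 1.0, size=n)
sunburn = 0.3 * temperature + rng.normal(0.0, 1.0, size=n)   # no ice cream term

def ols(columns, y):
    """Least-squares fit with an intercept; returns [intercept, weights...]."""
    A = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# regress sunburn on ice cream alone, then with temperature included
naive = ols([ice_cream], sunburn)[1]
adjusted = ols([ice_cream, temperature], sunburn)[1]
print(f"ice cream weight without temperature: {naive:.3f}")
print(f"ice cream weight with temperature:    {adjusted:.3f}")
```

Without temperature in the model, ice cream gets a clearly positive weight. Once temperature is included, that weight should shrink toward zero, which matches how the data was generated.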

Common reasons for confounding variables to occur

Selection bias – the data is biased by the way it was collected, e.g. class imbalance
Omitted variable bias – important variables are left out of the model, resulting in coefficient estimates that are biased and inconsistent

Ways to deal with confounding variables

Stratification – split or balance the dataset so that the confounding variables do not vary much within each group
Chi-square test of independence – determines whether there is a statistically significant relationship between two categorical variables
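The two ideas can be combined: run the chi-square test on the whole dataset, then again inside each stratum of the suspected confounder. The sketch below uses synthetic yes/no data with assumed probabilities, and a hand-written helper chi_square_2x2 that only handles 2x2 tables (one degree of freedom) and applies no continuity correction. For general tables, a library routine such as scipy.stats.chi2_contingency is the usual choice.

```python
import math
import numpy as np

def chi_square_2x2(x, y):
    """Chi-square test of independence for two boolean arrays.
    Returns (statistic, p_value). The p-value formula is only valid
    for a 2x2 table, i.e. 1 degree of freedom."""
    table = np.array([[np.sum(x & y), np.sum(x & ~y)],
                      [np.sum(~x & y), np.sum(~x & ~y)]], dtype=float)
    # expected counts if x and y were independent
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    stat = float(np.sum((table - expected) ** 2 / expected))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

rng = np.random.default_rng(0)
n = 4000
hot = rng.random(n) < 0.5                            # confounder: hot day or not
ice_cream = rng.random(n) < np.where(hot, 0.8, 0.2)  # depends only on hot
sunburn = rng.random(n) < np.where(hot, 0.6, 0.1)    # depends only on hot

overall_stat, overall_p = chi_square_2x2(ice_cream, sunburn)
print(f"overall: stat={overall_stat:.1f}, p={overall_p:.3g}")

# stratify: repeat the test within hot days and within cold days
strata = {}
for label, mask in (("hot", hot), ("cold", ~hot)):
    strata[label] = chi_square_2x2(ice_cream[mask], sunburn[mask])
    stat, p = strata[label]
    print(f"{label}: stat={stat:.1f}, p={p:.3g}")
```

On the full dataset the test should report a strong association between ice cream and sunburn. Within each temperature stratum the statistic should be far smaller, which suggests the overall relationship comes from the confounder and not from ice cream itself.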

Image source: Unsplash
