When two or more predictors are correlated with each other, the phenomenon is called multicollinearity. It affects the fitted coefficients by masking the individual weights of the correlated variables: the model can split their shared effect between them in many nearly equivalent ways. This is one reason why model weights should not be read directly as feature importance.
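To make the masking effect concrete, here is a minimal sketch on synthetic data. The true weights (3 and 1), the noise levels, and the helper `fit` are assumptions chosen for illustration only. It refits an ordinary least squares model on bootstrap resamples and compares how much the individual weights move against how much their sum moves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # x2 is almost a copy of x1
# Assumed true weights for this illustration: 3 for x1 and 1 for x2
y = 3.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])


def fit(X, y):
    """Ordinary least squares with an intercept; returns only the slopes."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]


# Refit on bootstrap resamples to see how stable the weights are
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    coefs.append(fit(X[idx], y[idx]))
coefs = np.array(coefs)

print("std of individual weights:", coefs.std(axis=0))
print("std of their sum:         ", coefs.sum(axis=1).std())
print("mean of their sum:        ", coefs.sum(axis=1).mean())
```

In a run like this the individual weights should vary widely from resample to resample while their sum stays close to 4, which suggests the data pins down the combined effect but not how it is divided between the two correlated predictors.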
Ways to deal with multicollinearity
- Looking at the Variance Inflation Factor (VIF), which measures how much the variance of an estimated coefficient is inflated when multicollinearity exists
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# the independent variables from the dataframe (df is assumed to be loaded already)
X = df[['col1', 'col2', 'col3']]

# VIF dataframe, one row per feature
vif_df = pd.DataFrame()
vif_df["feature"] = X.columns

# calculating VIF for each feature
vif_df["vif"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_df)
```
- Removing correlated variables
- Using PCA
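The last two options can be sketched with NumPy alone. This is a rough illustration under assumptions, not a recommended recipe: the synthetic features `a`, `b`, `c` and the 0.9 correlation cutoff are made up, and PCA is computed here through an SVD of the standardized data rather than through a dedicated library class.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)  # strongly correlated with a
c = rng.normal(size=n)                 # independent of a and b
X = np.column_stack([a, b, c])
names = ["a", "b", "c"]

# Option 1: keep a feature only if it is not highly correlated
# with any feature that has already been kept
corr = np.abs(np.corrcoef(X, rowvar=False))
threshold = 0.9  # assumed cutoff, tune for your data
keep = []
for j in range(X.shape[1]):
    if all(corr[j, k] < threshold for k in keep):
        keep.append(j)
print("kept features:", [names[j] for j in keep])

# Option 2: PCA via SVD of the standardized data
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                    # principal component scores
explained = S**2 / np.sum(S**2)      # share of variance per component
off_diag = np.abs(np.corrcoef(scores, rowvar=False) - np.eye(X.shape[1])).max()

print("explained variance ratio:", explained.round(3))
print("largest correlation between components:", off_diag)
```

Dropping a feature keeps the remaining columns interpretable, whereas the principal components are uncorrelated by construction but are mixtures of the original features, so their weights no longer map to a single input variable.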
Confounding can be viewed as an extreme case of multicollinearity, where a third variable (the confounding variable) affects both the dependent variable and an independent variable. This can produce misleading correlations. For example:
Higher consumption of ice cream -> Higher likelihood of sunburn
The conclusion above is misleading. What likely affects both variables is temperature: higher temperatures lead to higher consumption of ice cream and, separately, to a higher likelihood of sunburn.
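A small simulation can illustrate the ice cream example. Everything here is assumed for the sake of the sketch: the effect sizes, the noise levels, and the helper `residuals`. The adjustment used is a simple regression adjustment (a partial correlation), in which temperature is regressed out of both variables before they are compared.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Temperature drives both variables; ice cream has no direct effect on sunburn
temperature = rng.normal(loc=25, scale=5, size=n)
ice_cream = 2.0 * temperature + rng.normal(scale=5, size=n)
sunburn = 0.5 * temperature + rng.normal(scale=2, size=n)


def residuals(y, x):
    """Part of y that a straight-line fit on x does not explain."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


raw_corr = np.corrcoef(ice_cream, sunburn)[0, 1]
partial_corr = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(sunburn, temperature),
)[0, 1]

print(f"raw correlation:                  {raw_corr:.3f}")
print(f"correlation given temperature:    {partial_corr:.3f}")
```

The raw correlation should come out clearly positive, while the correlation after removing the temperature effect should sit near zero, which is what one would expect if temperature is the only link between the two.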
Common reasons for confounding variables to occur
- Selection bias – data biased due to the way it was collected, e.g. class imbalance
- Omitted variable bias – when important variables are omitted, resulting in a regression model that is biased and inconsistent
Ways to deal with confounding variables
- Stratification – split or balance the dataset in such a way that the confounding variables do not vary much within each group
- Chi square test of independence – this determines whether there is a statistically significant relationship between two categorical variables
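The two approaches can be combined in one sketch: run a chi-square test on the pooled data, then again within each stratum of the confounder. The data are simulated and all probabilities are assumptions. The helper `chi_square_2x2` is a hand-written Pearson test for 2x2 tables without continuity correction; with one degree of freedom its p-value can be written with `math.erfc`. In practice a library routine such as `scipy.stats.chi2_contingency` is the more usual choice (note that it applies a continuity correction to 2x2 tables by default).

```python
import math
import numpy as np


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    stat = ((table - expected) ** 2 / expected).sum()
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def contingency(a, b):
    """2x2 table of counts for two boolean arrays."""
    return np.array([[np.sum(a & b), np.sum(a & ~b)],
                     [np.sum(~a & b), np.sum(~a & ~b)]])


rng = np.random.default_rng(7)
n = 4000
hot_day = rng.random(n) < 0.5                            # the confounder
ice_cream = rng.random(n) < np.where(hot_day, 0.8, 0.2)  # depends only on hot_day
sunburn = rng.random(n) < np.where(hot_day, 0.6, 0.1)    # depends only on hot_day

# Pooled data: ice cream and sunburn look strongly related
stat_all, p_all = chi_square_2x2(contingency(ice_cream, sunburn))
print(f"pooled:    chi2={stat_all:.1f}, p={p_all:.3g}")

# Stratified by the confounder: test again within each group
strata = {}
for label, mask in [("hot days", hot_day), ("cool days", ~hot_day)]:
    strata[label] = chi_square_2x2(contingency(ice_cream[mask], sunburn[mask]))
    stat, p = strata[label]
    print(f"{label:10s} chi2={stat:.2f}, p={p:.3f}")
```

The pooled test should report a very large statistic, while the statistics within each stratum should be small by comparison, since the simulated variables are independent once the confounder is held fixed.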