    Confounding and collinearity

    Introduction

    In this blog article, I will discuss the bias introduced in the estimated coefficient of an explanatory variable when confounding factors are present. After that, I will demonstrate the effect of collinearity among variables on coefficient estimation.

    Underlying structure

    Let us assume that we know the underlying relationships between the explanatory variable (E), the outcome variable (O), a confounder (C), a variable correlated with E but with no effect on O (VE), and a variable that affects O but is uncorrelated with E (VO).

    Let us assume that the real underlying relationship between the variables is as given below:

    O = 2*E + 2*C + 2*VO + 0*VE + error

    E, C and VE are pairwise correlated, each pair with a correlation coefficient of 0.8 (covariance 4, variance 5).

    E, C and VE are normally distributed with mean 0 and variance 5; VO is normally distributed with mean 0 and sd 5. The error term follows a standard normal distribution.
    

    The directed acyclic graph below depicts the relationships between the variables.

    [Figure: directed acyclic graph of the underlying structure]
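
    A rough sketch of this structure can also be drawn in R with the dagitty package; the bidirected edges below merely stand in for the correlations among E, C and VE.

    library(dagitty)
    dag <- dagitty("dag {
      E -> O
      C -> O
      VO -> O
      E <-> C
      E <-> VE
      C <-> VE
    }")
    plot(graphLayout(dag))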

    Samples for simulation

    library(MASS)
    library(dplyr)
    
    
    set.seed(100)
    # E, C and VE: multivariate normal, mean 0, variance 5, covariance 4
    # (pairwise correlation 4/5 = 0.8)
    sam <- mvrnorm(n = 10000, mu = c(0, 0, 0), Sigma = matrix(c(5, 4, 4, 4, 5, 4, 
        4, 4, 5), 3, 3, byrow = T))
    df <- data.frame(E = sam[, 1], C = sam[, 2], VE = sam[, 3])
    # VO: independent normal with sd 5; O is generated from the true model
    df <- mutate(df, VO = rnorm(10000, 0, 5), O = 2 * E + 2 * C + 2 * VO + rnorm(10000))
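
    As a quick sanity check, the empirical correlations should be close to 0.8 among E, C and VE, and close to 0 between VO and the rest:

    # pairwise correlations of the simulated covariates
    round(cor(df[, c("E", "C", "VE", "VO")]), 2)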
    

    Scenarios

    Scenario 1 (Only E as covariate, typical univariate analysis)

    mod.E <- lm(O ~ E, data = df)
    summary(mod.E)
    
    ## 
    ## Call:
    ## lm(formula = O ~ E, data = df)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -40.47  -7.05  -0.05   7.01  45.04 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   0.0069     0.1034    0.07     0.95    
    ## E             3.6420     0.0465   78.40   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 10.3 on 9998 degrees of freedom
    ## Multiple R-squared:  0.381,  Adjusted R-squared:  0.381 
    ## F-statistic: 6.15e+03 on 1 and 9998 DF,  p-value: <2e-16
    
    confint(mod.E)
    
    ##               2.5 % 97.5 %
    ## (Intercept) -0.1959 0.2097
    ## E            3.5509 3.7330
    

    The model overestimates the coefficient of E: the estimate (3.64) is approximately the true coefficient of E plus the true coefficient of C times the regression slope of C on E (2 + 2*0.8 = 3.6), rather than the true value of 2. This is bias due to confounding, and it occurs when confounders are not included as covariates.
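
    The size of this overestimate follows from the omitted-variable-bias formula: the expected estimate equals the true coefficient of E plus, for each omitted variable, its true coefficient times the slope from regressing that variable on E. A quick numerical check:

    # predicted coefficient of E when C and VO are omitted from the model
    2 + 2 * cov(df$E, df$C) / var(df$E) + 2 * cov(df$E, df$VO) / var(df$E)
    # roughly 3.6, in line with the 3.64 estimated above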

    Scenario 2 (E and C as covariates)

    This model omits one of the true covariates, VO, and so is still misspecified.

    mod.EC <- lm(O ~ E + C, data = df)
    summary(mod.EC)
    
    ## 
    ## Call:
    ## lm(formula = O ~ E + C, data = df)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -35.99  -6.75  -0.11   6.71  43.32 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  -0.0194     0.0998   -0.19     0.85    
    ## E             2.0293     0.0741   27.40   <2e-16 ***
    ## C             2.0231     0.0740   27.35   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 9.98 on 9997 degrees of freedom
    ## Multiple R-squared:  0.424,  Adjusted R-squared:  0.424 
    ## F-statistic: 3.68e+03 on 2 and 9997 DF,  p-value: <2e-16
    
    confint(mod.EC)
    
    ##              2.5 % 97.5 %
    ## (Intercept) -0.215 0.1762
    ## E            1.884 2.1745
    ## C            1.878 2.1681
    

    With C added, the estimate of E is unbiased. Excluding VO has not influenced the coefficient of E, because VO is uncorrelated with E and C; its omission only inflates the residual variance.
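
    The same omitted-variable logic explains why dropping VO is harmless for the coefficients: the bias it would introduce equals its true coefficient (2) times the slopes from regressing VO on the included E and C, and those slopes are essentially zero. A quick check:

    # slopes of the omitted VO on the included regressors: both near 0
    coef(lm(VO ~ E + C, data = df))[-1]
    # the cost of omitting VO is a much larger residual standard error
    summary(mod.EC)$sigma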

    Scenario 3 (E and VO as covariates)

    E and VO are not correlated. C has been excluded from the model.

    mod.EVO <- lm(O ~ E + VO, data = df)
    summary(mod.EVO)
    
    ## 
    ## Call:
    ## lm(formula = O ~ E + VO, data = df)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -9.638 -1.938  0.005  1.954 12.780 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  0.03893    0.02884    1.35     0.18    
    ## E            3.59175    0.01295  277.27   <2e-16 ***
    ## VO           2.00167    0.00581  344.39   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 2.88 on 9997 degrees of freedom
    ## Multiple R-squared:  0.952,  Adjusted R-squared:  0.952 
    ## F-statistic: 9.88e+04 on 2 and 9997 DF,  p-value: <2e-16
    
    confint(mod.EVO)
    
    ##                2.5 %  97.5 %
    ## (Intercept) -0.01761 0.09546
    ## E            3.56636 3.61714
    ## VO           1.99028 2.01307
    

    There is again overestimation of the coefficient of E (as in Scenario 1), due to the exclusion of C. The estimate of VO is unbiased, as VO is uncorrelated with E, C and VE.

    Scenario 4 (E and VE as covariates)

    E and VE are correlated, but VE has no direct effect on O (its true coefficient is 0).

    mod.EVE <- lm(O ~ E + VE, data = df)
    summary(mod.EVE)
    
    ## 
    ## Call:
    ## lm(formula = O ~ E + VE, data = df)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -41.55  -7.00  -0.02   6.93  43.46 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   0.0106     0.1026     0.1     0.92    
    ## E             2.8584     0.0761    37.6   <2e-16 ***
    ## VE            0.9857     0.0761    12.9   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 10.3 on 9997 degrees of freedom
    ## Multiple R-squared:  0.391,  Adjusted R-squared:  0.391 
    ## F-statistic: 3.21e+03 on 2 and 9997 DF,  p-value: <2e-16
    
    confint(mod.EVE)
    
    ##               2.5 % 97.5 %
    ## (Intercept) -0.1905 0.2117
    ## E            2.7093 3.0075
    ## VE           0.8364 1.1350
    

    Estimates of both E and VE are biased, although the coefficient of E is less biased than in Scenario 1. The reason is that VE is correlated with the omitted confounder C and acts as a partial proxy for it: VE picks up part of C's effect (pushing its own coefficient away from 0), and in doing so absorbs part of the confounding that would otherwise fall entirely on E.
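
    One way to see this is the omitted-variable-bias formula with several included regressors: the bias in each coefficient equals the true coefficient of the omitted C (2) times the corresponding slope from the auxiliary regression of C on the included E and VE. A rough numerical check:

    aux <- lm(C ~ E + VE, data = df)
    coef(aux)[-1]             # slopes of C on E and VE, each roughly 0.44
    2 + 2 * coef(aux)[["E"]]  # predicted coefficient of E, roughly 2.9
    0 + 2 * coef(aux)[["VE"]] # predicted coefficient of VE, roughly 0.9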

    Scenario 5 (E, VE, VO, C as covariates) (Complete model)

    mod.comp <- lm(O ~ E + C + VO + VE, data = df)
    summary(mod.comp)
    
    ## 
    ## Call:
    ## lm(formula = O ~ E + C + VO + VE, data = df)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -3.434 -0.677 -0.014  0.678  3.616 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  0.01297    0.01000    1.30     0.19    
    ## E            1.98794    0.00826  240.62   <2e-16 ***
    ## C            1.99993    0.00830  240.81   <2e-16 ***
    ## VO           2.00033    0.00202  992.51   <2e-16 ***
    ## VE           0.01214    0.00831    1.46     0.14    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1 on 9995 degrees of freedom
    ## Multiple R-squared:  0.994,  Adjusted R-squared:  0.994 
    ## F-statistic: 4.29e+05 on 4 and 9995 DF,  p-value: <2e-16
    
    confint(mod.comp)
    
    ##                 2.5 %  97.5 %
    ## (Intercept) -0.006635 0.03258
    ## E            1.971745 2.00413
    ## C            1.983651 2.01621
    ## VO           1.996381 2.00428
    ## VE          -0.004159 0.02844
    

    All coefficients are estimated without bias; the estimate for VE is not significantly different from its true value of 0.

    Scenario 6 (E, VE, C as covariates)

    mod.EVEC <- lm(O ~ E + VE + C, data = df)
    summary(mod.EVEC)
    
    ## 
    ## Call:
    ## lm(formula = O ~ E + VE + C, data = df)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -36.18  -6.74  -0.08   6.71  43.21 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  -0.0185     0.0998   -0.19     0.85    
    ## E             1.9892     0.0824   24.13   <2e-16 ***
    ## VE            0.0920     0.0830    1.11     0.27    
    ## C             1.9817     0.0829   23.92   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 9.98 on 9996 degrees of freedom
    ## Multiple R-squared:  0.424,  Adjusted R-squared:  0.424 
    ## F-statistic: 2.45e+03 on 3 and 9996 DF,  p-value: <2e-16
    
    confint(mod.EVEC)
    
    ##                2.5 % 97.5 %
    ## (Intercept) -0.21411 0.1771
    ## E            1.82758 2.1507
    ## VE          -0.07061 0.2546
    ## C            1.81930 2.1441
    

    The addition of C has removed the bias from the coefficients of E and VE (compare Scenario 4).

    Scenario 7 (E, VO and C as covariates)

    mod.EVOC <- lm(O ~ E + VO + C, data = df)
    summary(mod.EVOC)
    
    ## 
    ## Call:
    ## lm(formula = O ~ E + VO + C, data = df)
    ## 
    ## Residuals:
    ##    Min     1Q Median     3Q    Max 
    ## -3.427 -0.676 -0.014  0.680  3.641 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  0.01285    0.01000    1.29      0.2    
    ## E            1.99323    0.00742  268.50   <2e-16 ***
    ## VO           2.00036    0.00202  992.52   <2e-16 ***
    ## C            2.00539    0.00741  270.45   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1 on 9996 degrees of freedom
    ## Multiple R-squared:  0.994,  Adjusted R-squared:  0.994 
    ## F-statistic: 5.72e+05 on 3 and 9996 DF,  p-value: <2e-16
    
    confint(mod.EVOC)
    
    ##                 2.5 %  97.5 %
    ## (Intercept) -0.006751 0.03246
    ## E            1.978682 2.00779
    ## VO           1.996410 2.00431
    ## C            1.990857 2.01993
    

    Performance of this model is almost the same as that of the complete model in Scenario 5; dropping VE (true coefficient 0) even gives a slightly smaller standard error for E (0.0074 vs 0.0083).

    Performance of different scenarios

    Model (covariates)         E (true 2)       C (true 2)       VE (true 0)      VO (true 2)
    E, C, VO, VE (complete)    1.9879, 0.0083   1.9999, 0.0083   0.0121, 0.0083   2.0003, 0.0020
    E, C, VE                   1.9892, 0.0824   1.9817, 0.0829   0.0920, 0.0830   -
    E, C, VO                   1.9932, 0.0074   2.0054, 0.0074   -                2.0004, 0.0020
    E, C                       2.0293, 0.0741   2.0231, 0.0740   -                -
    E, VE                      2.8584, 0.0761   -                0.9857, 0.0761   -
    E, VO                      3.5918, 0.0130   -                -                2.0017, 0.0058
    E only                     3.6420, 0.0465   -                -                -

    E (true 2): covariate E, with its true coefficient given in parentheses.

    Each cell shows the coefficient estimate followed by its standard error.
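
    A table like this can also be assembled directly from the fitted model objects, for example:

    models <- list("E, C, VO, VE" = mod.comp, "E, VE, C" = mod.EVEC,
        "E, VO, C" = mod.EVOC, "E, C" = mod.EC, "E, VE" = mod.EVE,
        "E, VO" = mod.EVO, "E only" = mod.E)
    # coefficient estimates and standard errors, intercept dropped
    lapply(models, function(m) round(summary(m)$coefficients[-1, 1:2, drop = FALSE], 4))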

    Conclusions

    Models that include C give an unbiased estimate of E. Models that include the mutually correlated variables (E, C, VE) together have larger standard errors, due to near collinearity.
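
    The collinearity effect can be quantified with the variance inflation factor (VIF), 1 / (1 - R^2) from regressing E on the other correlated covariates in the model. A rough check:

    # VIF for E when C alone, or C and VE, are also in the model
    vif_E_C    <- 1 / (1 - summary(lm(E ~ C, data = df))$r.squared)
    vif_E_C_VE <- 1 / (1 - summary(lm(E ~ C + VE, data = df))$r.squared)
    c(vif_E_C, vif_E_C_VE)        # roughly 2.8 and 3.5
    sqrt(vif_E_C_VE / vif_E_C)    # about 1.1, matching the SE ratio 0.0824 / 0.0741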
