7 Causal Inference
In this section, I will provide a very brief introduction to some of the concepts and tools used to evaluate evidence for causal effects from observational data.
Learning objectives
- Gain a deeper appreciation for why correlation (or association) is not the same as causation
- Discover basic rules that allow one to determine dependencies (correlations) among variables from an assumed causal network
- Understand how causal networks can be used to inform the choice of variables to include in a regression model
Credit: this section was heavily influenced by Daniel Kaplan’s chapter on Causation in his Statistics: A Fresh Approach (Kaplan 2009) and the excellent introduction to causal inference (Dablander 2020) from which many of the examples are drawn.
7.1 R Packages
We begin by loading a few packages upfront:

library(modelsummary) # for tables
library(kableExtra)   # for tables
In addition, we will explore functions for evaluating conditional (in)dependencies in a Directed Acyclical Graph (DAG) using the ggm package (Marchetti, Drton, and Sadeghi 2020).
7.2 Introduction to causal inference
Most of the methods covered in introductory and even advanced statistics courses focus on methods for quantifying associations. For example, the regression models covered in this course quantify linear and non-linear associations between explanatory (X) and response (Y) variables. Yet, many of the most interesting questions we might ask are inherently causal, falling into two broad categories:
- What will happen if we intervene in a system or manipulate one or more variables in some way? E.g., will we increase our longevity if we take a daily vitamin? Will we cause more businesses to leave the state if we increase taxes? How will a vaccine mandate influence disease transmission, unemployment rates, etc.? Will we be able to reduce deer herds if we require hunters to shoot a female deer before they can harvest a male deer (sometimes referred to as an earn-a-buck management strategy¹)?
- Scenarios that we will never be able to observe: would George Floyd still be alive today had he been white? Would your female friend or neighbor have been promoted if she had been male? Would there have been fewer bird collisions with the Vikings stadium if it had been built differently? These types of questions involve counterfactuals and require considering what might have happened in an alternative world where the underlying conditions were different (e.g., where George Floyd was white, your neighbor was male, or the Vikings stadium was built differently).
Importantly, we can’t answer these questions using information on associations alone. For example, if we observe that X and Y are positively correlated, that alone does not tell us whether Y will increase if we intervene and increase X.
When you took an introductory statistics class, you probably heard your instructor say at least once, “correlation is not causation.” Furthermore, you probably learned about some of the challenges associated with inferring causation from observational data and the benefits of performing experiments whenever possible to establish causal linkages. A key challenge with inferring causation from observational data is that there are almost always confounding variables (variables correlated with both the predictor of interest and the response variable) that could offer an alternative explanation for why the predictor and response variables are correlated. One of my favorite examples of confounding is from Lock et al. (2020) (Figure 7.1), simply because it lets me talk about my father-in-law and the fact that he has collected a large number of old televisions. If you look at the average life expectancy versus the number of televisions (TVs) per person in different countries, there is a clear positive correlation (Figure 7.1). Does that mean my father-in-law will live forever thanks to his stockpiles of TVs? Should we all go out and buy more TVs to boost our life expectancy? Of course not: countries with more TVs per person also tend to be wealthier, with better nutrition and health care, and these confounding variables offer a far more plausible explanation for the correlation.²
Evolution has equipped human minds with the power to explain patterns in nature really well, and we can easily jump to causal conclusions from correlations. Consider an observational study reporting that individuals taking a daily vitamin had longer lifespans. At first, this may seem to provide strong evidence for the protective benefits of a daily vitamin. Yet, there are many potential explanations for this observation – individuals that take a daily vitamin may be more risk averse, more worried about their health, eat better, exercise more, have more money, etc – or, perhaps a daily vitamin is actually beneficial to one’s health. We can try to control or adjust for some of these other factors when we have the data on them, but in observational studies there will almost always be unmeasured variables that could play an important causal role in the relationships we observe. The reason experiments are so useful for establishing causality is that by randomly assigning individuals to treatment groups (e.g., to either take a vitamin or a placebo), we break any association between the treatment variable and possible confounders. Thus, if we see a difference between treatment groups, and our sample size is large, we can be much more assured that the difference is due to the treatment and not some other confounding variable.
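To see how randomization breaks the link between treatment and confounder, consider the following small simulation sketch (the variable names and effect sizes here are hypothetical, chosen only for illustration): an unmeasured level of health consciousness drives both vitamin use and lifespan, while the vitamin itself has no effect.

# Hypothetical example: health consciousness (h) drives both vitamin
# use and lifespan; the vitamin itself has no causal effect
set.seed(123)
n <- 10000
h <- rnorm(n)                           # unmeasured health consciousness
vitamin.obs <- rbinom(n, 1, plogis(h))  # self-selected vitamin use
lifespan <- 75 + 3*h + rnorm(n)         # lifespan depends only on h

# Observational comparison: vitamin takers appear to live longer
coef(lm(lifespan ~ vitamin.obs))["vitamin.obs"]

# Random assignment breaks the association between treatment and h,
# so the estimated "effect" should now be near zero
vitamin.rct <- rbinom(n, 1, 0.5)
coef(lm(lifespan ~ vitamin.rct))["vitamin.rct"]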
One might walk away from an introductory statistics course thinking that the only way to establish causality is through experimentation. Yet, tools for inferring causality from observational data have also been around for a long time (e.g., Sewall Wright invented path analysis in the early 1920s; Wright 1921). Furthermore, computer scientists, econometricians, and statisticians have made a lot of progress over the past few decades in developing new theory and methods for inferring causality from observational data. Most of these approaches require assumptions about how the world works, encoded in a causal diagram or directed acyclical graph (DAG). In this section, I will briefly introduce DAGs and describe how they can be used to inform the choice of an appropriate regression model for estimating causal effects. Yet, this section will barely scratch the surface when it comes to causal inference. More in-depth treatments can be found, e.g., in Pearl (2000), Pearl, Glymour, and Jewell (2016), and Pearl and Mackenzie (2018).
7.3 Directed acyclical graphs and conditional independencies
Directed acyclical graphs (DAGs) represent causal pathways connecting nodes (either observed or unobserved variables) in a system, and thus, represent our understanding of how we think the world works. Connections between variables are directed, meaning that arrows are drawn so that one can distinguish cause from effect (cause → effect). The graphs are also acyclical, meaning that it is not possible to follow a sequence of arrows from a variable back to itself.
As we will see, DAGs are central to understanding which predictor variables should be included in a regression model when attempting to estimate causal effects using observational data. At the most basic level, there are three types of causal connections between variables that need to be considered (arrows, below, indicate the direction of causal effects); these connections will help determine whether variables are dependent (i.e., associated) or independent after conditioning on one or more other variables in the system (Pearl 1995; Pearl 2000):

- chain: X → Z → Y, with Z referred to as a mediator variable on the causal path from X to Y
- fork: X ← Z → Y, with Z referred to as a common cause of X and Y
- inverted fork: X → Z ← Y, with Z referred to as a collider variable
Assume for now that X, Y, and Z are the only variables in the system and that all of the depicted causal effects (arrows) are non-zero.
In the case of a chain, X and Y will be marginally³ dependent because the effect of X is transmitted to Y through the mediator, Z. However, X and Y will become independent once we condition on Z (e.g., by including it as a predictor in a regression model); conditioning on the mediator “blocks” the flow of information along the causal path.
Similarly, in the case of a fork or common cause, X and Y will be marginally dependent because they share the common cause Z, but they will be conditionally independent given Z. This is the classic confounding scenario, in which adjusting for Z removes the spurious association between X and Y.
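A quick simulation sketch of a fork (with arbitrary coefficients, chosen only for illustration) shows both behaviors: Z causes both X and Y, so X and Y are correlated marginally but not after conditioning on Z.

# Fork: Z is a common cause of X and Y (arbitrary effect sizes)
set.seed(42)
n <- 10000
z <- rnorm(n)
x <- 0.8*z + rnorm(n)
y <- 0.8*z + rnorm(n)

cor(x, y)                 # marginally dependent (correlated)
coef(lm(y ~ x))["x"]      # non-zero slope when Z is omitted
coef(lm(y ~ x + z))["x"]  # slope near 0 once we condition on Z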
One might get the impression, based on the above example and simple discussions of confounding variables in introductory statistics courses, that it is always best to include or adjust for other variables when fitting regression models. However, we can also create a spurious correlation by conditioning on a collider variable in an inverted fork. Specifically, if X and Y both influence Z, then X and Y will be marginally independent, but they will typically become dependent once we condition on Z.
In summary, there are multiple DAGs that we could consider as representations of the causal connections between the variables X, Y, and Z, and each DAG implies a different set of (conditional) dependencies and independencies.
7.3.1 Collider bias
Although the bias caused by confounding variables (e.g., a common cause) is well known, many readers may be surprised to hear that a spurious correlation can be created when we adjust for a variable. Thus, we will demonstrate this issue with a simple simulation example. Consider a survey of students at the University of Minnesota, where students are asked whether they are taking one or more classes on the St. Paul campus (Z), their level of interest in nutrition and food science (X), and the number of days they spent fishing during the past year (Y). Because the St. Paul campus houses programs in both nutrition and fisheries, an interest in either topic should increase the probability of taking classes there, making Z a collider. We simulate data consistent with this scenario as follows:
- We generate 10,000 values of X (interest in nutrition and food science) from a uniform distribution between 0 and 10.
- We generate 10,000 values of Y (days spent fishing), independent of X, using a Poisson distribution with mean, λ, equal to 4.⁴
- We determine the probability that each student is taking a class on the St. Paul campus, p, using: logit(p) = -5 + 2X + 2Y, or equivalently, p = exp(-5 + 2X + 2Y)/(1 + exp(-5 + 2X + 2Y)).
- We generate Z as a binomial random variable with a single trial and success probability p.

When simulated in this way, X and Y are independent in the full population of students, so any association between them that emerges after conditioning on Z must be spurious.
# Set seed of random number generator
set.seed(1040)

# number of students
n <- 10000

# Generate X = interest in nutrition and food science
x <- runif(n, 0, 10)

# Generate number of days fishing
y <- rpois(n, lambda = 4)

# Generate whether students are taking classes on St. Paul campus
p <- exp(-5 + 2*x + 2*y)/(1 + exp(-5 + 2*x + 2*y))
z <- rbinom(n, 1, prob = p)
We then explore marginal and conditional relationships between X and Y by regressing Y on X alone (mod1) and on both X and the collider Z (mod2):
mod1 <- lm(y ~ x)
mod2 <- lm(y ~ x + z)
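Since we loaded the modelsummary package upfront, one convenient way to compare the two fits side by side is shown below (a minimal sketch; the labels are our own). Because X and Y were simulated independently, the slope for x should be near zero in mod1, but noticeably negative in mod2, which conditions on the collider Z.

# Compare the marginal and conditional fits in a single table
modelsummary(list("mod1: y ~ x" = mod1, "mod2: y ~ x + z" = mod2))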
Importantly, collider bias can also occur if we restrict the study population using information in the collider variable, Z. For example, suppose we only survey students taking one or more classes on the St. Paul campus (i.e., we subset to observations with Z = 1):
collider.dat <- data.frame(x = x, y = y, z = z)
summary(lm(y ~ x, data = subset(collider.dat, z == 1)))
Call:
lm(formula = y ~ x, data = subset(collider.dat, z == 1))
Residuals:
Min 1Q Median 3Q Max
-4.1936 -1.2010 -0.1128 1.0351 9.1015
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.251386 0.041596 102.208 < 2e-16 ***
x -0.035381 0.007101 -4.982 6.39e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.985 on 9701 degrees of freedom
Multiple R-squared: 0.002553, Adjusted R-squared: 0.00245
F-statistic: 24.83 on 1 and 9701 DF, p-value: 6.386e-07
In essence, an interest in nutrition or an interest in fishing might lead someone to be represented in the study population of students on the St. Paul campus. Those students who end up in St. Paul due to their high interest in fishing likely have an average (or slightly below-average) level of interest in nutrition (assuming that interest in fishing and nutrition are independent in the full population of students). Similarly, students studying nutrition likely have an average to slightly below-average level of interest in fishing. When we combine these two sets of students (students studying nutrition and students studying fisheries, wildlife, or conservation biology), we end up with a negative association between interest in fishing and interest in nutrition in the study population.
Similar concerns have been raised when analyzing data from hospitalized patients. As one example, Sackett (1979) found an association between locomotor disease and respiratory disease in hospitalized patients but not in the larger population. This result could be explained by a DAG in which hospitalization serves as a collider variable (Figure 7.5).
The importance of collider bias has also been highlighted recently in the context of occupancy models (MacKenzie et al. 2017), which are widely used in ecology. Using a simulation study, Stewart et al. (2023) showed that estimates of effect sizes associated with covariates influencing occupancy probabilities were biased when collider variables were considered for inclusion and information criteria (AIC and BIC; see Chapter 8) were used to select an appropriate model.
7.4 d-separation
We can use these same 3 basic rules to help determine, in larger causal networks, whether variables X and Y are (conditionally) independent given a set of conditioning variables. A useful trick is to think of associations as electricity flowing through the network:

- chain (X → Z → Y): if we add electricity to X, it will flow to Y. Thus, the pathway is open unless we include (i.e., condition on) Z, which will “block” the path.
- fork (X ← Z → Y): if we add electricity to Z, it will flow to both X and Y. Thus, the pathway between X and Y is open unless we include (i.e., condition on) Z, which will block this path.
- inverted fork (X → Z ← Y): if we add electricity to either X or Y, it will flow to Z and get stuck. If we add electricity to Z, it will remain there. Thus, there is no way to connect X and Y unless we condition on Z, which will open the pathway. It turns out that conditioning on any of the descendants of Z will also open this pathway (Pearl, Glymour, and Jewell 2016). A descendant of Z is any variable that can be reached by following an arrow (or set of arrows) leading out of Z.
Let’s now consider a more complicated causal network (Figure 7.6).
To determine if two variables are dependent after conditioning on one or more variables, we will use the following steps:
- Write down all paths connecting the two variables.
- Determine if any of the paths are open/correlating. If any of the paths are open, then the two variables will be dependent. If all of the paths are blocked, then the two variables will be (conditionally) independent. When this occurs, we say that the variables are d-separated by the set of conditioning variables.
Let’s consider variables X and Y in Figure 7.6. There are two paths connecting them:

1. X → Z → Y
2. X → W ← Y

The first path is a chain and is correlating (unless we condition on the mediator, Z). The second path contains a collider (W) and is therefore blocked (unless we condition on W or one of its descendants, such as U). Putting these pieces together:

- X and Y are dependent (due to the first path being open).
- X and Y are independent given Z (conditioning on Z will close this open path).
- X and Y are dependent given {Z, W} (conditioning on Z will close the first path, but conditioning on W will open up the second path).
- X and Y are dependent given {Z, U} (conditioning on Z will close the first path, but conditioning on U, which is a descendant of the collider W, will open up the second path).
We can use functions in the ggm package to confirm these results (Marchetti, Drton, and Sadeghi 2020). We begin by constructing the DAG using the DAG function, where the formulas capture all arrows flowing into each of our variables:
library(ggm)
dag1 <- DAG(W ~ X + Y,
            Z ~ X,
            Y ~ Z,
            V ~ Y,
            U ~ W)
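If you would like a quick visual check that the coded network matches Figure 7.6, the ggm package also provides a drawGraph function (shown here as an optional sketch):

# Optional: plot the DAG to verify it matches Figure 7.6
drawGraph(dag1)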
We can then use the dSep function to test whether two variables (supplied via the first and second arguments of the dSep function) are d-separated after conditioning on one or more variables (the cond argument). Here, we confirm the 4 results from before:
dSep(dag1, first = "X", second= "Y", cond=NULL)
[1] FALSE
dSep(dag1, first = "X", second= "Y", cond="Z")
[1] TRUE
dSep(dag1, first = "X", second= "Y", cond=c("Z","W"))
[1] FALSE
dSep(dag1, first = "X", second= "Y", cond=c("Z","U"))
[1] FALSE
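As an additional check (this example is ours, not one of the 4 results above), consider X and V. Both paths connecting them, X → Z → Y → V and X → W ← Y → V, are blocked once we condition on Y: the first is a chain that passes through Y, and the second contains the unconditioned collider W.

# X and V should be d-separated by Y
dSep(dag1, first = "X", second = "V", cond = "Y")  # expect TRUE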
7.5 Estimating causal effects (direct, indirect, and total effects)
As was mentioned in the introduction to this section, when we want to intervene in a system, we often find that there are both direct and indirect effects on other variables. Formal methods have been developed for calculating the effects of interventions (e.g., do-calculus; Pearl, Glymour, and Jewell 2016; Dablander 2020). Although we will not go into detail about these methods here, it is useful to recognize that when we intervene in a system and set a variable, X, to a particular value, we effectively delete all arrows pointing into X; this is not the same as conditioning on X in the unmanipulated system.
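A small simulation sketch (with arbitrary coefficients) can illustrate the difference between conditioning on X and intervening to set X. Below, Z is a common cause of X and Y; intervening on X severs the arrow from Z into X, and only then does the regression slope recover the true causal effect (0.5):

set.seed(7)
n <- 100000
z <- rnorm(n)
x <- z + rnorm(n)             # observationally, X depends on Z
y <- 0.5*x + z + rnorm(n)     # true causal effect of X on Y is 0.5

# Conditioning on X in the unmanipulated system mixes the causal
# effect with confounding by Z
coef(lm(y ~ x))["x"]          # close to 1, not 0.5

# Intervening: we set X ourselves, deleting the Z -> X arrow
x.do <- rnorm(n)
y.do <- 0.5*x.do + z + rnorm(n)
coef(lm(y.do ~ x.do))["x.do"] # close to 0.5, the causal effect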
We can also consider how we can use DAGs to determine whether to include or exclude a variable from a regression model depending on whether we are interested in estimating a direct effect or a total (sum of direct and indirect) effect. Let’s start by considering two examples from Pearl, Glymour, and Jewell (2016) and Dablander (2020) (Figure 7.8). Panel A depicts a situation in which a treatment (X) has a direct effect on a response (Y), and the two variables also share a common cause (Z).
Let’s start by writing down all paths that connect X and Y in panel A:

1. X → Y
2. X ← Z → Y

Both of these paths are correlating: the first because X causes Y, and the second because X and Y share the common cause Z. When fitting a regression model, we want the coefficient for X to capture only the causal effect of X on Y; thus, we need to include Z in the model so that we block the spurious path X ← Z → Y.
Now, let’s consider a second example where the treatment has both a direct effect on the response and an indirect effect through a mediating variable, which we will call M (panel B). Here, the two paths connecting X and Y are:

1. X → Y
2. X → M → Y

Again, both of these paths are correlating. If we fit a model that includes only X, the coefficient for X will estimate the total effect of X on Y (the direct effect plus the indirect effect operating through M). If we instead include both X and M, the coefficient for X will estimate only the direct effect of X on Y.
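A short simulation sketch (arbitrary coefficients) makes the distinction concrete: below, the direct effect of X on Y is 1, the indirect effect through M is 0.5 × 2 = 1, and so the total effect is 2.

set.seed(99)
n <- 10000
x <- rnorm(n)
m <- 0.5*x + rnorm(n)      # mediator: X -> M
y <- x + 2*m + rnorm(n)    # direct effect of X = 1; effect of M = 2

coef(lm(y ~ x))["x"]       # about 2: the total effect
coef(lm(y ~ x + m))["x"]   # about 1: the direct effect only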
More generally, when we want to estimate a causal effect of X on Y, we need to:

- Block all spurious (non-causal) paths between X and Y,
- Leave all directed paths from X to Y unblocked (i.e., do not include mediator variables on the path between X and Y), and
- Make sure not to create spurious correlations by including colliders (or their descendants) that connect X to Y.
Consider again Figure 7.6 (shown again, below). If we want to calculate the total causal effect of X on Y, we again write down all paths connecting the two variables:

1. X → Z → Y
2. X → W ← Y

The first path is our directed path connecting X to Y; we want to leave it unblocked, so we should not condition on the mediator, Z. The second path contains the collider W and is already blocked, so we should not condition on W (or its descendant, U), which would open it. Thus, to estimate the total causal effect of X on Y, we should regress Y on X alone.
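To check this logic, here is a simulation sketch of the Figure 7.6 network with arbitrary coefficients; the total causal effect of X on Y is 0.5 × 0.7 = 0.35, and only the regression of Y on X alone recovers it.

set.seed(2024)
n <- 100000
x <- rnorm(n)
z <- 0.5*x + rnorm(n)      # X -> Z
y <- 0.7*z + rnorm(n)      # Z -> Y, so the total effect of X is 0.35
w <- x + y + rnorm(n)      # collider: X -> W <- Y
u <- w + rnorm(n)          # descendant of the collider W

coef(lm(y ~ x))["x"]       # about 0.35: the total causal effect
coef(lm(y ~ x + z))["x"]   # about 0: blocking the path via mediator Z
coef(lm(y ~ x + w))["x"]   # biased: conditioning on the collider W
coef(lm(y ~ x + u))["x"]   # biased: conditioning on a descendant of W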
7.6 Some (summary) comments
We have seen how simple causal diagrams can help with understanding if and when we should include variables in regression models. Hopefully, you will keep these ideas in mind when learning about other less thoughtful, data-driven methods for choosing a model (Chapter 8). In particular, it is important to recognize that a “best fitting” model may not be the most appropriate one for addressing your particular research question, and in fact, it can be misleading (Luque-Fernandez et al. 2019; Stewart et al. 2023; Arif and MacNeil 2022; Addicott et al. 2022).
One challenge with implementing causal inference methods is that they rely heavily on assumptions (i.e., an assumed graph capturing causal relationships between variables in the system). And although it is sometimes possible to work backwards, using the set of observed statistical independencies in the data to suggest or rule out possible causal models, multiple models can lead to the same set of statistical independencies (Shipley 2002). In these cases, experimentation can prove critical for distinguishing between competing hypotheses.
Lastly, it is important to consider multiple lines of evidence when evaluating the strength of evidence for causal effects. In that spirit, I end this section with Sir Austin Bradford Hill’s⁵ suggested criteria for establishing likely causation (the list below is taken verbatim from https://bigdata-madesimple.com/how-to-tell-if-correlation-implies-causation/):
- Strength: A relationship is more likely to be causal if the correlation coefficient is large and statistically significant.
- Consistency: A relationship is more likely to be causal if it can be replicated.
- Specificity: A relationship is more likely to be causal if there is no other likely explanation.
- Temporality: A relationship is more likely to be causal if the effect always occurs after the cause.
- Gradient: A relationship is more likely to be causal if a greater exposure to the suspected cause leads to a greater effect.
- Plausibility: A relationship is more likely to be causal if there is a plausible mechanism between the cause and the effect.
- Coherence: A relationship is more likely to be causal if it is compatible with related facts and theories.
- Experiment: A relationship is more likely to be causal if it can be verified experimentally.
- Analogy: A relationship is more likely to be causal if there are proven relationships between similar causes and effects.
7.7 References
1. https://www.realtree.com/deer-hunting/articles/earn-a-buck-was-it-the-greatest-deer-management-tool
2. In most years, my students discover Tyler Vigen’s web site (http://tylervigen.com/spurious-correlations), which offers many other silly examples of ridiculously strong correlations over time between variables that do not themselves share a causal relationship.
3. Marginal here refers to the unconditional relationship between X and Y rather than the strength of this relationship.
4. We will learn more about the distributions used to simulate data when we get to Chapter 9.
5. Sir Austin Bradford Hill was a famous British medical statistician.