20 Generalized Estimating Equations (GEE)
Learning Objectives
- Learn how to model population-level response patterns for count and binary data using Generalized Estimating Equations (GEEs)
20.1 R packages
We begin by loading a few packages upfront:
library(geepack) # for estimating parameters using GEEs
library(lme4)    # for fitting GLMMs
library(ggplot2) # for plotting
20.2 Introduction to Generalized Estimating Equations
In the last chapter, we saw how the parameters in typical GLMMs have a subject-specific interpretation. We also saw how we could estimate mean-response patterns at the population level by integrating over the distribution of the random effects. Lastly, we saw how we could use the marginal_coef function in the GLMMadaptive package to estimate parameters describing marginal (i.e., population-average) mean-response patterns from fitted GLMMs.
There are times, however, when we may be interested in directly modeling how the population mean changes as a function of our predictor variables. For Normally distributed response variables, we can use the gls function in the nlme package, which allows us to specify various correlation structures for our residuals (Section 19.3.2). Yet, as we highlighted in Section 19.4, multivariate distributions for count and binary data do not have separate parameters for the mean and the variance/correlation structure, making it challenging to fit marginal models to binary or count data when we have repeated measures. Simply put, there is no multivariate Poisson or binomial distribution that contains separate parameters describing correlation among observations from the same cluster, so we cannot take the simple marginal modeling approach we used for Normally distributed data (see Section 18.14 and Section 19.3.2).
Generalized Estimating Equations (GEEs) offer an alternative approach to modeling correlated binary or count data in which one specifies a model for the first two moments (the mean and variance) of the data but not the full marginal distribution (Zeger, Liang, and Albert 1988; Hilbe and Hardin 2008; Fieberg et al. 2009). Parameters are estimated by solving:

$$\sum_{i=1}^{n} D_i^T V_i^{-1} (Y_i - \mu_i) = 0 \quad \quad (20.1)$$

where $Y_i$ is the vector of responses from the $i$th cluster, $\mu_i = E[Y_i]$ is the mean response implied by the regression parameters $\beta$, $D_i = \partial \mu_i / \partial \beta$, and $V_i$ is a working variance-covariance matrix for $Y_i$ constructed from the assumed variance function and a working correlation structure.
Importantly, estimates of the regression parameters will be (asymptotically) unbiased even when the working correlation and variance model is mis-specified, provided that the model for the marginal mean is correct and the mechanisms responsible for any missing data or sample-size imbalance can be assumed to be completely random (Zeger, Liang, and Albert 1988); proper specification of the working correlation structure does, however, improve the efficiency (i.e., precision) of the resulting estimators.
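To build some intuition for these claims, consider what Equation 20.1 implies when we adopt an independence working correlation: the estimating equations then reduce to the usual GLM score equations, so the coefficient estimates should match those from glm (up to convergence tolerance), while the GEE still reports robust (sandwich) standard errors that account for within-cluster correlation. Below is a minimal sketch using simulated clustered binary data (the object names, e.g. demo.dat, are hypothetical, not from the examples that follow):

set.seed(1)
n.clusters <- 50; n.per <- 5
id <- rep(1:n.clusters, each = n.per) # data must be sorted by cluster
x <- rnorm(n.clusters * n.per)
b <- rep(rnorm(n.clusters, sd = 1), each = n.per) # cluster-level heterogeneity
demo.dat <- data.frame(
  id = id, x = x,
  y = rbinom(n.clusters * n.per, size = 1, prob = plogis(-1 + 0.5 * x + b)))
# Same coefficient estimates, but different (robust) standard errors:
coef(glm(y ~ x, family = binomial(), data = demo.dat))
coef(geeglm(y ~ x, family = binomial(), id = id, corstr = "independence",
            data = demo.dat))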
20.3 Motivating GEEs
You may be wondering where Equation 20.1 comes from. In Section 10.11 we saw that the maximum likelihood estimator for the linear model is equivalent to the least-squares estimator, which finds the values of $\beta$ that minimize:

$$\sum_{i=1}^{n} (Y_i - X_i\beta)^2$$

Taking the derivative with respect to $\beta$ and setting it equal to 0 leads to the following estimating equation:

$$\sum_{i=1}^{n} X_i^T (Y_i - X_i\beta) = \sum_{i=1}^{n} X_i^T (Y_i - \mu_i) = 0 \quad \quad (20.2)$$

where $\mu_i = E[Y_i] = X_i\beta$. It turns out that maximum likelihood estimators for generalized linear models solve an equation of this same general form, namely $\sum_{i=1}^{n} D_i^T V_i^{-1} (Y_i - \mu_i) = 0$, with $D_i = \partial \mu_i / \partial \beta$ and $V_i = \text{Var}(Y_i)$.
To demonstrate this fact, we will derive this result for the logistic regression model, which assumes:

$$Y_i \sim \text{Bernoulli}(p_i)$$

$$\text{logit}(p_i) = \log\left(\frac{p_i}{1-p_i}\right) = X_i\beta$$
First, note that for the logistic regression model:

$$\mu_i = E[Y_i] = p_i = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)} \quad \text{and} \quad V_i = \text{Var}(Y_i) = p_i(1 - p_i)$$

Furthermore:

$$D_i = \frac{\partial \mu_i}{\partial \beta} = p_i(1 - p_i)X_i$$

Thus,

$$\sum_{i=1}^{n} D_i^T V_i^{-1} (Y_i - \mu_i) = \sum_{i=1}^{n} X_i^T p_i(1-p_i)\left[p_i(1-p_i)\right]^{-1}(Y_i - p_i) = \sum_{i=1}^{n} X_i^T (Y_i - p_i) = 0 \quad \quad (20.3)$$
We can show that we end up in the same place if we take the derivative of the log-likelihood with respect to $\beta$. The likelihood for the logistic regression model is given by:

$$L(\beta; Y) = \prod_{i=1}^{n} p_i^{Y_i} (1 - p_i)^{1 - Y_i}$$

Thus, the log-likelihood is given by:

$$\log L(\beta; Y) = \sum_{i=1}^{n} Y_i \log(p_i) + (1 - Y_i)\log(1 - p_i)$$

Taking the derivative of the log-likelihood with respect to $\beta$ and setting it equal to 0 gives:

$$\sum_{i=1}^{n} Y_i(1 - p_i)X_i^T - (1 - Y_i)p_i X_i^T = \sum_{i=1}^{n} X_i^T (Y_i - p_i) = 0$$
which is the same as Equation 20.3.
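We can also verify this result numerically: the fitted probabilities from glm satisfy $\sum_{i=1}^{n} X_i^T(Y_i - \hat{p}_i) = 0$ up to numerical tolerance. A quick sketch with simulated data (the variable names here are hypothetical):

set.seed(2)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.5 + x))
fit.glm <- glm(y ~ x, family = binomial())
X <- model.matrix(fit.glm)
# Both entries should be essentially zero:
crossprod(X, y - fitted(fit.glm))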
If you found this exercise to be fun, you might follow up by proving that the maximum likelihood estimator for Poisson regression is also equivalent to Equation 20.2.
What is the point of all this? The equations used to estimate parameters in linear models and generalized linear models take a similar form. Generalized estimating equations (Equation 20.1) extend this basic idea to situations where we have multiple observations from the same cluster. We just replace the contributions from individual observations with contributions from cluster-level vectors of responses, $Y_i$, and replace $V_i$ with a working variance-covariance matrix that allows for correlation among observations from the same cluster.
20.4 GEE applied to our previous simulation examples
There are a few different packages for fitting GEEs in R. We will explore the geeglm function in the geepack package (Yan 2002; Halekoh, Højsgaard, and Yan 2006). Let's revisit the data from our hockey skaters in Chapter 19. Similar to random intercept models, let's assume the observations from the same skater are all exchangeable, giving a compound symmetry correlation structure:

$$\text{cor}(Y_{it}, Y_{is}) = \alpha \text{ for all } t \neq s$$
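For concreteness, here is what this working correlation matrix looks like in R for a skater with 4 observations, using a hypothetical value of $\alpha$ (the model fit below estimates $\alpha \approx 0.26$):

alpha <- 0.26 # hypothetical value for illustration
R.work <- matrix(alpha, nrow = 4, ncol = 4)
diag(R.work) <- 1 # each observation is perfectly correlated with itself
R.work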
Further, as with logistic regression, we will assume:

$$E[Y_{it}] = p_{it}, \quad \text{logit}(p_{it}) = \beta_0 + \beta_1 \cdot \text{practice.num}_{it}, \quad \text{Var}(Y_{it}) = \phi \, p_{it}(1 - p_{it})$$
We specify this assumption using the family = binomial() argument. Note, however, that we are not assuming our data come from a binomial distribution; we only adopt the mean and variance assumptions from logistic regression. Furthermore, in contrast to GLMMs with a random intercept (Chapter 19), GEEs model the population mean as a logit-linear function of predictors and regression parameters. To make sure we account for the repeated-measures aspect of the data, we have to specify a clustering variable, individual in this case, using id = individual.
fit.gee.e <- geeglm(Y ~ practice.num, family = binomial(),
                    id = individual, corstr = "exchangeable",
                    data = skatedat)
summary(fit.gee.e)
Call:
geeglm(formula = Y ~ practice.num, family = binomial(), data = skatedat,
id = individual, corstr = "exchangeable")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -3.4833 0.2708 165.4 <2e-16 ***
practice.num 0.5960 0.0421 200.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = exchangeable
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 0.979 0.1663
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.2619 0.06906
Number of clusters: 100 Maximum cluster size: 10
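As noted in Section 20.2, the coefficient estimates should be robust to mis-specification of the working correlation structure. As a quick check (a sketch, refitting the same model with an AR(1) working correlation instead):

fit.gee.ar1 <- geeglm(Y ~ practice.num, family = binomial(),
                      id = individual, corstr = "ar1", data = skatedat)
# Point estimates should be similar across working correlation structures:
cbind(exchangeable = coef(fit.gee.e), ar1 = coef(fit.gee.ar1))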
If we compare the coefficients from the GLMM with a random intercept to the coefficients from the Generalized Estimating Equations approach, we find that the latter estimates are closer to 0.
fixef(glmermod)
(Intercept) practice.num
-5.949 0.997
This result was to be expected since the GLMM estimates subject-specific parameters whereas we are directly modeling the population-averaged response pattern using GEEs. We also see that unlike the GLMM, the GEE does a nice job of estimating the proportion of skaters in the population that can skate backwards at each practice (Figure 20.1).
mdat6 <- rbind(mdat5,
               data.frame(practice.num = practice.num,
                          p = plogis(coef(fit.gee.e)[1] + practice.num * coef(fit.gee.e)[2]),
                          type = rep("GEE", 10)))
# Plot
ggplot(mdat6, aes(practice.num, p, col=type)) +
geom_line() + geom_point() +
xlab("Number of practices") +
ylab("Proportion of skaters that can skate backwards")
Figure 20.1: Population mean-response patterns estimated using the marginal_coef function (olive green) from the GLMMadaptive package and using Generalized Estimating Equations (red), versus the GLMM response curve for a typical individual (formed by setting the random intercept equal to 0).

Similarly, a GEE with an exchangeable working correlation structure does a nice job of estimating the population-mean-response pattern for our simulated birders (Figure 20.2):
geebird <- geeglm(Y ~ lobs.time, family = poisson(),
                  id = individual, corstr = "exchangeable", data = birdobs)

# Summarize the mean number of birds seen as a function of lobs.time at the population level
mdatb$gee <- exp(coef(geebird)[1] + mdatb$lobs.time * coef(geebird)[2])
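Because geeglm objects inherit from glm, the same fitted curve can, in principle, also be obtained with predict(); a sketch, assuming mdatb contains the lobs.time values used above:

mdatb$gee <- predict(geebird, newdata = mdatb, type = "response")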
# plot
ggplot(mdatb, aes(lobs.time, meanY)) + geom_point() +
geom_line(aes(lobs.time, gee), color="black") +
xlab("log(Observation time)") + ylab("Number of birds seen")

20.5 GEEs versus GLMMs
Given that GLMMs and GEEs estimate parameters that have different interpretations, much of the literature that aims to provide guidance on choosing between these methods focuses on the estimation target of interest – i.e., whether one is interested in describing subject-specific response patterns (GLMM) or population-average response patterns (GEE). Yet, we saw that it is possible to estimate population-average response patterns from GLMMs if we integrate over the distribution of the random effects, so the choice of estimation target may not be a deciding factor when choosing between GEEs and GLMMs. In this short section, we further contrast the two approaches in hopes of providing additional guidance to practitioners.
Complexity: GLMMs require approximating an integral in the likelihood before parameters can be estimated (Section 19.7). This makes GLMMs much more challenging to fit than GEEs. On the other hand, GLMMs can describe much more complex hierarchical data structures (e.g., clustering due to multiple grouping variables, such as individual and year), whereas GEEs typically only allow for a single clustering variable.
Robustness: GLMMs require distributional assumptions (e.g., Normally distributed random effects, a distribution for the response, $Y_i$, conditional on the random effects). GEEs, by contrast, only make assumptions about the mean and variance/covariance of the data, and estimates of regression parameters should be robust to incorrect assumptions about the variance/covariance structure. On the other hand, GEEs have more strict requirements when it comes to missing data. Whereas standard applications of GEE require data to be missing completely at random, likelihood approaches are still appropriate if missingness depends only on the individual's observed data (i.e., their covariates and their other response values) and not on data that were not observed (e.g., the missing response) (Carrière and Bouyer 2002; Little and Rubin 2019).

Tools for model selection and inference: since GLMMs are likelihood based, we have a number of methods that can be used to compare and select an appropriate model (e.g., AIC, likelihood ratio tests). We can also assess goodness-of-fit using simulation-based methods. Furthermore, we can simulate future responses and estimate prediction intervals when using GLMMs. Since GEEs only make assumptions about the mean and variance of the data, we cannot use likelihood-based methods for comparing models. In addition, GEEs typically require large numbers of clusters so that a Normal approximation is appropriate for the sampling distribution of regression parameter estimators.
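Although AIC is unavailable for GEEs, Pan (2001) proposed the quasi-likelihood information criterion (QIC) as an analogue for comparing models and working correlation structures, and recent versions of geepack provide a QIC function. A short sketch, assuming the fitted skater model from above is available:

fit.gee.ind <- geeglm(Y ~ practice.num, family = binomial(),
                      id = individual, corstr = "independence",
                      data = skatedat)
QIC(fit.gee.e)   # exchangeable working correlation
QIC(fit.gee.ind) # independence working correlation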