library(kableExtra) # for creating tables
options(kableExtra.html.bsTable = T)
library(dplyr) # for data wrangling
library(ggplot2) # for plotting
library(patchwork) # for creating multi-panel plots
10 Maximum likelihood
Learning objectives
Understand how to use maximum likelihood to estimate parameters in statistical models.
Understand how to create large-sample confidence intervals for parameters estimated using maximum likelihood.
Be able to construct large-sample confidence intervals for functions of parameters using the delta method.
Be able to conduct likelihood ratio tests for nested models and understand that profile-likelihood intervals can be formed by inverting the likelihood ratio test.
Credit: much of the material in this section was originally developed by Jack Weiss and presented online as part of his Statistical Ecology courses at the University of North Carolina.
10.1 R packages
We begin by loading a few packages upfront:
In addition, we will use data and functions from the following packages:
- Data4Ecologists for the beargrowth and slugs data sets
- emdbook (B. Bolker 2023) for the deltavar function used to calculate standard errors via the delta method
10.2 Parameter estimation
In the previous section, we learned about a number of statistical distributions, described by a small set of parameters. We might now ask:
- How do we determine appropriate values of the parameters?
- How do we incorporate the effects of covariates on these parameters and thus, the distributions of our data?
- How do we calculate confidence intervals or conduct hypotheses tests associated with these parameters?
Although there are alternative approaches (e.g., least squares, methods of moments), we will focus on two primary methods for estimating parameters:
- Maximum likelihood estimators
- Bayesian inference
Most statistical software packages use maximum likelihood to estimate parameters, including, for example, the glm function in R for fitting logistic and Poisson regression models, which will be the subject of later sections of this book. Although users do not necessarily need to know the details of how maximum likelihood works in order to apply these methods, this knowledge will be helpful for understanding the inferential framework associated with commonly used statistical models fit in a frequentist framework (e.g., SEs and confidence intervals usually rely on an assumption that the sampling distribution is Normal, which will be reasonable for parameters estimated using maximum likelihood as long as our sample sizes are large or the data are Normally distributed). Typically, numerical methods are needed to estimate parameters using maximum likelihood. You will be introduced to these methods, which may seem like a bit of a diversion; yet, this knowledge will allow you to formulate and fit custom models that may otherwise not be available in current software. Lastly, likelihoods are also a key component of Bayesian inferential techniques. Thus, this section will help to prepare you for your journey into Bayesian inference.
To help motivate this section, consider the weight-at-age relationship for female black bears monitored in Minnesota depicted in Figure 10.1, constructed using the code below.
library(Data4Ecologists)
theme_set(theme_bw())
data(beargrowth)
beargrowth <- beargrowth %>% filter(sex=="Female")
ggplot(beargrowth, aes(age, wtkg)) +
geom_point() + geom_smooth() + ylab("Weight (kg)") + xlab("Age (years)")
We see that the pattern is non-linear and also that the variance of the observations increases with the age of the bears. We could try to model these data using polynomials or splines (see Chapter 4) combined with a variance model from Chapter 5. However, there are also many non-linear functions that have “built-in” asymptotes, and these are frequently used to model growth rates. Here, we will consider the von Bertalanffy growth function (Beverton and Holt 2012):

$$\mu(\text{age}) = L_\infty\left(1 - e^{-K(\text{age} - t_0)}\right)$$
An advantage of using a growth model like the von Bertalanffy growth curve versus a phenomenological model (e.g., using polynomials or splines) is that the model’s parameters often have useful biological interpretations. For example, $L_\infty$ represents the asymptotic (maximum) mean weight, $K$ controls how quickly bears approach that asymptote as they age, and $t_0$ is the theoretical age at which the mean weight equals 0.
Although there are R packages (e.g., FSA; Ogle et al. 2021) that allow one to fit the von Bertalanffy growth curve to data, writing our own code and using built-in optimizers in R will let us tailor the model to specific features of our data. For example, we can relax the assumption that the variance about the mean growth curve is constant, either using a GLS-style variance model or an alternative probability distribution (e.g., log-normal) that allows for non-constant variance. Keep this example in mind when we go over numerical methods for estimating parameters using maximum likelihood.
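To make the functional form concrete, here is a small R sketch of the von Bertalanffy mean function; the function name and the parameter values are purely illustrative, not estimates:

vonbert <- function(age, linf, k, t0) {
  # von Bertalanffy mean weight-at-age
  linf * (1 - exp(-k * (age - t0)))
}
# e.g., predicted mean weight at age 5 under some plausible parameter values
vonbert(age = 5, linf = 100, k = 0.25, t0 = -0.1)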
10.3 Introductory example: Estimating slug densities
Crawley (2012) describes a simple data set consisting of counts of the number of slugs found beneath 40 tiles in each of two permanent grasslands. The data are contained within the Data4Ecologists library.
library(Data4Ecologists)
data(slugs)
str(slugs)
'data.frame': 80 obs. of 2 variables:
$ slugs: int 3 0 0 0 0 1 0 3 0 0 ...
$ field: chr "Nursery" "Nursery" "Nursery" "Nursery" ...
Let’s begin by looking at the distribution of slug counts under the tiles in these two fields (Figure 10.2).
ggplot(slugs, aes(slugs, fill=field))+
geom_bar(position=position_dodge())+
theme(text = element_text(size=20))+
scale_fill_manual(values=c("red", "blue"))+
scale_x_continuous(breaks=seq(0,11,1))
We see that there are more slugs, on average, under the tiles located at the Rookery site than the Nursery site. Also note that the data take on only non-negative integer values (0, 1, 2, ...).
What if we wanted to conduct a formal hypothesis test to see whether the mean number of slugs per tile differed between the two sites? I.e., what if we wanted to test the Null Hypothesis, $H_0: \mu_{\text{Nursery}} = \mu_{\text{Rookery}}$?
We could consider using a two-sample t-test for a difference in means. In this case, what would we have to assume? At a minimum, we would need to assume that the observations are:
- Independent
- Normally distributed or that the sample size is “large enough” for the Central Limit Theorem (CLT) to apply
A common guideline is that the CLT will provide a reasonable approximation when the sample size in each group is roughly 30 or more.
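For reference, the two-sample t-test mentioned above is a one-liner in R; we show it here only as a point of comparison with the likelihood-based approach developed below:

# Welch two-sample t-test comparing mean slug counts between the two fields;
# this relies on the CLT because the counts are clearly non-Normal
t.test(slugs ~ field, data = slugs)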
Let’s begin by assuming the counts at each site follow a Poisson distribution. We could then represent our null and alternative hypotheses using the models, below:
Null model: $Y_i \sim \text{Poisson}(\lambda)$ for all tiles, regardless of field.
Alternative model: $Y_i \sim \text{Poisson}(\lambda_N)$ for tiles in the Nursery field and $Y_i \sim \text{Poisson}(\lambda_R)$ for tiles in the Rookery field.
OK, great - but how do we estimate $\lambda$, $\lambda_N$, and $\lambda_R$?
10.4 Probability to the likelihood
To estimate parameters using Maximum Likelihood, we begin by writing down a probability statement reflecting an assumed data-generating model. If all of our observations are independent, then the probability of the full data set is the product of the probabilities of the individual observations:

$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n) = \prod_{i=1}^{n} P(Y_i = y_i)$$
Let’s build the likelihood of the slug data using the null data-generating model, $Y_i \sim \text{Poisson}(\lambda)$. The first observation in the data set is $y_1 = 3$, and the Poisson probability mass function gives us the probability of observing a count of 3.
Thus, we have that:

$$P(Y_1 = 3 \mid \lambda) = \frac{e^{-\lambda}\lambda^{3}}{3!}$$
We can write down similar expressions for all of the other observations in the data set and multiply them together to give us the likelihood:

$$L(\lambda; y_1, \ldots, y_n) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}$$
Or, in more general notation, the likelihood for $n$ independent observations is given by:

$$L(\theta; y_1, \ldots, y_n) = \prod_{i=1}^{n} f(y_i; \theta)$$

where $f(y_i; \theta)$ is the probability mass (or density) function of $y_i$ given parameters $\theta$.
For discrete distributions, the likelihood gives us the probability of obtaining the observed data, given a set of parameters (in this case, $\lambda$). For example, suppose we flipped a fair coin 12 times and observed: Head, Tail, Tail, Head, Head, Head, Tail, Tail, Head, Tail, Tail, Head. The probability associated with this specific set of coin flips is $(1/2)^{12} \approx 0.00024$.
10.5 Maximizing the likelihood
The likelihood of our slug data under the null data-generating model, which assumes the observations are independent realizations from a common Poisson distribution with parameter $\lambda$, is given by:

$$L(\lambda; y_1, \ldots, y_{80}) = \prod_{i=1}^{80} \frac{e^{-\lambda}\lambda^{y_i}}{y_i!}$$

Using the result that $e^{-\lambda} \cdot e^{-\lambda} \cdots e^{-\lambda} = e^{-80\lambda}$ and $\lambda^{y_1}\lambda^{y_2}\cdots\lambda^{y_{80}} = \lambda^{\sum_i y_i}$, we can simplify this expression to:

$$L(\lambda; y_1, \ldots, y_{80}) = \frac{e^{-80\lambda}\lambda^{\sum_{i=1}^{80} y_i}}{\prod_{i=1}^{80} y_i!}$$
As previously mentioned, often the probability of the observed data (i.e., the likelihood) will be extremely small. This can create lots of numerical challenges when trying to find the parameters that maximize the likelihood. For this, and for other reasons3, it is more common to work with the log-likelihood. Because the relationship between the likelihood and the log-likelihood is monotonic (i.e., increases in the likelihood always coincide with increases in the log-likelihood), the same value of $\lambda$ will maximize both.
The log-likelihood for the Poisson model is thus given by5:

$$\log L(\lambda; y_1, \ldots, y_{80}) = -80\lambda + \sum_{i=1}^{80} y_i \log(\lambda) - \sum_{i=1}^{80} \log(y_i!)$$
We will use this expression to find the value of $\lambda$ that maximizes the log-likelihood (and therefore also the likelihood).
How can we find the value of $\lambda$ that maximizes the log-likelihood? We have a few options:
- calculus
- using a grid search
- using functions in R or other software that rely on numerical methods to find an (approximate) maximum or minimum of a function
10.5.1 Calculus
To find the value of $\lambda$ that maximizes the log-likelihood, we can take the derivative of the log-likelihood with respect to $\lambda$, set it equal to 0, and solve for $\lambda$. The derivative of $-80\lambda$ is $-80$, the derivative of $\sum_i y_i \log(\lambda)$ is $\sum_i y_i/\lambda$, and the last term does not involve $\lambda$.
Putting things together, we have:

$$\frac{d \log L}{d\lambda} = -80 + \frac{\sum_{i=1}^{80} y_i}{\lambda}$$

Setting this expression to 0 and solving gives us:

$$\hat{\lambda} = \frac{\sum_{i=1}^{80} y_i}{80} = \bar{y}$$

To make sure we have found a maximum (and not a minimum), we should ideally take the second derivative of the log-likelihood and evaluate it at $\hat{\lambda}$; here, the second derivative, $-\sum_{i=1}^{80} y_i/\lambda^2$, is negative for any $\lambda > 0$, confirming a maximum.
Thus, we have verified that the maximum likelihood estimator of $\lambda$ is the sample mean, which we can compute directly:
mean(slugs$slugs)
[1] 1.775
10.5.2 Grid search
What if your likelihood is more complicated (e.g., it involves more than 1 parameter), you do not have access to the internet for a derivative calculator, or, more likely, there is no analytical solution available when you set the derivatives with respect to the different parameters equal to 0?7 In these cases, you could find an approximate solution using numerical methods, e.g., by using a grid search. Usually, I have students explore this concept using Excel, where they can enter equations for calculating the probability mass function associated with each observation and then multiply these values together to get the likelihood (Equation 10.2). Then, we transition to R to see how we can accomplish the same task using functions.
Functions are incredibly useful for organizing code that needs to be repeatedly executed, while also allowing for flexibility depending on user specified options passed as arguments. We have been using functions throughout this course (e.g., lm, plot, etc). We will need to create our own function, which we will call Like.s (for likelihood slugs; you can of course use another name). This function will calculate the likelihood of the slug data under the null data-generating model for any specific value of $\lambda$. To do this, we need to:
- Calculate $P(Y_i = y_i)$ using the probability mass function of the Poisson distribution and a given value of $\lambda$ for all observations in the data set.
- Multiply these different probabilities together to give us the likelihood of the data for that particular value of $\lambda$.
Recall that R has built in functions for working with statistical distributions (Chapter 9). For the Poisson distribution, we can use the dpois function to calculate $P(Y_i = y_i)$. For example, the probability of observing a count of 3 when $\lambda = 2$ is given by:
dpois(3, lambda=2)
[1] 0.180447
To verify that this number is correct, we could calculate it using the formula for the probability mass function of the Poisson distribution:
exp(-2)*2^3/factorial(3)
[1] 0.180447
So, we have accomplished our first objective. If, instead, we wanted to use a negative binomial distribution, we would use the dnbinom function, passing it parameter values for mu and size. Next, we need to multiply these probabilities together. R has a built in function called prod for taking the product of elements in a vector. For example, we can calculate $P(Y_1 = 3)P(Y_2 = 2)$ when $\lambda = 2$ in either of two ways:
dpois(3,lambda=2)*dpois(2, lambda=2)
[1] 0.0488417
prod(dpois(c(3,2), lambda=2))
[1] 0.0488417
The latter approach will be easier to work with when we have a large (and possibly unknown) number of observations. Putting everything together, we write our likelihood function so that it includes 2 arguments, lambda (the parameter of the Poisson distribution) and dat (a data set of observations):
Like.s<-function(lambda, dat){
prod(dpois(dat, lambda))
}
We can now evaluate the likelihood for a given value of $\lambda$, say $\lambda = 2$:
Like.s(lambda=2, dat=slugs$slugs)
[1] 5.532103e-78
We see that this probability is really small, but again, what we are interested in is finding the value of $\lambda$ that makes this probability as large as possible. We can evaluate the likelihood over a fine grid of possible values for $\lambda$ using a loop:
# create a grid of lambda values
lambda.test <- seq(0.1, 5, length=1000)
# Create matrix to store likelihood values
L.vals <- matrix(NA, 1000, 1)
for(i in 1:1000){
L.vals[i]<-Like.s(lambda=lambda.test[i], dat=slugs$slugs)
}
Alternatively, we could use sapply in base R or map_dbl in the purrr package (Henry and Wickham 2020) to vectorize the calculations, which will speed things up (though the loop is really fast in this case). Both sapply and map_dbl will evaluate a function for all values in a vector or list and return the output as a new vector. To use sapply or map_dbl, we need to pass the vector of $\lambda$ values (lambda.test) as the first argument and our Like.s function as the second argument. We also need to pass the slug observations so that Like.s can calculate the likelihood for each value of $\lambda$. The equivalent code using sapply or map_dbl is given by:
sapply(lambda.test, Like.s, dat=slugs$slugs)
purrr::map_dbl(lambda.test, Like.s, dat=slugs$slugs)
Let’s look at how the likelihood changes as we vary $\lambda$ (Figure 10.3):
plot(lambda.test, L.vals,
xlab = expression(lambda),
ylab = "Likelihood",
type = "l")
We can find the value of $\lambda$ that maximizes the likelihood using the which.max function in R8:
# lambda that maximizes the likelihood
lambda.test[which.max(L.vals)]
[1] 1.772573
This value is identical to the sample mean out to 2 decimal places (we could get a more accurate answer if we used a finer grid).
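For example, a finer grid near the maximum (a sketch; the grid limits here are arbitrary) recovers the sample mean more precisely:

# refine the grid near the maximum
lambda.fine <- seq(1.7, 1.9, length = 10000)
L.fine <- sapply(lambda.fine, Like.s, dat = slugs$slugs)
lambda.fine[which.max(L.fine)] # very close to the sample mean, 1.775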
As previously mentioned, the likelihood values are really small for all values of $\lambda$, which can cause numerical problems. Thus, it is better to work with the log-likelihood. To calculate the log-likelihood, we again use the dpois function, but we add the argument log = TRUE so that this function returns $\log(P(Y_i = y_i))$; we then sum these values (using the sum function) rather than taking their product. Instead of writing another loop, we will take advantage of the map_dbl function in the purrr library to calculate the log-likelihood values:
#log Likelihood
logL.s<-function(lambda, dat){
sum(dpois(dat,lambda, log = TRUE))
}
# evaluate at the same grid of lambda values
logL.vals <- purrr::map_dbl(lambda.test, logL.s, dat=slugs$slugs)
# find the value that maximizes the log L and plot
lambda.test[which.max(logL.vals)]
[1] 1.772573
We see that the same value of $\lambda$ maximizes both the likelihood and the log-likelihood.
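A sketch of code to plot the log-likelihood over the grid (analogous to the likelihood plot above) is:

plot(lambda.test, logL.vals,
     xlab = expression(lambda),
     ylab = "Log-likelihood",
     type = "l")
abline(v = lambda.test[which.max(logL.vals)], lty = 2) # mark the MLE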
10.5.3 Using numerical optimizers
Most software packages have built in functions that can find parameter values that either maximize or minimize a specified function. R has several functions available for this purpose, including optim, nlm, nlminb, among others. In this section, we will explore the use of optim, which can be used to find the parameters that minimize any function. It has the following arguments that we must supply:
- par = a vector of “starting” parameters (first guesses).
- fn = a function to minimize. Importantly, the first argument of this function must correspond to the vector of parameters that we want to optimize (i.e., estimate).
These two arguments are expected to occur in this order with par first and fn second. But, as with any function in R, we can change the order of the arguments as long as we use their names. Consider, for example, dpois. Its first two arguments are x and lambda. Normally, I am lazy and just pass the arguments in this order without supplying their names. For example,
dpois(3, 5)
[1] 0.1403739
We can, however, reverse the order if we use the argument names:
dpois(lambda = 5, x = 3)
[1] 0.1403739
We can use the method argument to specify a particular optimization method (we will use method = "BFGS"). We can also pass other arguments to optim (e.g., dat, which is needed to calculate the value of the log-likelihood). Importantly, by default, optim tries to find a minimum. Thus, we create a new function, minus.logL.s that calculates the negative log-likelihood. The value that minimizes the negative log-likelihood will maximize the log-likelihood.
minus.logL.s <- function(lambda, dat){
-sum(dpois(dat,lambda, log = TRUE))
}
Most numerical optimization routines are iterative. They begin with a starting value (or guess), and then use information about the function (and its derivatives) at that guess to derive the next value to try. This results in a sequence of iterative guesses that hopefully converge to the true minimum. If there is only 1 minimum (or maximum), most optimizers will work well. But, if there are multiple minima or maxima, then optimizers will often converge to the one closest to the starting value. Thus, it can be important to try different starting values to help determine if there are multiple minima. If you find that the results from using optim depend on the starting value, then you should be cautious, try multiple starting values, and select the one that results in the smallest negative log-likelihood. You might also consider trying alternative estimation approaches (e.g., Bayesian methods that rely on Markov chain Monte Carlo [MCMC] sampling may be more robust; Lele, Dennis, and Lutscher 2007).
For the slug example, there is only 1 minimum (Figure 10.3, Figure 10.4), and therefore, most starting values for $\lambda$ should lead to the same maximum likelihood estimate.
mle.fit<-optim(par = 2, fn = minus.logL.s, method = "BFGS", dat = slugs$slugs)
Warning in dpois(dat, lambda, log = TRUE): NaNs produced
Note that optim outputs a warning here about NaNs being produced. The parameter, $\lambda$, must be positive, but we have not told optim this, and therefore, it is possible that it will try negative values when attempting to evaluate the log-likelihood. In this case, dpois will return an NaN, giving us the warning. Yet, if we inspect the output, we will see that optim was able to continue and find the value of $\lambda$ that minimizes the negative log-likelihood:
mle.fit
$par
[1] 1.775
$value
[1] 176.8383
$counts
function gradient
15 6
$convergence
[1] 0
$message
NULL
We see that optim returns:
- par = the value of $\lambda$ that minimizes the negative log-likelihood; this is the maximum likelihood estimate!
- value = the value of the function (i.e., the negative log-likelihood) at its minimum
- counts = information about how hard optim had to work to find the minimum. We see that it evaluated the negative log-likelihood for 15 different values of $\lambda$ and calculated its gradient 6 times (function and gradient, respectively)
- convergence = an indicator of whether the optimizer converged; a value of 0 suggests the optimizer was successful, whereas a value of 1 would indicate that it hit a limit based on the maximum number of iterations allowed (which can be changed from a default value of 100)
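As a quick sanity check on the earlier point about starting values, we could rerun the optimization from several different starting values and confirm that they all lead to the same estimate (a sketch):

# try several starting values; all should return (approximately) 1.775
starts <- c(0.5, 1, 3, 10)
sapply(starts, function(s) {
  optim(par = s, fn = minus.logL.s, method = "BFGS", dat = slugs$slugs)$par
})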
In summary, we have used 3 different methods for finding the value of $\lambda$ that maximizes the likelihood: calculus, a grid search, and a numerical optimizer. All three lead us to (approximately) the sample mean, 1.775.
10.5.4 Fitting the alternative data-generating model (site-specific Poisson distributions)
Consider our alternative data-generating model:

$$Y_i \sim \text{Poisson}(\lambda_N) \text{ for tiles in the Nursery field, and } Y_i \sim \text{Poisson}(\lambda_R) \text{ for tiles in the Rookery field}$$

We could estimate $\lambda_N$ and $\lambda_R$ using the field-specific sample means, but we will instead use optim as it offers a more rigorous way of finding parameters that maximize the likelihood and will also facilitate estimation of parameter uncertainty, as we will see later. Using optim will require rewriting our negative log-likelihood function to allow for two parameters. We will also need to pass to the function the data that identify which field the observations come from so that it uses the correct $\lambda$ for each observation:
minus.logL.2a<-function(lambdas, dat){
-sum(dpois(dat$slugs[dat$field=="Nursery"], lambdas[1], log=TRUE))-
sum(dpois(dat$slugs[dat$field=="Rookery"], lambdas[2], log=TRUE))
}
Earlier, we noted that we did not tell optim that our parameters must be > 0 (i.e., that $\lambda_N > 0$ and $\lambda_R > 0$). One option is to use method = "L-BFGS-B" and supply a lower bound for both parameters via an extra argument, lower = c(0, 0) (a sketch of this approach is given below). Alternatively, we could choose to specify our likelihood in terms of parameters estimated on the log scale and then exponentiate these parameters to get $\lambda_N$ and $\lambda_R$:
minus.logL.2b<-function(log.lambdas, dat){
-sum(dpois(dat$slugs[dat$field=="Nursery"], exp(log.lambdas[1]), log=TRUE))-
sum(dpois(dat$slugs[dat$field=="Rookery"], exp(log.lambdas[2]), log=TRUE))
}
In this way, we estimate parameters (the log of each $\lambda$) that can take on any value between $-\infty$ and $\infty$, while ensuring that the values passed to dpois will be > 0 because we exponentiate the optimized parameters using exp(). In fact, this is what R does when it fits the Poisson model using its built in glm function.
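Here is a minimal sketch of the box-constrained alternative mentioned above, applied to minus.logL.2a; we use a small positive lower bound (rather than exactly 0) so that the negative log-likelihood stays finite at the boundary:

fit.bounded <- optim(par = c(2, 2), fn = minus.logL.2a, dat = slugs,
                     method = "L-BFGS-B", lower = c(1e-6, 1e-6))
fit.bounded$par # estimates of lambda_N and lambda_R on the original scale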
Let’s use optim with minus.logL.2b to estimate $\log(\lambda_N)$ and $\log(\lambda_R)$ (we now need to supply a vector of two starting values to optim, say par = c(2, 2)).
mle.fit.2lambdas <- optim(par = c(2, 2), fn = minus.logL.2b, dat = slugs, method = "BFGS")
mle.fit.2lambdas
$par
[1] 0.2429461 0.8219799
$value
[1] 171.1275
$counts
function gradient
29 10
$convergence
[1] 0
$message
NULL
Remember, our parameters now reflect $\log(\lambda_N)$ and $\log(\lambda_R)$. We can exponentiate them to get estimates of $\lambda_N$ and $\lambda_R$:
exp(mle.fit.2lambdas$par)
[1] 1.275 2.275
These are equivalent to the sample means in the two fields:
slugs %>% group_by(field) %>%
dplyr::summarize(mean(slugs))
# A tibble: 2 × 2
field `mean(slugs)`
<chr> <dbl>
1 Nursery 1.27
2 Rookery 2.28
At this point, to test your understanding, you might consider the following two exercises:
Exercise: modify the likelihood function to fit a model assuming the slug counts were generated by a negative binomial distribution (ecological version; Section 9.10.6.2).
Exercise: modify the two parameter Poisson model so that it is parameterized using reference coding. Your estimates should match those below:
glm(slugs~field, data=slugs, family=poisson())
Call: glm(formula = slugs ~ field, family = poisson(), data = slugs)
Coefficients:
(Intercept) fieldRookery
0.2429 0.5790
Degrees of Freedom: 79 Total (i.e. Null); 78 Residual
Null Deviance: 224.9
Residual Deviance: 213.4 AIC: 346.3
10.6 Properties of maximum likelihood estimators
Let $\hat{\theta}$ denote the maximum likelihood estimator of a parameter $\theta$. Maximum likelihood estimators (MLEs) have several desirable large-sample properties:
- MLEs are consistent, meaning that as the sample size increases towards $\infty$, they will converge to the true parameter value.
- MLEs are asymptotically unbiased, i.e., $E[\hat{\theta}] \rightarrow \theta$ as $n \rightarrow \infty$. Maximum likelihood estimators are not, however, guaranteed to be unbiased in small samples. As one example, the maximum likelihood estimator of $\sigma^2$ for Normally distributed data is given by $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$, which has as its expected value $\frac{n-1}{n}\sigma^2$. The bias of the maximum likelihood estimator becomes negligible as sample sizes become large. In cases like this where you can derive, analytically, an expression for bias, you can form an unbiased estimator that may perform better in small samples. For example, you may be familiar with the unbiased estimator, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$.
- MLEs will have the minimum variance among all estimators as $n \rightarrow \infty$.
In addition, for large sample sizes, we can approximate the sampling distribution of maximum likelihood estimators using:

$$\hat{\theta} \sim N\left(\theta, I(\hat{\theta})^{-1}\right)$$

where $I(\hat{\theta})$ is the information matrix.
Before we can define the information matrix, we will need to formally define a Hessian; the Hessian is a matrix of second-order partial derivatives of a function, in this case, derivatives of the log-likelihood function with respect to the parameters $\theta$:

$$H(\theta) = \frac{\partial^2 \log L(\theta; y)}{\partial \theta \, \partial \theta^T}$$
If there is only 1 parameter in the likelihood, then the Hessian will be a scalar, equal to the second derivative of the log-likelihood with respect to that parameter.
The information matrix is defined in terms of the Hessian and takes one of two forms:
- The observed information matrix is the negative of the Hessian evaluated at the maximum likelihood estimate: $I(\hat{\theta}) = -H(\hat{\theta})$.
- The expected information matrix is given by the expected value of the negative of the Hessian evaluated at the maximum likelihood estimate: $I(\hat{\theta}) = E\left[-H(\hat{\theta})\right]$.
Note that the Hessian describes the curvature of the log-likelihood surface – the higher the curvature, the more quickly the log-likelihood decreases as we move away from the maximum likelihood estimate (Figure 10.5). If the curvature is large, then the asymptotic variance of $\hat{\theta}$ will be small (and vice versa).

Importantly, the Hessian is often used by numerical optimization methods (e.g., optim) to find parameter values that minimize an objective function (e.g., the negative log-likelihood). We can ask optim to return the Hessian by supplying the argument hessian = TRUE, which will be equivalent to the observed information matrix (since we defined our function as the negative of the log-likelihood, the minus sign will already be taken care of). We can then calculate the asymptotic variance-covariance matrix of $\hat{\theta}$ by inverting the Hessian returned by optim.
Let’s demonstrate with our slug example, where we fit the alternative data-generating model that assumed separate Poisson distributions for each field. We begin by asking optim to return the Hessian by supplying an extra argument, hessian = TRUE:
mle.fit.2lambdas <- optim(par = c(2, 2),
fn = minus.logL.2b,
dat = slugs,
method = "BFGS",
hessian = TRUE)
mle.fit.2lambdas
$par
[1] 0.2429461 0.8219799
$value
[1] 171.1275
$counts
function gradient
29 10
$convergence
[1] 0
$message
NULL
$hessian
[,1] [,2]
[1,] 5.100001e+01 -1.421085e-08
[2,] -1.421085e-08 9.100001e+01
We can calculate asymptotic standard errors for our parameters by finding the inverse of the Hessian. To find the inverse of a matrix, H, in R, we can use solve(H). Thus, we can find the variance/covariance matrix associated with our parameter estimates using:
(var.cov.lambda <- solve(mle.fit.2lambdas$hessian))
             [,1]         [,2]
[1,] 1.960784e-02 3.062023e-12
[2,] 3.062023e-12 1.098901e-02
The diagonal elements of this matrix correspond to our estimates of the variance of our parameters; the square root of the diagonals thus give us their standard errors9
sqrt(diag(var.cov.lambda))
[1] 0.1400280 0.1048285
The off-diagonals of our variance-covariance matrix tell us about how our parameter estimators covary in their sampling distribution (i.e., do large values of $\hat{\lambda}_N$ tend to occur with large or small values of $\hat{\lambda}_R$ across repeated samples?). In this case, the covariances are essentially 0.
Lastly, we can compare our estimates and their standard errors to those that are returned from R’s glm function, which we will see in Chapter 15:
summary(glm(slugs ~ field -1, data = slugs, family = "poisson"))
Call:
glm(formula = slugs ~ field - 1, family = "poisson", data = slugs)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
fieldNursery 0.2429 0.1400 1.735 0.0827 .
fieldRookery 0.8220 0.1048 7.841 4.46e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 263.82 on 80 degrees of freedom
Residual deviance: 213.44 on 78 degrees of freedom
AIC: 346.26
Number of Fisher Scoring iterations: 6
We see that we have obtained exactly the same estimates and standard errors as reported by the glm function. Also, note that at the bottom of the summary output, we see Number of Fisher Scoring iterations: 6. This comparison highlights:
- glm is fitting a model parameterized in terms of the log mean (i.e., $\log(\lambda)$)
- glm is using numerical routines similar to optim and also reports standard errors that rely on the asymptotic results discussed in the above section, namely, $\hat{\theta} \sim N(\theta, I(\hat{\theta})^{-1})$.
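We could also pull the same asymptotic variance-covariance matrix directly from the fitted glm object using vcov() (a quick check, not part of the workflow above):

fit.glm <- glm(slugs ~ field - 1, data = slugs, family = "poisson")
vcov(fit.glm)             # inverse information matrix, analogous to solve(hessian)
sqrt(diag(vcov(fit.glm))) # standard errors reported by summary()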
10.7 A more complicated example: Fitting a weight-at-age model
Let us now return to our motivating example involving a weight-at-age model for female black bears in Minnesota. We will use optim to fit the following model:

$$W_i \sim N(\mu_i, \sigma^2_i), \quad \text{with } \mu_i = L_\infty\left(1 - e^{-K(\text{age}_i - t_0)}\right) \text{ and } \sigma^2_i = \sigma^2|\mu_i|^{2\theta}$$
This allows us to model the mean using the von Bertalanffy growth function and model the variability about this mean using the variance function from Section 5.5.3.
We begin by constructing the likelihood function with the following parameters: $L_\infty$ (pars[1]), $K$ (pars[2]), $t_0$ (pars[3]), $\log(\sigma)$ (pars[4]), and $\theta$ (pars[5]). Note that we estimate $\sigma$ on the log scale so that it is constrained to be positive after exponentiation.
logL<-function(pars, dat){
linf <- pars[1]
k <- pars[2]
t0 <- pars[3]
sig <- exp(pars[4])
theta <- pars[5]
mu<-linf*(1 - exp(-k*(dat$age - t0)))
vari<-sig^2*abs(mu)^(2*theta)
sigi<-sqrt(vari)
ll<- -sum(dnorm(dat$wtkg, mean = mu, sd = sigi, log = T))
return(ll)
}
To get starting values, which we denote using a superscript $(0)$, we can inspect Figure 10.1:
- The smooth curve asymptotes at around 100, so we use $L_\infty^{(0)} = 100$.
- $K$ tells us how quickly we approach this asymptote as bears age. If we set $\mu = L_\infty/2$ and solve for age, this gives age $= t_0 + \log(2)/K$. Then, noting that the mean weight appears to reach $L_\infty/2 = 50$ at around 3 years of age, we set $K^{(0)} = \log(2)/3 \approx 0.23$.
- $t_0$ is the x-intercept (i.e., the age at which $\mu = 0$); an extrapolation of the curve to $\mu = 0$ suggests that $t_0$ should be slightly less than 0, say $t_0^{(0)} = -0.1$.
- Most of the residuals when age is close to 0 are within approximately $\pm 10$ kg. For Normally distributed data, we expect 95% of the observations to be within $\pm 2\sigma$, so we set $\sigma^{(0)} = 5$, i.e., roughly 1.6 on the log scale.
- We set $\theta^{(0)} = 1$ (suggesting the standard deviation increases linearly with the mean).
fitvb <- optim(par = c(100, 0.23, -0.1, 1.62, 1),
fn = logL,dat = beargrowth,
method = "BFGS", hessian = TRUE)
fitvb
$par
[1] 87.1070613 0.2853005 0.1530582 -0.1744772 0.6607753
$value
[1] 2156.353
$counts
function gradient
108 25
$convergence
[1] 0
$message
NULL
$hessian
[,1] [,2] [,3] [,4] [,5]
[1,] 1.633524 200.5033 -31.11917 8.632616 33.2212
[2,] 200.503252 36487.7264 -7444.52556 1325.372227 4676.6273
[3,] -31.119171 -7444.5256 2101.48826 -271.183081 -841.5901
[4,] 8.632616 1325.3722 -271.18308 1138.010281 4383.5226
[5,] 33.221200 4676.6273 -841.59012 4383.522578 17355.9438
Let’s inspect the fit of our model by adding our predictions from the model to our data set (Figure 10.6).
pars<-data.frame(Linf = fitvb$par[1], K = fitvb$par[2], t0 = fitvb$par[3],
sigma = exp(fitvb$par[4]), theta = fitvb$par[5])
beargrowth <- beargrowth %>%
mutate(mu.hat = pars$Linf*(1 - exp(-pars$K*(age - pars$t0))))
ggplot(beargrowth, aes(age, wtkg)) +
geom_point() + geom_line(aes(age, mu.hat), col = "red", linewidth = 1) +
ylab("Weight (kg)") + xlab("Age (years)") +
theme_bw()
Next, let’s create and inspect plots using standardized residuals (Figure 10.7). In particular, we can plot standardized residuals versus fitted values to evaluate whether the von Bertalanffy growth curve adequately describes how the mean weight varies with age. We can also plot the square root of the absolute values of these residuals versus fitted values to evaluate whether our variance model is appropriate.
beargrowth <- beargrowth %>%
mutate(sig.hats = sqrt(pars$sigma^2*abs(mu.hat)^(2*pars$theta))) %>%
mutate(stdresids = (wtkg - mu.hat)/(sig.hats),
sqrt.abs.resids = sqrt(abs(stdresids)))
p1<-ggplot(beargrowth, aes(mu.hat, stdresids)) + geom_point() +
geom_hline(yintercept=0) + geom_smooth() + theme_bw()
p2<-ggplot(beargrowth, aes(mu.hat, sqrt.abs.resids)) + geom_point() +
geom_smooth() + theme_bw()
p1+p2
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The standardized residuals do not exhibit any trend as we move from left to right in the residual versus fitted value plot (left panel of Figure 10.7), but residuals for the smallest bears appear to be slightly more variable (right panel of Figure 10.7). At this point, we could try to fit alternative growth models for the mean weight at age or alternative models for the variance about the mean to see if we could better match the data. Nonetheless, the fit of the von Bertalanffy growth curve (Figure 10.6) provides a nice summary of the weight-at-age relationship. What if we wanted to add confidence or prediction intervals to our plot?
10.8 Confidence intervals for functions of parameters
The results from Section 10.6 provide a way for us to calculate uncertainty associated with our model parameters. I.e., if we let $\hat{\psi} = (\hat{L}_\infty, \hat{K}, \hat{t}_0, \log(\hat{\sigma}), \hat{\theta})$, we can estimate the asymptotic variance-covariance matrix of $\hat{\psi}$ by inverting the Hessian returned by optim:
(vcov.psi <- solve(fitvb$hessian))
             [,1]          [,2]          [,3]         [,4]          [,5]
[1,] 2.737807861 -2.424357e-02 -0.0455615458 0.0052606707 -2.245891e-03
[2,] -0.024243567 3.165257e-04 0.0007662437 0.0001665308 -4.378909e-05
[3,] -0.045561546 7.662437e-04 0.0025635157 0.0016454637 -4.105409e-04
[4,] 0.005260671 1.665308e-04 0.0016454637 0.0346870394 -8.735923e-03
[5,] -0.002245891 -4.378909e-05 -0.0004105409 -0.0087359227 2.260206e-03
The diagonals of this matrix tell us about the expected variability of our parameter estimates (across repetitions of data collection and analysis), with the square root of these elements giving us our standard errors. Thus, we can form 95% confidence intervals for our model parameters using the assumption that the sampling distribution is Normal. For example, a 95% confidence interval for $L_\infty$ is given by $\hat{L}_\infty \pm 1.96\,SE(\hat{L}_\infty)$:
SE.Linf<-sqrt(diag(vcov.psi))[1]
fitvb$par[1] + c(-1.96, 1.96)*rep(SE.Linf,2)
[1] 83.86398 90.35014
All good, but again, what if we want a confidence interval for the mean weight at a given age, $\mu = L_\infty(1 - e^{-K(\text{age} - t_0)})$, which is a non-linear function of several parameters?
Recall, in Chapter 5, we learned how to use matrix multiplication to get predicted values and their variance for linear models, possibly with non-constant variance. We briefly review those results here. Recall, if $\hat{\mu} = X\hat{\beta}$, then $\widehat{\text{var}}(\hat{\mu}) = X \widehat{\text{var}}(\hat{\beta}) X^T$. That result applies to linear functions of the parameters. For non-linear functions, like the von Bertalanffy growth curve, we have a few options:
- Use a bootstrap (see Chapter 2; a minimal sketch is given after this list)
- Use the Delta method, which we will see shortly
- Switch to Bayesian inference and use posterior distributions
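Here is a minimal sketch of the first option, a nonparametric bootstrap that refits the model to resampled data sets; the object names (nboot, boot.mu) and the number of replicates are just for illustration:

set.seed(1)
nboot <- 100 # use more replicates in practice
age.grid <- 1:35
boot.mu <- matrix(NA, nboot, length(age.grid))
for (b in 1:nboot) {
  # resample rows of the data with replacement and refit the model
  bdat <- beargrowth[sample(nrow(beargrowth), replace = TRUE), ]
  bfit <- optim(par = fitvb$par, fn = logL, dat = bdat, method = "BFGS")
  boot.mu[b, ] <- bfit$par[1] * (1 - exp(-bfit$par[2] * (age.grid - bfit$par[3])))
}
# pointwise 95% bootstrap confidence limits for the mean weight-at-age curve
ci.boot <- apply(boot.mu, 2, quantile, probs = c(0.025, 0.975))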
10.8.1 Delta Method
To understand how the delta method is derived, you have to know something about Taylor’s series approximations. Taylor’s series approximations are incredibly useful (e.g., they also serve to motivate some of the numerical methods we have used, including the BFGS method implemented by optim). We won’t get into the gory details here, but rather will focus on implementation. To implement the delta method, we need to be able to take derivatives of our function with respect to each of our parameters.
Let $g(\theta)$ denote a (possibly non-linear but differentiable) function of our parameters $\theta$, and let $g'(\theta)$ denote the vector of partial derivatives of $g$ with respect to each parameter.
Then, we can approximate the variance of $g(\hat{\theta})$ using:

$$\widehat{\text{var}}\left(g(\hat{\theta})\right) \approx g'(\hat{\theta})\, \widehat{\text{var}}(\hat{\theta})\, g'(\hat{\theta})^T$$
Because our function for the mean does not involve $\sigma$ or $\theta$, we only need the partial derivatives with respect to $L_\infty$, $K$, and $t_0$, along with the corresponding $3 \times 3$ block of the variance-covariance matrix.
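For reference, the three partial derivatives that we code in the fprime matrix below are:

$$\frac{\partial \mu}{\partial L_\infty} = 1 - e^{-K(\text{age}-t_0)}, \qquad \frac{\partial \mu}{\partial K} = L_\infty(\text{age}-t_0)\,e^{-K(\text{age}-t_0)}, \qquad \frac{\partial \mu}{\partial t_0} = -L_\infty K\, e^{-K(\text{age}-t_0)}$$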
To calculate the variance for a range of ages (e.g., ages 1 through 35), we first compute the predicted means and the matrix of partial derivatives, fprime, with one row per age and one column per parameter:
age <- 1:35
mu.hat <- pars$Linf*(1-exp(-pars$K*(age-pars$t0)))
fprime<-as.matrix(cbind(1-exp(-pars$K*(age-pars$t0)),
pars$Linf*(age-pars$t0)*exp(-pars$K*(age-pars$t0)),
-pars$Linf*pars$K*exp(-pars$K*(age-pars$t0))))
We then determine the variance using matrix multiplication, pulling off the diagonal elements (the off-diagonal elements hold the covariances between observations for different ages):
var.mu.hat <- diag(fprime %*% solve(fitvb$hessian)[1:3,1:3] %*% t(fprime))
Alternatively, the emdbook package (B. Bolker 2023) has a function, deltavar, that will do the calculations for you if you supply an expression for calculating the mean, a list of parameter estimates (via the meanval argument), and the asymptotic variance-covariance matrix (via the Sigma argument).
library(emdbook)
var.mu.hat.emd <- deltavar(linf*(1-exp(-k*(age-t0))),
meanval=list(linf=fitvb$par[1], k=fitvb$par[2], t0=fitvb$par[3]),
Sigma=solve(fitvb$hessian)[1:3,1:3])
We can take the square root of var.mu.hat to get SEs for forming pointwise confidence intervals for the mean weight at each age:
muhats <- data.frame(age=age,
mu.hat=mu.hat,
se.mu.hat = sqrt(var.mu.hat))
ggplot(beargrowth, aes(age, wtkg)) +
geom_point() + geom_line(aes(age, mu.hat), col="red", size=1) +
ylab("Weight (kg)") + xlab("Age (years)") + theme_bw() +
geom_ribbon(data=muhats, aes(x=age,
ymin=mu.hat-1.96*se.mu.hat,
ymax=mu.hat+1.96*se.mu.hat),
inherit.aes = FALSE, fill = "blue", alpha=0.5)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Lastly, we could calculate prediction intervals, similar to the example in Chapter 5, by adding $\pm 2\hat{\sigma}_i = \pm 2\sqrt{\hat{\sigma}^2|\hat{\mu}_i|^{2\hat{\theta}}}$ to the confidence limits for the mean:
muhats <- muhats %>%
mutate(pi.up = mu.hat + 1.96*se.mu.hat + 2*sqrt(pars$sigma^2*mu.hat^(2*pars$theta)),
pi.low = mu.hat - 1.96*se.mu.hat - 2*sqrt(pars$sigma^2*mu.hat^(2*pars$theta)))
ggplot(beargrowth, aes(age, wtkg)) +
geom_point() + geom_line(aes(age, mu.hat), col="red", size=1) +
ylab("Weight (kg)") + xlab("Age (years)") + theme_bw() +
geom_ribbon(data=muhats, aes(x=age,
ymin=mu.hat-1.96*se.mu.hat,
ymax=mu.hat+1.96*se.mu.hat),
inherit.aes = FALSE, fill = "blue", alpha=0.2)+
geom_ribbon(data=muhats, aes(x=age,
ymin=pi.low,
ymax=pi.up),
inherit.aes = FALSE, fill = "red", alpha=0.2)
10.9 Likelihood ratio test
A likelihood ratio test can be used to compare nested models with the same probability generating mechanism (i.e., same statistical distribution for the response variable). Parameters must be estimated using maximum likelihood, and we must be able to get from one model to the other by setting a subset of the parameters to specific values (typically 0).
In our slug example, we could use a likelihood ratio test to compare the following 2 models:
Model 1: $Y_i \sim \text{Poisson}(\lambda_i)$
with $\lambda_i = \lambda_N$ for the Nursery field and $\lambda_i = \lambda_R$ for the Rookery field.
Model 2: $Y_i \sim \text{Poisson}(\lambda)$
(i.e., $\lambda_N = \lambda_R = \lambda$)
The test statistic in this case is given by:

$$LR = 2\left[\log L(\hat{\lambda}_N, \hat{\lambda}_R) - \log L(\hat{\lambda})\right]$$
Asymptotically, if the null (reduced) model is true, the sampling distribution of the likelihood ratio test statistic is given by a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters between the two models. We can compute the test statistic using the output from optim (recall that the negative log-likelihood values at the maximum likelihood estimates are stored in the value slot of mle.fit.2lambdas and mle.fit):
(LR<-2*(-mle.fit.2lambdas$value + mle.fit$value))
[1] 11.42156
This gives us a test statistic. To calculate a p-value, we need to compute the probability of obtaining a test statistic this large or larger, if the null hypothesis is true. If the null hypothesis is true, then the likelihood ratio statistic approximately follows a $\chi^2$ distribution with 1 degree of freedom (the two models differ by a single parameter). We can use the pchisq function to calculate this probability:
1-pchisq(LR, df=1); # or
[1] 0.0007259673
pchisq(LR, df=1, lower.tail = FALSE)
[1] 0.0007259673
We can verify this result by comparing it to what we would get if we used the anova function in R to compare the two models:
full.glm<-glm(slugs~field, data=slugs, family=poisson())
reduced.glm<-glm(slugs~1, data=slugs, family=poisson())
anova(full.glm, reduced.glm, test="Chisq")
Analysis of Deviance Table
Model 1: slugs ~ field
Model 2: slugs ~ 1
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 78 213.44
2 79 224.86 -1 -11.422 0.000726 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
10.10 Profile likelihood confidence intervals
Typically, confidence intervals associated with maximum likelihood estimates are formed using the Normal approximation, $\hat{\theta} \pm 1.96\,SE(\hat{\theta})$. Alternatively, we can form profile-likelihood confidence intervals by inverting the likelihood ratio test, i.e., by finding the set of parameter values that would not be rejected by the test. The confint function in R will calculate profile-likelihood based intervals by default for some models (e.g., mixed effect models fit using lmer; Chapter 18). Thus, it is a good idea to have some familiarity with this approach. Let’s demonstrate with a simple example, calculating a profile-likelihood based confidence interval for $\lambda$ under the null data-generating model (a common Poisson distribution for both fields).
A likelihood ratio test could be used to evaluate $H_0: \lambda = \lambda_0$ for any particular value, $\lambda_0$ (e.g., each of the values in lambda.test). Our test statistic (Equation 10.6) would, in this case, compare the negative log-likelihoods when:
- $\lambda$ is set to the maximum likelihood estimator, $\hat{\lambda}$ (our unconstrained model)
- $\lambda$ is set to a specific value, $\lambda_0$ (our constrained model)
We would reject $H_0: \lambda = \lambda_0$ whenever the test statistic is greater than $\chi^2_{1, 0.95} = 3.84$. The 95% profile-likelihood confidence interval is formed from the set of values of $\lambda_0$ that would not be rejected by this test.
To calculate the profile likelihood confidence interval, remember that the negative log-likelihood value at the maximum likelihood estimate is stored as value in the output returned by optim (and specifically, for this problem, as mle.fit$value). In addition, we already calculated the log-likelihood for several values of $\lambda$ (stored in logL.vals) when we carried out the grid search:
# finds all values of lambda that are not rejected by the hypoth test
ind<-I(2*(-mle.fit$value - logL.vals) < qchisq(0.95, df=1))
(CI.95<-range(lambda.test[ind]))
[1] 1.502803 2.081582
Let’s compare this interval to a Normal-based confidence interval. To do so, we need to repeat the optimization but request that optim also return the Hessian.
mle.fit<-optim(par = 2,
fn = minus.logL.s,
method = "BFGS",
dat = slugs$slugs,
hessian = TRUE)
Warning in dpois(dat, lambda, log = TRUE): NaNs produced
SE<-sqrt(solve(mle.fit$hessian))
mle.fit$par + c(-1.96, 1.96)*rep(SE,2)
[1] 1.483049 2.066951
We see that, unlike the Normal-based interval, the profile likelihood interval is not symmetric, and it includes slightly larger values of $\lambda$.
We can extend this basic idea to multi-parameter models, but calculating profile-likelihood-based confidence intervals becomes much more computationally intensive. In order to calculate a profile-likelihood confidence interval for a parameter in a multi-parameter model, we need to:
- Consider a range of values associated with the parameter of interest (analogous to lambda.test in our simple example).
- For each value in [1], we fix the parameter of interest and then determine maximum likelihood estimates for all other parameters (i.e., we need to optimize the likelihood several times, each time with the focal parameter fixed at a specific value).
- We then use the likelihood-ratio test statistic applied to the likelihoods in [2] (our constrained model) and the likelihood resulting from allowing all parameters, including our focal parameter, to be estimated (our unconstrained model).
If we want to get confidence intervals for multiple parameters, we have to repeat this process for each of them. Alternatively, it is possible to calculate confidence regions for multi-parameter sets by considering more than 1 focal parameter at a time and using a multi-degree-of-freedom likelihood-ratio test; see the bbmle package for an implementation in R (B. Bolker and R Development Core Team 2020).
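As a minimal sketch of this recipe applied to the bear growth model (the grid limits and object names are ours, chosen only for illustration), we could profile $L_\infty$ by fixing it at each value in a grid, re-optimizing the remaining parameters, and keeping the values not rejected by the likelihood ratio test:

Linf.grid <- seq(82, 93, length = 50)
profile.nll <- rep(NA, length(Linf.grid))
for(i in seq_along(Linf.grid)){
  # negative log-likelihood with Linf fixed; the other 4 parameters are re-estimated
  nll.fixed <- function(pars4, dat) logL(c(Linf.grid[i], pars4), dat)
  profile.nll[i] <- optim(par = fitvb$par[-1], fn = nll.fixed,
                          dat = beargrowth, method = "BFGS")$value
}
# values of Linf not rejected by the likelihood ratio test form the 95% interval
keep <- 2*(profile.nll - fitvb$value) < qchisq(0.95, df = 1)
range(Linf.grid[keep])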
10.11 Aside: Least squares and maximum likelihood
It is interesting and sometimes useful to know that least-squares and maximum likelihood estimators are equivalent when the data are Normally distributed with constant variance. To see this connection, note that the likelihood for data that are Normally distributed is given by:

$$L(\beta, \sigma^2; y_1, \ldots, y_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right)$$

With simple linear regression, we assume $\mu_i = \beta_0 + \beta_1 x_i$ with constant variance $\sigma^2$.
Thus, the log-likelihood is given by:

$$\log L(\beta, \sigma^2; y_1, \ldots, y_n) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$$

If we want to maximize the log-likelihood with respect to $\beta_0$ and $\beta_1$, only the last term matters, and it enters with a negative sign. Thus, maximizing the log-likelihood with respect to the regression parameters is equivalent to minimizing the sum of squared errors, $\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$ - i.e., least squares.
10.12 References
1. As an aside, if you came to me with this question, I would probably press you to justify why you think a hypothesis test is useful. As my friend and colleague Doug Johnson would point out, this null hypothesis is almost surely false (Johnson 1999). If we exhaustively sampled both fields, we almost surely would find that the densities differ, though perhaps not by much. The only reason why we might fail to reject the null hypothesis is that we have not sampled enough tiles. Thus, a p-value from a null hypothesis test adds little value. We would be better off framing our question in terms of an estimation problem. Specifically, we might want to estimate how different the two fields are in terms of their density, i.e., we should focus on estimating the difference in means, $\lambda_R - \lambda_N$, and its uncertainty.
2. If, alternatively, we had chosen a negative binomial distribution as our data-generating model, we would have used the negative binomial probability mass function here in place of the Poisson.
3. The log-likelihood for independent data is expressed as a sum of independent terms, which makes it relatively easy to derive asymptotic properties of maximum likelihood estimators using the Central Limit Theorem; see Section 10.6.
4. Using the rule that $x^a x^b = x^{a+b}$.
5. Using standard rules for logarithms, e.g., $\log(ab) = \log(a) + \log(b)$, $\log(a^b) = b\log(a)$, and $\log(e^a) = a$.
6. One example is: https://www.symbolab.com/solver/derivative-calculator
7. Most problems cannot be solved analytically. For example, generalized linear models with distributions other than the Normal distribution will require numerical methods when estimating parameters using maximum likelihood.
8. If you see a function that you are unfamiliar with and want to learn more, a good starting point is to access the help page for the function by typing ?functionname. Although R help files can be rather difficult to digest sometimes, it can be helpful to scroll down to the bottom and look for examples of how the functions can be used.
9. Recall, $\sqrt{\text{variance}}$ = standard deviation, and a standard error is just a special type of standard deviation – i.e., a standard deviation of a sampling distribution!
10. Again, if calculus is not your forte, online derivative calculators may prove useful.
