Exercise 3 Solution

L01: Simulations to understand sampling distributions

Includes:

Lion noses linear regression
Data generation consistent with model
Linear regression of this first dataset
In-class Sampling Distribution Simulation Assignment

Document Preamble

Load libraries

library(knitr)
library(abd)
library(ggplot2) # used for the confidence-interval plot near the end

Settings for Knitr (optional)

opts_chunk$set(fig.width = 8, fig.height = 6)

1. Lion noses linear regression:

Data entry

data(LionNoses)
head(LionNoses)

  age proportion.black
1 1.1             0.21
2 1.5             0.14
3 1.9             0.11
4 2.2             0.13
5 2.6             0.12
6 3.2             0.13

Fit linear model

lm.nose<-lm(age~proportion.black, data=LionNoses)

Parameters:

Coefficients are stored in the fitted object lm.nose; the residual standard error is stored in its summary:

coef(lm.nose)

     (Intercept) proportion.black 
       0.8790062       10.6471194

summary(lm.nose)$sigma # residual standard error (estimate of sigma)

[1] 1.668764
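The value returned by `summary()$sigma` is the residual standard error: the square root of the residual sum of squares divided by the residual degrees of freedom. A minimal sketch of that relationship, using a small simulated dataset so it runs without the abd package (the names `x_demo`, `y_demo`, and `fit_demo`, and the parameter values, are made up for illustration and are not part of the exercise):

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
x_demo <- runif(32)                  # hypothetical predictor, n = 32
y_demo <- 1 + 10 * x_demo + rnorm(32, mean = 0, sd = 1.5)
fit_demo <- lm(y_demo ~ x_demo)

# Residual standard error by hand: sqrt(RSS / residual df)
rss <- sum(residuals(fit_demo)^2)
sigma_by_hand <- sqrt(rss / df.residual(fit_demo))

# Should agree with the value reported by summary()
all.equal(sigma_by_hand, summary(fit_demo)$sigma)
```

The same check should work with `lm.nose` in place of `fit_demo`.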

What else is stored in lm.nose and its summary? (residuals, unscaled covariance matrix, etc.)

names(lm.nose)

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"

names(summary(lm.nose))

 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
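Note that the variance-covariance matrix of the coefficients is not stored directly; the summary holds `sigma` and `cov.unscaled`, and `vcov()` combines them. A small sketch of that relationship on simulated data (the object names and parameter values here are hypothetical, chosen only for illustration):

```r
set.seed(2)                          # arbitrary seed
x_demo <- runif(32)
y_demo <- 1 + 10 * x_demo + rnorm(32, sd = 1.5)
fit_demo <- lm(y_demo ~ x_demo)
s <- summary(fit_demo)

# Variance-covariance matrix by hand: sigma^2 * (X'X)^{-1}
vc_by_hand <- s$sigma^2 * s$cov.unscaled
all.equal(vc_by_hand, vcov(fit_demo))

# The slope's standard error is the square root of the [2, 2] element
se_slope <- sqrt(vc_by_hand[2, 2])
all.equal(se_slope, s$coefficients["x_demo", "Std. Error"])
```

This is the standard error that appears in the denominator of the t-statistic later in the exercise.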

2. Data generation consistent with fitted model

## Sample size - use length so it matches the sample size of the original data
n <- length(LionNoses$age)

## Predictor - copy of the original proportion.black data, now in a vector
p.black <- LionNoses$proportion.black

## Parameters
sigma <- summary(lm.nose)$sigma # residual standard error
betas <- coef(lm.nose) # regression coefficients

## Errors and response
# Residual errors are modeled as ~ N(0, sigma^2); rnorm() takes the standard deviation
epsilon <- rnorm(n, 0, sigma)

# Response is modeled as linear function plus residual errors
y <- betas[1] + betas[2]*p.black + epsilon
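Because `rnorm()` draws new errors on every call, each run of the code above produces a different `y`. One way to make a run repeatable is to set the random seed first. The sketch below wraps the generation step in a hypothetical helper, `simulate_response()` (not part of the exercise), and uses rounded, made-up parameter values and a made-up predictor:

```r
# Hypothetical helper: simulate one response vector from y = b0 + b1 * x + e,
# with e ~ N(0, sigma^2)
simulate_response <- function(x, betas, sigma) {
  epsilon <- rnorm(length(x), mean = 0, sd = sigma)
  betas[1] + betas[2] * x + epsilon
}

x_demo <- seq(0.1, 0.8, length.out = 32)   # stand-in for p.black

set.seed(123)                              # arbitrary seed
y_a <- simulate_response(x_demo, betas = c(0.88, 10.65), sigma = 1.67)

set.seed(123)                              # same seed, same draws
y_b <- simulate_response(x_demo, betas = c(0.88, 10.65), sigma = 1.67)

identical(y_a, y_b)
```

Without the second `set.seed()` call, `y_b` would almost surely differ from `y_a`.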

3. Linear regression of this generated dataset

# Fit of model to simulated data:  
lmfit.generated <- lm(y ~ p.black)
summary(lmfit.generated)


Call:
lm(formula = y ~ p.black)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8583 -1.6573  0.5004  1.5111  3.4417 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.9734     0.6711   1.450    0.157    
p.black       9.7032     1.7810   5.448 6.57e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.969 on 30 degrees of freedom
Multiple R-squared:  0.4974,    Adjusted R-squared:  0.4806 
F-statistic: 29.68 on 1 and 30 DF,  p-value: 6.57e-06

In-Class Sampling Distribution Simulation Assignment

Exercise 3:

Generate 5000 datasets using the same code
Fit a linear regression model to each dataset, storing the fit as “lm.temp”
Store the estimates of \(\beta_1\) and t-statistics
Calculate confidence limits for each simulation and determine how many include the true parameter used to simulate the data.

Hint: if you get stuck, try starting with a small number of simulations (fewer than 5000) until you get the code right.

# Set up matrices to hold results
nsims <- 5000 # number of simulations
beta.hat <- matrix(NA, nrow = nsims, ncol = 1) # estimates of beta_1
tsamp.dist <- matrix(NA, nrow = nsims, ncol = 1) # matrix to hold t-statistics
limits <- matrix(NA, nrow = nsims, ncol = 2) # matrix to hold CI limits
colnames(limits) <- c("LL.slope", "UL.slope") # label columns

# Simulation
for(i in 1:nsims){
  epsilon <- rnorm(n, 0, sigma) # random errors
  y <- betas[1] + betas[2]*p.black + epsilon # response
  lm.temp <- lm(y ~ p.black)
  ## extract beta-hat  
  beta.hat[i] <- coef(lm.temp)[2] 
  # Here is our t-statistic, calculated for each sample
  tsamp.dist[i]<-(beta.hat[i]-betas[2])/sqrt(vcov(lm.temp)[2,2])
  # Confidence limits
  limits[i,] <- confint(lm.temp)[2,] 
}
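The limits returned by `confint()` for a linear model are the usual t-based interval: estimate plus or minus a t critical value times the standard error. A minimal sketch checking this by hand on one simulated dataset (all names and parameter values below are hypothetical, for illustration only):

```r
set.seed(3)                          # arbitrary seed
x_demo <- runif(32)
y_demo <- 1 + 10 * x_demo + rnorm(32, sd = 1.5)
fit_demo <- lm(y_demo ~ x_demo)

est <- coef(fit_demo)[2]                          # slope estimate
se <- sqrt(vcov(fit_demo)[2, 2])                  # its standard error
t_crit <- qt(0.975, df = df.residual(fit_demo))   # critical value for a 95% CI

ci_by_hand <- c(est - t_crit * se, est + t_crit * se)

# Should match the second row of confint()
all.equal(unname(ci_by_hand), unname(confint(fit_demo)[2, ]))
```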

How many CIs include the parameter used to generate the data?

# Indicator of whether "true" parameter is within confidence intervals
I.in <- betas[2] >= limits[,1] & betas[2] <= limits[,2]

# Proportion of confidence intervals with true beta
sum(I.in)/nsims

[1] 0.9504
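The observed coverage will not equal 0.95 exactly, because it is itself an estimate based on 5000 simulations. As a rough check, assuming the intervals are independent and each covers the true slope with probability 0.95, the Monte Carlo standard error of the coverage proportion follows from the binomial formula (the variable names below are made up for this sketch):

```r
nsims_demo <- 5000     # number of simulated intervals
nominal <- 0.95        # nominal coverage
observed <- 0.9504     # coverage reported above

# Binomial standard error of a proportion under the nominal coverage
mc_se <- sqrt(nominal * (1 - nominal) / nsims_demo)
mc_se                             # about 0.0031

# Distance between observed and nominal coverage, in standard errors
(observed - nominal) / mc_se      # about 0.13
```

An observed coverage within about two standard errors of 0.95 (roughly 0.944 to 0.956 here) is consistent with the intervals working as advertised.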

Plot the sampling distributions of the slope estimates and t-statistics

par(mfrow=c(1,2))
hist(beta.hat, col="gray",xlab="", main=expression(paste("Sampling Distribution of ", hat(beta)[1])))
abline(v=betas[2]) # add population parameter
hist(tsamp.dist, xlab="",
     main=expression(t==frac(hat(beta)-beta, se(hat(beta)))), freq=FALSE)
tvalues <- seq(-3, 3, length = 1000) # x-values at which to evaluate the t-density
lines(tvalues, dt(tvalues, df = n - 2)) # overlay t-distribution with n - 2 = 30 df

Sampling distribution with t-distribution overlaid
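A numerical complement to the histogram is to compare quantiles of the simulated t-statistics with those of a t-distribution with n - 2 degrees of freedom. The sketch below reruns a smaller simulation (2000 datasets) on made-up data so that it is self-contained; `x_demo`, `beta_true`, and `sigma_true` are hypothetical stand-ins for `p.black`, `betas`, and `sigma`:

```r
set.seed(4)                          # arbitrary seed
n_demo <- 32
x_demo <- runif(n_demo)
beta_true <- c(1, 10)
sigma_true <- 1.5

# t-statistic for the slope, centered at the true slope, for each dataset
t_stats <- replicate(2000, {
  y_sim <- beta_true[1] + beta_true[2] * x_demo + rnorm(n_demo, sd = sigma_true)
  fit <- lm(y_sim ~ x_demo)
  unname((coef(fit)[2] - beta_true[2]) / sqrt(vcov(fit)[2, 2]))
})

probs <- c(0.025, 0.5, 0.975)
rbind(simulated   = quantile(t_stats, probs),
      theoretical = qt(probs, df = n_demo - 2))
```

The two rows should be close but not identical, since the simulated quantiles carry Monte Carlo error.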

Plot results of confidence limits (first 100 of them)

sim.dat<-data.frame(est.slope=beta.hat, limits, In=I.in) 
ggplot(sim.dat[1:100,], aes(x=est.slope, y=1:100, colour=as.factor(In))) +
  geom_segment(aes(x=LL.slope, xend=UL.slope, yend=1:100, colour=as.factor(In))) +
  scale_colour_discrete(name=expression(paste("Contains ", beta, "?"))) +
  geom_point() +
  theme(axis.text.y=element_blank()) +
  geom_vline(xintercept=betas[2]) +
  labs(x = "Estimate", y = " ", 
       alt = "Plot of 100 confidence intervals showing whether or not they contain the true parameter")

Plot of 100 confidence intervals showing whether or not they contain the true parameter

Document footer

Session Information:

sessionInfo()

R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] abd_0.2-8         mosaic_1.9.2      mosaicData_0.20.4 ggformula_0.14.0 
 [5] dplyr_1.1.4       Matrix_1.7-4      ggplot2_4.0.0     lattice_0.22-7   
 [9] nlme_3.1-168      knitr_1.50       

loaded via a namespace (and not attached):
 [1] generics_0.1.4     tidyr_1.3.1        stringi_1.8.7      hms_1.1.3         
 [5] digest_0.6.37      magrittr_2.0.4     evaluate_1.0.5     RColorBrewer_1.1-3
 [9] fastmap_1.2.0      jsonlite_2.0.0     purrr_1.1.0        scales_1.4.0      
[13] cli_3.6.5          labelled_2.15.0    rlang_1.1.6        withr_3.0.2       
[17] yaml_2.3.10        tools_4.5.1        uuid_1.2-1         mosaicCore_0.9.5  
[21] forcats_1.0.1      vctrs_0.6.5        R6_2.6.1           ggridges_0.5.7    
[25] lifecycle_1.0.4    stringr_1.5.2      htmlwidgets_1.6.4  MASS_7.3-65       
[29] pkgconfig_2.0.3    pillar_1.11.1      gtable_0.3.6       glue_1.8.0        
[33] Rcpp_1.1.0         systemfonts_1.2.3  haven_2.5.5        xfun_0.53         
[37] tibble_3.3.0       tidyselect_1.2.1   rstudioapi_0.17.1  ggiraph_0.9.1     
[41] dichromat_2.0-0.1  farver_2.1.2       htmltools_0.5.8.1  rmarkdown_2.29    
[45] labeling_0.4.3     compiler_4.5.1     S7_0.2.0