Exercise 1 Solution

Includes:

Lion noses linear regression
Data generation consistent with model
Linear regression of this generated dataset
In-class Sampling Distribution Simulation Assignment

Document Preamble

Load libraries

library(knitr)
library(abd)

Settings for Knitr (optional)

opts_chunk$set(fig.width = 8, fig.height = 6)

1. Lion noses linear regression:

Data entry

data(LionNoses)
head(LionNoses)

  age proportion.black
1 1.1             0.21
2 1.5             0.14
3 1.9             0.11
4 2.2             0.13
5 2.6             0.12
6 3.2             0.13

Fit linear model

lm.nose <- lm(age ~ proportion.black, data = LionNoses)

Parameters:

The coefficients are stored in lm.nose, and the residual standard error is stored in its summary:

coef(lm.nose)

     (Intercept) proportion.black 
       0.8790062       10.6471194

summary(lm.nose)$sigma # residual standard error (estimate of sigma)

[1] 1.668764

What else is stored in lm.nose and its summary? (residuals, unscaled covariance matrix, etc.)

names(lm.nose)

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"

names(summary(lm.nose))

 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
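
As a minimal sketch of how these stored pieces relate, the block below refits the model to only the six rows shown in the head() output above, so its estimates differ from the full-data fit. The object names (lion6, fit6, and so on) are made up for illustration. It uses the standard accessors residuals(), vcov(), and df.residual() rather than reaching into the list directly.

```r
# First six rows of LionNoses, copied from the head() output above
# (a toy subset, so the estimates differ from the full-data fit).
lion6 <- data.frame(
  age = c(1.1, 1.5, 1.9, 2.2, 2.6, 3.2),
  proportion.black = c(0.21, 0.14, 0.11, 0.13, 0.12, 0.13)
)
fit6 <- lm(age ~ proportion.black, data = lion6)

# Accessor functions for components of the fit
res6 <- residuals(fit6)  # same values as fit6$residuals
vc6  <- vcov(fit6)       # estimated variance-covariance matrix of the coefficients

# Standard errors are the square roots of the diagonal of vcov()
se6 <- sqrt(diag(vc6))
se6
coef(summary(fit6))[, "Std. Error"]

# sigma is the square root of (residual sum of squares / residual df)
sigma6 <- sqrt(sum(res6^2) / df.residual(fit6))
all.equal(sigma6, summary(fit6)$sigma)
```

The variance-covariance matrix itself is not an element of the lm object; vcov() builds it as sigma^2 times the cov.unscaled element of the summary.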

2. Data generation consistent with fitted model

## Sample size - use length() so it matches the sample size of the original data
n <- length(LionNoses$age)

## Predictor - copy of the original proportion.black data, now in a vector
p.black <- LionNoses$proportion.black

## Parameters
sigma <- summary(lm.nose)$sigma # residual standard error
betas <- coef(lm.nose) # regression coefficients

## Errors and response
# Residual errors are modeled as ~ N(0, sigma^2); rnorm() takes the standard deviation
epsilon <- rnorm(n, mean = 0, sd = sigma)

# Response is modeled as linear function plus residual errors
y <- betas[1] + betas[2]*p.black + epsilon
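
The data-generation step can be wrapped in a small function so that it is easy to repeat. The sketch below is one possible way to do that, not part of the original solution: simulate_response() is a hypothetical helper, the predictor x_demo is a made-up vector of length 32, and the parameter values are the estimates printed above, rounded. Setting a seed (the value 123 is arbitrary) makes a particular draw repeatable.

```r
# Hypothetical helper (not part of the original solution): draws one response
# vector from y = beta0 + beta1 * x + epsilon, with epsilon ~ N(0, sigma^2).
simulate_response <- function(x, betas, sigma) {
  epsilon <- rnorm(length(x), mean = 0, sd = sigma)
  unname(betas[1]) + unname(betas[2]) * x + epsilon
}

# Illustrative inputs: rounded estimates from the fit above and a made-up
# predictor vector of length 32.
x_demo     <- seq(0.1, 0.8, length.out = 32)
betas_demo <- c(0.879, 10.647)
sigma_demo <- 1.669

set.seed(123)  # arbitrary seed, chosen only so the draw can be repeated
y1 <- simulate_response(x_demo, betas_demo, sigma_demo)

set.seed(123)
y2 <- simulate_response(x_demo, betas_demo, sigma_demo)

identical(y1, y2)  # compares the two draws made with the same seed
```

Without a call to set.seed(), each run of the generation code produces a different dataset, which is why the refit estimates change from run to run.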

3. Linear regression of this generated dataset

# Fit of model to simulated data:  
lmfit.generated <- lm(y ~ p.black)
summary(lmfit.generated)


Call:
lm(formula = y ~ p.black)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8741 -0.7826 -0.3292  0.7713  3.2279 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.8138     0.5025   1.620    0.116    
p.black      10.8301     1.3335   8.121 4.58e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.474 on 30 degrees of freedom
Multiple R-squared:  0.6874,    Adjusted R-squared:  0.6769 
F-statistic: 65.96 on 1 and 30 DF,  p-value: 4.579e-09
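
One way to compare a refit with the values used to generate the data is to check whether the confidence intervals contain them. The following is a sketch under stated assumptions: the predictor x_demo is made up, the "true" parameters are the rounded estimates from the original fit, and the seed is arbitrary. confint() is the standard accessor for confidence intervals from an lm fit.

```r
set.seed(42)  # arbitrary seed for a repeatable illustration

# Made-up predictor and generating parameters (rounded from the fit above)
x_demo     <- seq(0.1, 0.8, length.out = 32)
beta0      <- 0.879
beta1      <- 10.647
sigma_true <- 1.669

y_demo   <- beta0 + beta1 * x_demo + rnorm(length(x_demo), mean = 0, sd = sigma_true)
fit_demo <- lm(y_demo ~ x_demo)

# 95% confidence intervals for the intercept and slope
ci <- confint(fit_demo, level = 0.95)
ci

# Does each interval contain the value used to generate the data?
covers <- c(beta0, beta1) >= ci[, 1] & c(beta0, beta1) <= ci[, 2]
covers
```

Across many simulated datasets, each 95% interval should contain its generating value about 95% of the time; a single dataset can miss.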

In-Class Sampling Distribution Simulation Assignment

Exercise 1:

Generate 5000 datasets using the same code
Fit a linear regression model to each dataset, storing the fit as lm.temp
Store the estimates of \(\beta_1\)

Hint: if you get stuck, try starting with a small number of simulations (fewer than 5000) until you get the code right.

# Set up a matrix of size 5000 by 1 to store our estimates of beta_1
nsims <- 5000 # number of simulations
beta.hat <- matrix(NA, nrow = nsims, ncol = 1)

# Simulation
for(i in 1:nsims){
  epsilon <- rnorm(n, 0, sigma) # random errors
  y <- betas[1] + betas[2]*p.black + epsilon # response
  lm.temp <- lm(y ~ p.black)
  ## extract beta-hat  
  beta.hat[i] <- coef(lm.temp)[2] 
}
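
The same simulation can be written with replicate() instead of a for loop, and the spread of the estimates can be compared with the textbook standard error of the slope, sigma / sqrt(sum((x - mean(x))^2)). The block below is a sketch, not the assigned solution: the predictor is made up, the parameters are the rounded estimates from above, the seed is arbitrary, and 2000 simulations are used only to keep the run short.

```r
set.seed(2024)  # arbitrary seed

x_demo     <- seq(0.1, 0.8, length.out = 32)  # made-up predictor values
betas_demo <- c(0.879, 10.647)                # rounded estimates from above
sigma_demo <- 1.669
nsims_demo <- 2000                            # fewer than 5000 to keep it quick

# One simulated slope estimate per replicate
slopes <- replicate(nsims_demo, {
  y_sim <- betas_demo[1] + betas_demo[2] * x_demo +
    rnorm(length(x_demo), mean = 0, sd = sigma_demo)
  unname(coef(lm(y_sim ~ x_demo))[2])
})

# Theoretical standard error of the slope when sigma is known
se_theory <- sigma_demo / sqrt(sum((x_demo - mean(x_demo))^2))

c(mean_of_estimates = mean(slopes), true_slope = betas_demo[2])
c(sd_of_estimates = sd(slopes), se_theory = se_theory)
```

The mean of the simulated slopes should sit close to the generating slope, and their standard deviation close to se_theory, which is what the sampling distribution of the estimator predicts.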

Plot results

hist(beta.hat, col="gray",xlab="", main=expression(paste("Sampling Distribution of ", hat(beta)[1])))
abline(v=betas[2]) # add population parameter

Histogram showing the sampling distribution
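
A possible refinement of the plot is to draw the histogram on the density scale and overlay the theoretical normal sampling distribution. The sketch below uses a stand-in for beta.hat, drawn directly from that theoretical distribution, so that it runs on its own. The predictor values, rounded parameters, and seed are all assumptions made for illustration.

```r
set.seed(7)  # arbitrary seed

# Stand-in for beta.hat: draws from the theoretical sampling distribution
# (made-up predictor values; rounded parameters from the fit above)
x_demo    <- seq(0.1, 0.8, length.out = 32)
beta1     <- 10.647
se_theory <- 1.669 / sqrt(sum((x_demo - mean(x_demo))^2))
beta_hat_demo <- rnorm(5000, mean = beta1, sd = se_theory)

# freq = FALSE puts the histogram on the density scale so a curve can be added
h <- hist(beta_hat_demo, freq = FALSE, col = "gray", xlab = "",
          main = expression(paste("Sampling Distribution of ", hat(beta)[1])))
curve(dnorm(x, mean = beta1, sd = se_theory), add = TRUE, lwd = 2)
abline(v = beta1, lty = 2)  # value used to generate the data
```

With the real simulation output, beta_hat_demo would be replaced by beta.hat, and beta1 and se_theory by values computed from betas, sigma, and p.black.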

Document footer

Session Information:

sessionInfo()

R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] abd_0.2-8         mosaic_1.9.2      mosaicData_0.20.4 ggformula_0.14.0 
 [5] dplyr_1.1.4       Matrix_1.7-4      ggplot2_4.0.0     lattice_0.22-7   
 [9] nlme_3.1-168      knitr_1.50       

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.1     tidyselect_1.2.1  
 [5] Rcpp_1.1.0         stringr_1.5.2      dichromat_2.0-0.1  tidyr_1.3.1       
 [9] systemfonts_1.2.3  scales_1.4.0       labelled_2.15.0    uuid_1.2-1        
[13] yaml_2.3.10        fastmap_1.2.0      R6_2.6.1           generics_0.1.4    
[17] MASS_7.3-65        forcats_1.0.1      htmlwidgets_1.6.4  mosaicCore_0.9.5  
[21] tibble_3.3.0       pillar_1.11.1      RColorBrewer_1.1-3 rlang_1.1.6       
[25] stringi_1.8.7      xfun_0.53          S7_0.2.0           ggiraph_0.9.1     
[29] cli_3.6.5          withr_3.0.2        magrittr_2.0.4     digest_0.6.37     
[33] rstudioapi_0.17.1  haven_2.5.5        hms_1.1.3          lifecycle_1.0.4   
[37] vctrs_0.6.5        evaluate_1.0.5     glue_1.8.0         farver_2.1.2      
[41] rmarkdown_2.29     purrr_1.1.0        tools_4.5.1        pkgconfig_2.0.3   
[45] htmltools_0.5.8.1  ggridges_0.5.7