Preface
Ecological data pose many challenges to statistical inference. Most data come from observational studies rather than designed experiments; observational units are frequently sampled repeatedly over time, resulting in multiple, non-independent measurements; response data are often binary (e.g., presence-absence data) or non-negative integers (e.g., counts), and therefore, the data do not fit the standard assumptions of linear regression (Normality, independence, and constant variance). This book will familiarize readers with modern statistical methods that address these complexities using both frequentist and Bayesian frameworks.
In the first part of the book, we focus on models appropriate for Normally distributed response variables. We begin by reviewing key concepts in frequentist statistics (sampling distributions, confidence intervals, p-values, etc.) in the context of linear regression (Section 1 Linear regression review), but also introduce the bootstrap as a useful tool for inference when common assumptions may not be met (2 Bootstrapping). We then move to multiple regression (3 Multiple regression), emphasizing the role of design matrices when formulating models that include categorical predictor variables, interaction terms, and non-linear predictor-response relationships (4 Modeling Non-linear relationships).
In part 2 of the book, we consider methods for choosing which variables to include in a model. We explore the impact of collinearity on parameter uncertainty (6 Multicollinearity) and the important role that causal diagrams should play when choosing which variables to include in a model (7 Causal Inference). Lastly, we discuss different modeling objectives (describing, predicting, and inferring), and we explore the relative utility of various modeling strategies for meeting these objectives (8 Modeling Strategies).
In part 3 of the book, we explore other statistical distributions besides the Normal distribution (9 Introduction to probability distributions). We also review important concepts in mathematical statistics, which allow us to develop a better understanding of maximum likelihood (10 Maximum likelihood) and Bayesian inferential frameworks (Sections 11 Introduction to Bayesian statistics, 12 A Brief introduction to MCMC sampling and JAGS, 13 Bayesian linear regression). Then, in part 4 of the book, we consider extensions for modeling non-Normal data using generalized linear models (Sections 14 Introduction to generalized linear models (GLMs), 15 Regression models for count data, 16 Logistic regression) and models that account for zero-inflated data (17 Models for zero-inflated data). Lastly, in part 5 of the book, we discuss methods for modeling correlated data, including mixed models (18 Linear Mixed Effects Models and 19 Generalized linear mixed effects models (GLMMs)) and generalized estimating equations (20 Generalized Estimating Equations (GEE)).
Formatting and conventions
I frequently use red to highlight important concepts when first introduced (e.g., sampling distribution).
Before compiling the code in each chapter, I run a short script that sets the default theme for all ggplots:
library(ggplot2)   # provides theme_set() and theme_bw()
library(ggthemes)
theme_set(theme_bw())
Pre-requisites
Ideally, you will have an understanding of key statistical concepts, including hypothesis tests, the Normal distribution, and linear regression. In addition, you should have a working knowledge of the R programming language (e.g., be able to read in data, work with common object types such as lists, matrices, data frames, install and load packages, access help functions, and construct simple plots).
Although I envision this book as appropriate for graduate students taking a second course in statistics, I recognize that not all introductory statistics courses are created equal; the experience and background knowledge of students entering my graduate-level statistics class are highly variable. Thus, although it would be nice if all students came in with a thorough understanding of key statistical concepts, an ability to code in R, and a solid foundation of linear models, it is rare that any student comes in with a solid foundation in all three areas. I generally try to include enough background and supporting material to overcome these deficiencies. Yet, some readers may find it beneficial to also seek out additional resources on one or more of these topics.
Learning objectives
The overarching goal of this book is to help train students to effectively analyze the data they collect as part of their research. Much of the book focuses on commonly used statistical models (e.g., linear and generalized linear models). Of course, it is impossible to cover all statistical methods that one might someday need. Further, a superficial understanding of topics only gets one so far. Therefore, I try to emphasize key concepts and techniques that provide a solid foundation for further learning rather than a statistical “cookbook” with a set of recipes for different data situations. By working with several different classes of models, and both frequentist and Bayesian implementations, my hope is that you will develop strong coding skills and an enhanced ability to reason using mathematics and statistics. The repeated exposure to mathematical concepts and formulas should also make it easier for you to read and understand a wider range of literature. Hopefully, you will find quantitative papers to be much less intimidating!
By the end of the book, you should be able to:
- Construct models that address specific biological objectives.
- Understand the role of random variables and common statistical distributions in formulating modern regression models.
- Demonstrate model literacy, i.e., you should be able to describe a variety of statistical models and their assumptions using equations and text and match parameters in these equations to estimates in computer output.
- Identify key model assumptions, utilize diagnostic tools to assess the validity of these assumptions, and conduct sensitivity analyses to evaluate model robustness to assumption violations.
- Gain an appreciation for challenges associated with selecting among competing models and performing multi-model inference.
- Critique statistical methods used in the applied literature, identify strengths and weaknesses of different modeling approaches, and select appropriate analyses in your work.
To achieve the above learning objectives, you will be expected to develop new statistical modeling and computing skills (see Skills Objectives).
Skills objectives
By the end of this book, you should be able to:
- Construct predictor variables that allow fitting of models with categorical predictors and that allow for non-linear relationships between explanatory and response variables.
- Fit and evaluate a variety of regression models in both frequentist and Bayesian frameworks using open-source software (R and JAGS).
- Use simulation-based methods to test your understanding of key statistical concepts and models and to evaluate the plausibility of different model assumptions.
- Estimate quantities of interest along with their associated measures of uncertainty (e.g., confidence and prediction intervals) for a variety of commonly used regression models.
- Model non-Normal data using generalized linear models.
- Model correlated data using mixed models and generalized estimating equations; estimate robust standard errors by performing a cluster-level bootstrap (resampling independent observational units).
Real versus simulated data
In this book, we will use a combination of real and simulated data sets. Most of the data sets encountered in this book are contained in various R packages, including the Data4Ecologists package, which I created to go along with this book. This package can be installed in R using:
devtools::install_github("jfieberg/Data4Ecologists")
Students are sometimes skeptical of the value of working with simulated data, but there are many reasons why working with simulated data is useful (Kéry 2010):
- With simulated data, we know the truth, and thus, we can compare estimates to truth. This is the best way to see if we can recover the parameter values used to generate the simulated data.
- Simulations can help us determine if we have coding errors, particularly when we are writing new code or developing a new analytic method. If we cannot recover the parameter values used to simulate data, that may indicate we have a bug in our code.
- Simulated data can facilitate understanding of sampling distributions, one of the most important concepts in statistics.
- We can study the properties of an estimator (its mean, its variance, and whether it is robust to assumption violations).
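As a minimal sketch of these points, the code below simulates many data sets from a known linear regression model and checks whether `lm()` recovers the slope used to generate the data. The sample size, parameter values, and the helper `sim_slope()` are illustrative assumptions chosen for this example; they are not taken from any data set used in the book.

```r
# Sketch: simulate data from a known model and compare estimates to truth.
# All numeric values below are illustrative assumptions.
set.seed(1)

n_sims <- 1000   # number of simulated data sets
n      <- 50     # observations per data set
beta0  <- 2      # true intercept
beta1  <- 0.5    # true slope
sigma  <- 1      # true residual standard deviation

# Hypothetical helper: simulate one data set and return the estimated slope
sim_slope <- function() {
  x <- runif(n, min = 0, max = 10)
  y <- rnorm(n, mean = beta0 + beta1 * x, sd = sigma)
  unname(coef(lm(y ~ x))["x"])
}

# The collection of estimates approximates the sampling distribution of the slope
slopes <- replicate(n_sims, sim_slope())

mean(slopes)                        # close to beta1 if the estimator is unbiased
sd(slopes)                          # approximates the standard error of the slope
quantile(slopes, c(0.025, 0.975))   # central 95% of the simulated estimates
```

If the average of the simulated estimates were far from `beta1`, that would suggest either a biased estimator or a bug in the simulation or fitting code.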
At the same time, we should be wary of methods only shown to work with simulated data, especially when all assumptions are met. With real data, common assumptions are almost never perfectly met. What then? We need to be able to identify the most critical assumptions, evaluate the impact of violations to them, and come up with strategies that perform well even when assumptions are not perfectly met. Again, simulations can play a significant role here.
Standing on the shoulders of…
I have borrowed many ideas, data sets, and in some cases code from a variety of sources when putting together this book. Of particular importance were:
- The Lock family’s book, Unlocking the Power of Data (Lock et al. 2020), which I have used to teach my undergraduate course in introductory statistics for many years.
- Jack Weiss’s courses on statistics for ecologists and environmental scientists at the University of North Carolina-Chapel Hill (unfortunately, he passed away a few years ago, and his web sites are no longer easy to track down).
- Marc Kéry’s Introduction to WinBUGS for Ecologists (Kéry 2010)
- Zuur et al.’s Mixed Effects Models and Extensions in Ecology with R (Zuur et al. 2009)
- Ben Bolker’s Ecological Models and Data in R (Bolker 2008)
I typically include both Kéry’s and Zuur et al.’s books as recommended texts when I teach FW8051, Statistics for Ecologists, at the University of Minnesota. Ben Bolker’s book is also a great resource for ecologists interested in constructing semi-mechanistic models and does a nice job of introducing frequentist and Bayesian inferential frameworks.
I began writing this book after having taught for several years, using my lecture slides to seed the content. I have attempted to trace ideas back to original authors and credit their work whenever possible. Unless otherwise indicated, third-party texts, images, and other materials quoted in these materials are included on the basis of fair use as described in the Code of Best Practices for Fair Use in Open Education. If you see any material, copyrighted or otherwise, that is not properly acknowledged, please let me know so that I can correct any mistakes that I have made.
How to use this book and associated resources for teaching
I am currently developing other resources concurrent with drafting this book. These resources include:
- A separate companion document with suggested exercises. A current draft can be found here. If you are teaching from the book, I am also happy to share solutions to the exercises.
- An R package containing data sets used in the book and in classroom and homework exercises.
As you read through the book, you will occasionally see Think-Pair-Share questions, which are meant to prompt you to pause and reflect on a concept that has recently been introduced. In the classroom, these questions can lead to useful discussions among peers and serve as a test of students' understanding.
Feedback
I hope this resource will be useful for others, especially biologists trying to increase their statistical abilities. If you use this book in your course, or if you find any errors or have suggestions or feedback for improving the content, please let me know using this Google form. In addition, if you have data sets that would be useful for creating additional exercises, please reach out to me. Ultimately, I would like the second edition of the Exercise Book to be titled, Exercises: Statistics for Ecologists by Ecologists.
Lastly, I have adopted some code from Ben Marwick’s modified “A Minimal Bookdown Example” that allows readers to provide feedback directly on each page of the book using hypothes.is. Readers can highlight text or provide comments once they sign up for an account. This idea was inspired by Matthew Salganik’s Open Review Toolkit. The code for the Open Review Toolkit has an MIT license, and Ben Marwick’s code is licensed under a Creative Commons Attribution 4.0 International License.
Acknowledgements
I thank the many students I have had in class who have made teaching so worthwhile, and my wife and family for their patience with me as I worked on the book. I thank Alex Bajcz, Garrett Street, and Olivia Douell for reading through the first several chapters and providing thoughtful comments and suggestions. I also thank the many people who have provided comments and suggestions using hypothes.is, especially Robert Buck (StatAnswers Consulting LLC) and Smith Freeman, who read through and commented on the full book; Bert van der Veen (Norwegian University of Science and Technology), who provided extremely helpful suggestions on the multicollinearity chapter; Jarrett Byrnes (UMASS Boston), who has used the book for one of his courses and had several students comment using hypothes.is; and Tiago Marques (University of St. Andrews), who offered extra credit to his students for sending useful comments (there were many!). I appreciate helpful comments and suggestions from Jordan Heiman, Dani Freund, Maija Weaver, Paul Freetown, Johannes Signer, Nick Dulvy, and the list of hypothes.is user names provided below. If you find yourself in this list and are willing to be acknowledged here, please contact me!
Hypothes.is user names: deClare_125, angeliquedenise, bbrandon, brow6589, chris_laRosee, cpolik, dochvam, forester, frequentist_stats, irrigavi, MATH4Stat, michael_d, qnn, satyadeviw, Teleopsis, and tschafer.