Foundations of Statistics with R
2020-08-10
Preface
This book is a first course in probability and statistics, using R. The book assumes a mathematical background of Calculus II, though much of the book can be read with a much lower level of mathematics. The book incorporates R throughout all chapters via simulations, data wrangling and/or data visualization. Tidyverse and base R tools are interspersed, depending on which is better for a particular job. For most visualizations, ggplot is used, and for most data wrangling, dplyr is used. An emphasis is placed on understanding results (and robustness to departures from assumptions) via simulation.
Much of the data in this book is provided in the package fosdata. Most of these data sets are from recent, open access publications that are linked to in the help pages of the data. We encourage you to spend some time reading the research papers that were written using the data in the book. We have taken two approaches to the data from original papers. In some instances, we provide essentially all of the data from the paper. This allows you to explore the data further and think about other visualizations and analyses that would be useful. It also typically requires some wrangling to get the data in a format for the analysis. In other instances, we have simplified the data from the paper significantly. In particular, in a few instances we have modified the data by filtering out observations or averaging in order to make it reasonable to assume independence. Please see the links provided in the help pages of fosdata for details.
Other data is found in other freely available R packages. You will need to install these packages and fosdata to access the data.
Our philosophy in this book is to not shy away from messy data sets.
The book contains extensive sections and many exercises that require data cleaning and manipulation. This is an essential part of the course.
We are assuming a knowledge of calculus through the Fundamental Theorem of Calculus, partial differentiation, and some knowledge of infinite series. However, many parts of the book do not require calculus, and since the book emphasizes simulations, most parts of the book can be understood without using calculus. Proofs of many results are provided, and justifications via simulations for many more, but this text is not intended to support a proof-based course. Readers are encouraged to follow the proofs, but often one wants to understand a proof only after first understanding the result and why it is important.
This book is written in rmarkdown using the bookdown package. If, or rather when, you find typos, mistakes or ideas for improvement, please contact the authors. The original idea for a course of this type is due to Michael Lamar. The authors wish to thank Matt Schuelke for helpful discussions regarding aspects of this book.
This book is copyright 2020, Darrin Speegle and Bryan Clair. Do not transmit or reuse without express permission.
0.1 Further reading
No book like this would be complete without a list of books that would be useful for the student who wishes to learn more. Here is a list of other resources that the authors have enjoyed learning from.
- ggplot2, by Hadley Wickham, gives a nice overview of the capabilities of the ggplot2 package. Students interested in data visualization would find this book interesting.
- Advanced R, by Hadley Wickham, provides much more information on R than what we cover in this book. Computer Science students might enjoy reading this book.
- The Statistical Sleuth, by Ramsey and Schafer, will help the student think more like a statistician when dealing with data sets. This book is on a lower level mathematically.
- Modern Applied Statistics with S, by Venables and Ripley, is a book that covers more advanced statistical topics without much mathematics.
- Introductory Statistics with R, by Peter Dalgaard, is a concise introduction to using R for many types of statistical procedures.
- Mathematical Statistics with Applications, by Wackerly, Mendenhall and Scheaffer, is a more mathematical look at the topics of this book, though it still requires only multivariate calculus and perhaps basic linear algebra. Students interested in the theory and proofs behind the material in this book would enjoy reading it.
- The package data.table. If we were going to include a package that isn’t part of the tidyverse, this would be it. It is widely used for data manipulation, and can be significantly faster than other R tools for many classes of problems. If you find that your code is running too slowly, it may be time to learn data.table.
The interested reader who would like some review of elementary combinatorics can see Khan Academy. Another source with even more examples of counting arguments is the first chapter of Discrete and Combinatorial Mathematics, by Ralph Grimaldi.
0.2 Installing R and RStudio
R is a programming language, distributed as its own software program.
To install R:
Mac users
- Visit the CRAN archive, at https://cran.r-project.org
- Find the link that looks like “R-x.x.x.pkg” under the Latest release heading.
- Download the “R-x.x.x.pkg” file, double-click it to open, and follow the installation instructions.
Windows users
- Visit the CRAN archive, at https://cran.r-project.org
- Click on the “Download R for Windows” link at the top of the page.
- Click on the “base” link.
- Click the large “Download R x.x.x for Windows” link and save the executable file somewhere on your computer.
- Run the .exe file and follow the installation instructions.
RStudio is a graphical interface to R. R can work without RStudio, but RStudio requires R to work. Though you may choose to use R in its native form, the improvements that come with RStudio are absolutely worth the effort to install it. In fact, once you have RStudio installed, there is little need to ever run the R program itself.
To install RStudio:
- Go to www.rstudio.com and click on the “Download RStudio” button.
- Click on “Download RStudio Desktop.”
- Click on the version recommended for your system and install it.
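Once both programs are installed, one quick way to confirm that RStudio has found your copy of R is to ask R for its version in the RStudio Console. This is a small sketch rather than a required step; it uses only base R, and the version printed on your machine will depend on when you installed.

```r
# Print a human-readable description of the R version running in this session.
print(R.version.string)

# The same version as an object that can be compared with other versions.
r_version <- getRversion()
print(r_version)
```

If these commands print a version number, RStudio is talking to R and you are ready to install packages.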
Finally, you might choose to install the tidyverse and the book’s data set at this time, by finding the Console in RStudio and typing:
install.packages("tidyverse")
install.packages("remotes")
remotes::install_github(repo = "speegled/fosdata")
More details on package installation are in Section 1.7.
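After running the installation commands above, you may want to check that the packages are actually available. The following is a minimal sketch of one way to do so, not part of the book's own instructions. The variable names needed, installed and not_installed are illustrative, and requireNamespace() is a base R function that reports whether a package can be loaded without attaching it.

```r
# Packages that the installation commands above are meant to provide.
needed <- c("tidyverse", "remotes", "fosdata")

# requireNamespace() returns TRUE if the package can be loaded and FALSE if not.
# quietly = TRUE suppresses the startup and warning messages.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Report any packages that still need to be installed.
not_installed <- needed[!installed]
if (length(not_installed) > 0) {
  message("Still missing: ", paste(not_installed, collapse = ", "))
} else {
  message("All packages are available.")
}
```

Any package reported as missing can be installed again with the corresponding command above.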