Preface

This is a book on probability and statistics suitable for the sophomore or junior level at university. We assume knowledge of calculus at the level of Calculus II. We do not assume prior experience with statistics or programming, though students who have experience with neither should expect to work hard. R is an integral part of the exposition, so you should not read this book without first installing R and RStudio. We use base R and tidyverse tools for the most part, with a few exceptions. We do not call on the full range of tidyverse tools, but focus on:

  1. ggplot2 is used for all graphics. We learned ggplot from the fantastic book, ggplot2 by Hadley Wickham. I strongly encourage readers to refer to that book whenever they have a question about plotting.
  2. dplyr is used for data manipulation. Whenever a data manipulation problem gets tricky, we will switch to the methods of dplyr.
  3. The gather function from the package tidyr is used in several places.
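As a rough illustration of how these three tools fit together, here is a minimal sketch. It uses the built-in `iris` data set, which is our choice for the example rather than one of the data sets used later in the book, and the object names (`sepal`, `sepal_long`, `sepal_means`, `p`) are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# dplyr: keep the species label and two of the measurement columns.
sepal <- iris %>%
  select(Species, Sepal.Length, Sepal.Width)

# tidyr::gather: stack the two measurement columns into key/value pairs,
# giving one row per (flower, measurement) combination.
sepal_long <- sepal %>%
  gather(key = "measurement", value = "cm", Sepal.Length, Sepal.Width)

# dplyr: mean of each measurement within each species.
sepal_means <- sepal_long %>%
  group_by(Species, measurement) %>%
  summarize(mean_cm = mean(cm)) %>%
  ungroup()

# ggplot2: boxplots by species, one panel per measurement.
p <- ggplot(sepal_long, aes(x = Species, y = cm)) +
  geom_boxplot() +
  facet_wrap(~ measurement)
print(p)
```

The long format produced by `gather` is what lets a single `ggplot` call draw both measurements, with `facet_wrap` splitting them into panels.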

We will also use datasets from ISwR, MASS and Sleuth3 frequently, and from car and babynames in places.
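For readers new to R, the following sketch shows one way to load a data set from a package without attaching the whole package. The choice of `Animals` from MASS is ours, purely for illustration. Loading with `data()` instead of `library(MASS)` also avoids a common surprise: attaching MASS after dplyr masks dplyr's `select` function.

```r
# Load a single data set from MASS without attaching the package.
data(Animals, package = "MASS")

# Inspect the structure and the first few rows.
str(Animals)
head(Animals)
```

The same pattern should work for the other packages once they are installed, with the package name and data set name changed accordingly.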

No book like this would be complete without a list of books that would be useful for the student who wishes to learn more than what is in this book. Here is a list of other resources that I have enjoyed learning from.

  1. ggplot2, by Hadley Wickham, gives a nice overview of the capabilities of the ggplot2 package. Students interested in data visualization would find this book interesting.
  2. Advanced R, by Hadley Wickham, provides much more information on R than what we cover in this book. Computer Science students might enjoy reading this book.
  3. The Statistical Sleuth, by Ramsey and Schafer, will help the student think more like a statistician when dealing with data sets.
  4. Modern Applied Statistics with S, by Venables and Ripley, is a more advanced book.
  5. Introductory Statistics with R, by Peter Dalgaard, is a concise introduction to using R for many types of statistical procedures.
  6. Mathematical Statistics with Applications, by Wackerly, Mendenhall and Scheaffer, is a more mathematical look at the topics of this book, though it still requires only multivariate calculus and perhaps basic linear algebra. Students interested in the theory and proofs behind the material in this book would enjoy reading it.
  7. DataCamp is an online, interactive, tutorial-based method of learning data science in general, and R in particular.
  8. The package data.table. If I were going to include a package that isn’t part of the tidyverse, this would be it. It is widely used for data manipulation, and I have found it to be significantly faster than other R tools for many classes of problems. If you find that your code is running too slowly, it may be time to learn data.table.

Our philosophy in this book is to not shy away from messy data sets. We have relatively extensive sections on getting data into its correct format, and many exercises that involve working with messy data. I feel that this is an essential part of the course.

We assume a knowledge of calculus through the Fundamental Theorem of Calculus, including integration and some knowledge of infinite series. However, the main parts of the book do not require calculus. Proofs of some results are provided, but this book is not intended to support a proof-based course. I strongly encourage students to understand the proofs, but often one only wants to understand the proof after understanding the result and why it is important. When we use mathematical background that some students may be rusty on, we provide a link to the Wikipedia page on that topic.

This book is written in Markdown using the bookdown package. If, or should I rather say when, you find typos, mistakes or ideas for improvement, please create a pull request on Bitbucket here. The original idea for a course of this type is due to Michael Lamar.

Thanks to Anna Medley and Conor for finding typos!

0.1 Installing R

Mac Users

  1. Visit here.
  2. Click on the “Download R for (Mac) OS X” link at the top of the page.
  3. Click on the file containing the latest version of R under “Files.” (Currently 3.4.1, as of July 13, 2017).
  4. Save the .pkg file, double-click it to open, and follow the installation instructions.
  5. Next we need to install RStudio.
  6. Go to www.rstudio.com and click on the “Download RStudio” button.
  7. Click on “Download RStudio Desktop.”
  8. Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder.
  9. Finally we install crucial packages.
  10. Open RStudio.
  11. At the > prompt inside Console, type (or copy and paste) install.packages(c("ggplot2", "dplyr", "ISwR", "MASS", "Sleuth3", "rmarkdown", "tidyr")).

Windows Users

  1. Visit here.
  2. Click on the “Download R for Windows” link at the top of the page.
  3. Click on the “base” link.
  4. Click “R-patched.exe” and save the executable file somewhere on your computer. Run the .exe file and follow the installation instructions.
  5. Next we need to install RStudio.
  6. Go to www.rstudio.com and click on the “Download RStudio” button.
  7. Click on “Download RStudio Desktop.”
  8. Click on the version recommended for your system, or the latest Windows version, and save the executable file. Run the .exe file and follow the installation instructions.
  9. Finally we install crucial packages.
  10. Open RStudio.
  11. At the > prompt inside Console, type (or copy and paste) install.packages(c("ggplot2", "dplyr", "ISwR", "MASS", "Sleuth3", "rmarkdown", "tidyr")).
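After running the `install.packages()` command on either platform, it can be worth confirming that every package actually installed. The following is a minimal sketch of one way to do that. The helper name `missing_packages` is hypothetical and not part of R or of this book's materials.

```r
# The packages named in the install.packages() call above.
needed <- c("ggplot2", "dplyr", "ISwR", "MASS", "Sleuth3", "rmarkdown", "tidyr")

# Hypothetical helper: return the entries of pkgs that cannot be loaded.
# requireNamespace() returns TRUE if the package's namespace loads and
# FALSE otherwise, without attaching the package to the search path.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

still_missing <- missing_packages(needed)
if (length(still_missing) > 0) {
  message("Not yet installed: ", paste(still_missing, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

If any names are reported, passing `still_missing` to `install.packages()` retries just those packages.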