Chapter 1 Data in R
The R Statistical Programming Language plays a central role in this book. While there are other programming languages and software packages that do similar things, we chose R for the following reasons:
- R is widely used among statisticians, especially academic statisticians. If there is a new statistical procedure developed somewhere in academia, chances are that the code for it will be made available in R. This distinguishes R from, say, Python, which would be our choice for machine learning.
- R is commonly used for statistical analyses in many disciplines. Other software, such as SPSS or SAS, is also used, and in some disciplines it would be the primary choice for discipline-specific courses, but R is often used and its use is growing.
- R is free. You can install it and all optional packages on your computer at no cost. This is a big differentiator between R and SAS, SPSS, Matlab and most other statistical software.
- R has been experiencing a renaissance. With the advent of the tidyverse and RStudio, the R community is vibrant, young and growing. We have also found the community to be extremely welcoming.
In this chapter, we will begin to see some of the capabilities of R. We point out that R is a fully functional programming language, as well as being a statistical software package, though we will not explore the nuances of R as a programming language much in this book.
1.1 Arithmetic and Variable Assignment
We begin by showing how R can be used as a calculator. Here is a table of commonly used arithmetic operators.
operator | description | example |
---|---|---|
+ | addition | 1 + 1 |
- | subtraction | 4 - 3 |
* | multiplication | 3 * 7 |
/ | division | 8/3 |
^ | exponentiation | 2^3 |
Here is the output of the examples in the previous table, for completeness.
## [1] 2
## [1] 1
## [1] 21
## [1] 2.666667
## [1] 8
A couple of useful constants in R are pi
and exp(1)
, which are \(\pi \approx 3.141593\) and \(e \approx 2.718282\). Here are a couple of examples of how you can use them.
## [1] 9.869604
## [1] 5.436564
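The commands that produced the two values above are not shown. A plausible reconstruction, which is our assumption rather than the original code, squares pi and doubles exp(1):

```r
# Assumed reconstruction: these two expressions reproduce the values printed above.
pi^2         # pi squared, about 9.869604
2 * exp(1)   # twice e, about 5.436564
```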
R is a functional programming language. If you don’t know what that means, that’s OK, but as you might guess from the name, functions play a large role in R. We will see many, many functions throughout the book. Every time you see a new function, think about the following four questions.
- What type of input does the function accept?
- What does the function do?
- What does the function return as output?
- What are some typical examples of how to use the function?
In this section, we focus on functions that do things that you are likely already familiar with from your previous math courses.
We start with exp
. The function exp
takes one argument, named x
and returns \(e^x\). So, for example, exp(x = 1)
will compute \(e^1 = e\), as we saw above. In R, it is optional as to whether you supply the named version x = 1
or just 1
as the argument. So, it is equivalent to write exp(x = 1)
or exp(1)
. Typically, for functions that are “well-known”, the first argument or two will be given without names, then the rest will be provided with their names. Our advice is that if in doubt, include the name.
Next, we discuss the log
function. The function log
takes two arguments, one required, x
, and one optional with a default, base
, and returns \(\log_{b}x\), where \(b\) is the base
. The default value of base
is exp(1)
; in other words, the default logarithm is the natural logarithm. Here are some examples of using exp
and log
.
## [1] 7.389056
## [1] 2.079442
## [1] 3
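The calls behind the three values above are not shown. One set of calls that reproduces them (our guess at the inputs, not necessarily the original ones) is:

```r
# Assumed inputs chosen to match the printed output above.
exp(2)            # e squared, about 7.389056
log(8)            # natural logarithm of 8, about 2.079442
log(8, base = 2)  # logarithm of 8 in base 2, which is 3
```

The first argument is given without its name, while base is named, following the advice above.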
If you type ?log
in the console, you will see the help page for log
and exp
. We encourage you to read the help page, so that you can start to get a feel for the layout of the help pages for functions. For example, under the heading “Description,” the help page says that log
computes logarithms, by default natural logarithms. Under the heading “Usage”, it gives log(x, base = exp(1))
. This indicates that log
takes two arguments, x
and base
, and that base
has a default of exp(1)
. Under the heading “Arguments,” it states that x
is a numeric or complex vector, and base
is a positive or complex number. This might be confusing for now, because we indicated that x
was a positive real number above, but the help page indicates that log
is more flexible than what we have seen so far. Finally, under the heading “Examples,” the help page provides code that you can copy and paste into the R console to see what the function does. In this case, it provides, among other things, log(exp(3))
, which is equal to 3. It can take some time to get used to reading R Documentation. For now, we recommend reading those 4 headings to see whether there are things you can learn about new functions. Don’t worry if there are things in the documentation that you don’t yet understand.
You can’t get very far without storing results of your computations to variables! The way to do so is with the <-
command, as shown below.
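The assignment code is missing from this copy of the text. A sketch that is consistent with the surrounding discussion (a variable named height, a comment reading #in inches, and a final value of 192) might look like this; the individual steps are our assumption:

```r
height <- 62 #in inches
height <- height + 2   # overwrite height with 62 + 2 = 64
height <- 3 * height   # overwrite again with 3 * 64 = 192
```

Nothing is printed by these lines, because an assignment stores a value without displaying it.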
Note that Alt
+ -
is the keyboard shortcut for <-
when working on the command line. (That means that <-
is one keystroke less than =
!) The #in inches
part of the code above is a comment. These are provided to give the reader information about what is going on in the R code, but are not executed and have no impact on the output.
If you want to see what value is stored in a variable, you can
- type the variable name
## [1] 192
- look in the environment box in the upper right-hand corner of RStudio.
- use the str command. This command gives other useful information about the variable, in addition to its value.
## num 192
This says that height contains num-eric data, and its current value is 192 (which is 3(62 + 2)). Note that there is a big difference between typing height + 2
(which computes the value of height + 2
and displays it on the screen) and typing height <- height + 2
, which computes the value of height + 2
and stores the new value back in height
.
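The difference between displaying a computed value and storing it can be checked directly. This short sketch starts from the value 192 used in the text:

```r
height <- 192         # the value discussed above
height                # typing the name prints the stored value
str(height)           # prints the type (num) along with the value
height + 2            # displays 194, but height is still 192
height <- height + 2  # stores the new value back in height
height                # height is now 194
```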
It is important to choose your variable names wisely. Variables in R cannot start with a number, and for our purposes, they should not start with a period. Do not use T
or F
as a variable name. Think twice before using c, q, t, C, D, or I as variable names, as they are already defined. It may also be a bad idea (and is one of the most frustrating things to debug on the rare occasions that it causes problems) to use sum
, mean
, or other commonly used functions as variable names. T
and F
are variables with default values TRUE
and FALSE
, which can be changed. We recommend writing out TRUE
and FALSE
rather than using the shortcuts T
and F
for this reason.
We also misspoke when we said pi
is a constant. It is actually a variable which is set to 3.141593 when R is started, but can be changed to any value you like. If you find that pi
or T
or F
has been changed from a default, and you want to have them return to the default state, you have a couple of choices. You can restart R by clicking on Session/Restart R in RStudio. This will do more than just reset variables to their default values; it will reset R to its start-up state. Or, you can remove a variable from the R environment by using the function rm()
. The function rm
accepts the name of a variable and removes it from memory. As an example, look at the code below:
## [1] 3.141593
## [1] 3.2
## [1] 3.141593
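The code for this example is not shown. A reconstruction that matches the three printed values (the replacement value 3.2 is taken from the output) is:

```r
pi          # the default value, about 3.141593
pi <- 3.2   # create our own variable named pi, which hides the default
pi          # now 3.2
rm(pi)      # remove our variable from the environment
pi          # the default value is visible again
```

Here rm removes only the copy we created; the built-in value was never deleted, just hidden.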
We end this section with a couple more hints. If you want to remove all of the variables from your working environment, you can click on the broom icon in the Environment tab in RStudio. You may want to do this from time to time, as R can slow down when it has too many large variables in the current environment.
If you have a longish variable name in your environment, then you can use the tab key to auto-complete.
1.2 Vectors
Data often takes the form of lists of values, rather than single values. We need to be able to store lists of values in order to be able to work with them. For this, we use vectors. A vector is a list of values (usually) of length bigger than one. (Formally, a list
is a different data type. Here, we just mean a finite sequence of values of the same type.)
1.2.1 Creating vectors
There are many ways to create vectors. Perhaps the easiest is:
## [1] 2 3 5 7 11
The c
is a function that combines the values given to it into a vector. In this case, the vector is the list of the first 5 prime numbers. We can store vectors in variables just like we did with numbers:
You can also create a vector of numbers in order using the :
operator:
## [1] 1 2 3 4 5 6 7 8 9 10
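The commands for the vector examples above are missing from this copy. They can be sketched as follows; the variable name primes is the one used later in the chapter:

```r
c(2, 3, 5, 7, 11)            # combine values into a vector and print it
primes <- c(2, 3, 5, 7, 11)  # store the vector in a variable
one_to_ten <- 1:10           # the : operator gives consecutive integers
one_to_ten
```

The name one_to_ten is only for illustration.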
The rep
function is a more flexible way of creating vectors. It has a required argument x
, which is a vector of values that are to be repeated (could be a single value as well). It has optional arguments times
, length.out
and each
. Normally, just one of the additional arguments is specified, and we will not discuss what happens if multiple are specified. The argument times
specifies the number of times that the value in x
is repeated. This can either be specified as a single value, in which case the values of x
are repeated that many times, or as a vector of values the same length as x
. In this case, the values in x
are repeated the number of times associated with the vector, but the ordering is different. For example, see the code below.
## [1] 2 3 2 3
## [1] 2 2 3 3
## [1] 2 2 3 3 3
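The calls that produce the three lines of output above are not shown. These calls to rep reproduce them:

```r
rep(c(2, 3), times = 2)        # whole vector repeated twice: 2 3 2 3
rep(c(2, 3), times = c(2, 2))  # each value repeated twice: 2 2 3 3
rep(c(2, 3), times = c(2, 3))  # 2 twice, then 3 three times: 2 2 3 3 3
```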
Alternatively, you can specify the length of the vector that you are trying to obtain, using length.out
. R will truncate the last repetition of the vector that you are repeating to force the length to be exactly length.out
.
## [1] 2 3 2 3 2 3
## [1] 2 3 2
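The output above is consistent with the following two calls (our reconstruction, with the lengths 6 and 3 read off the output):

```r
rep(c(2, 3), length.out = 6)  # three full repetitions: 2 3 2 3 2 3
rep(c(2, 3), length.out = 3)  # last repetition is cut short: 2 3 2
```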
Finally, setting each
to a number repeats each value in x
the same number of times. However, it orders the values as if you had written a vector of values in times
.
## [1] "Bryan" "Bryan" "Darrin" "Darrin"
## [1] "Bryan" "Darrin" "Bryan" "Darrin"
We see here that the original vector x
does not have to be numeric!
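The two lines of character output above can be reproduced as follows. The first call matches the description of each; we assume the second call used times, for contrast:

```r
authors <- c("Bryan", "Darrin")  # hypothetical variable name
rep(authors, each = 2)           # "Bryan" "Bryan" "Darrin" "Darrin"
rep(authors, times = 2)          # "Bryan" "Darrin" "Bryan" "Darrin"
```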
One last useful function for creating vectors is seq
. This is a generalization of the :
operator described above. We will not go over the entire list of arguments associated with seq
, but we note that it has arguments from
, to
, by
and length.out
. We provide a couple of examples that we hope illustrate well enough how to use seq
.
## [1] 1 3 5 7 9 11
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0 10.5 11.0
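The calls to seq are not shown. The following calls reproduce the output above; the specific argument values are inferred from that output:

```r
seq(from = 1, to = 11, by = 2)           # step size 2: 1 3 5 7 9 11
seq(from = 1, to = 11, length.out = 21)  # 21 evenly spaced values, so steps of 0.5
```

Use by when you know the step size and length.out when you know how many values you need.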
Once you have created a vector, you may also want to do arithmetic or other operations on it. Most of the operations in R work “as expected” with vectors. Suppose you wanted to see what the square roots of the first 5 primes were. You might guess:
## [1] 1.414214 1.732051 2.236068 2.645751 3.316625
and you would be right! Returning to the cryptic manual entry in log
, we recall that it stated that x
is a numeric vector. This is the documentation’s way of telling us that log
is vectorized so that if we supply it with a vector of values, then log
will compute the log of each value separately. For example,
## [1] 0.6931472 1.0986123 1.6094379 1.9459101 2.3978953
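The vectorized calls discussed above can be written as follows, using the primes vector from earlier in the section:

```r
primes <- c(2, 3, 5, 7, 11)
sqrt(primes)  # square root of each element
log(primes)   # natural logarithm of each element
```

Each function returns a vector of the same length as its input.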
Guess what would happen when you type primes + primes
, primes * primes
and primes & primes
. Were you right?
1.3 Indexing Vectors
To examine or use a single element in a vector, you need to supply its index. primes[1]
is the first element in the vector of primes, primes[2]
is the second, and so on.
## [1] 2
## [1] 3
You can do many things with indexes. For example, you can provide a vector of indices, and R will return a new vector with the values associated with those indices.
## [1] 2 3 5
You can remove a value from a vector by using a -
sign.
## [1] 3 5 7 11
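The indexing commands for the output above are missing. A reconstruction, assuming the indices 1:3 and -1 that match the printed values:

```r
primes <- c(2, 3, 5, 7, 11)
primes[1]    # first element: 2
primes[2]    # second element: 3
primes[1:3]  # elements 1 through 3: 2 3 5
primes[-1]   # everything except the first element: 3 5 7 11
```

Note that primes[-1] returns a new vector; primes itself is unchanged unless you assign the result back.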
You can provide a vector of TRUE
and FALSE
values as an index, and R will return the values that are associated with TRUE
. As a beginner, you should make the vector of TRUE
and FALSE
values the same length as the original vector.
## [1] 2 5 11
The construct of providing a Boolean vector (that is, a vector containing TRUE
and FALSE
) for indexing is most useful for constructions like this. Suppose we wanted to “pull out” the values in primes
that are bigger than 6. We would first create an appropriate vector of TRUE
and FALSE
values, then index primes
by it.
## [1] FALSE FALSE FALSE TRUE TRUE
## [1] 7 11
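Logical indexing, as described above, can be sketched like this. The explicit TRUE/FALSE vector is our reconstruction from the output 2 5 11:

```r
primes <- c(2, 3, 5, 7, 11)
primes[c(TRUE, FALSE, TRUE, FALSE, TRUE)]  # keeps positions 1, 3 and 5: 2 5 11
primes > 6                                 # FALSE FALSE FALSE TRUE TRUE
primes[primes > 6]                         # values bigger than 6: 7 11
```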
R comes with many built-in data sets. For example, the rivers
dataset is a vector containing the length of major North American rivers. Try typing ?rivers
to see some more information about the data set. Let’s see what the data set contains.
## [1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870
## [16] 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280
## [31] 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600
## [46] 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350
## [61] 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260
## [76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735
## [91] 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377
## [106] 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540
## [121] 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529
## [136] 500 720 270 430 671 1770
By typing ?rivers
, we learn that this data set gives the lengths (in miles) of 141 major rivers in North America, as compiled by the US Geological Survey. This data set is explored further in the exercises in this chapter. We will often want to examine only the first few elements when the data set is large. For that, we can use the function head
, which by default shows the first n = 6
elements.
## [1] 735 320 325 392 524 450 1459 135 465 600
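The output above shows ten values, so it was presumably produced with n = 10 rather than the default. Both forms are sketched here:

```r
head(rivers)          # default: the first 6 elements
head(rivers, n = 10)  # the first 10 elements, matching the output above
```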
Here are some functions that can be applied to vectors, which we believe are self-explanatory.
## [1] 135 202 210 210 215 217 230 230 233 237 246 250 250 250 255
## [16] 259 260 260 265 268 270 276 280 280 280 281 286 290 291 300
## [31] 300 300 301 306 310 310 314 315 320 325 327 329 330 332 336
## [46] 338 340 350 350 350 350 352 360 360 360 360 375 377 380 380
## [61] 383 390 390 392 407 410 411 420 420 424 425 430 431 435 444
## [76] 445 450 460 460 465 470 490 500 500 505 524 525 525 529 538
## [91] 540 545 560 570 600 600 600 605 610 618 620 625 630 652 671
## [106] 680 696 710 720 720 730 735 735 760 780 800 840 850 870 890
## [121] 900 900 906 981 1000 1038 1054 1100 1171 1205 1243 1270 1306 1450 1459
## [136] 1770 1885 2315 2348 2533 3710
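Only the sorted output survives above, which corresponds to sort(rivers). The other calls in this sketch are our own choice of similar self-explanatory functions, not necessarily the ones originally shown:

```r
sort(rivers)    # values in increasing order, as printed above
length(rivers)  # number of rivers: 141
min(rivers)     # shortest length: 135
max(rivers)     # longest length: 3710
mean(rivers)    # average length
```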
The discoveries
data set is a vector containing the number of “great” inventions and scientific discoveries in each year from 1860 to 1959. Try ?discoveries
to see more information about the discoveries
data set. You might try the examples listed there just to see what they do, but we won’t be doing anything like that yet. Let’s see what the data set contains.
## Time Series:
## Start = 1860
## End = 1959
## Frequency = 1
## [1] 5 3 0 2 0 3 2 3 6 1 2 1 2 1 3 3 3 5 2 4 4 0 2 3 7
## [26] 12 3 10 9 2 3 7 7 2 3 3 6 2 4 3 5 2 2 4 0 4 2 5 2 3
## [51] 3 6 5 8 3 6 6 0 5 2 2 2 6 3 4 4 2 2 4 7 5 3 3 0 2
## [76] 2 2 1 3 4 2 2 1 1 1 2 1 4 4 3 2 1 4 1 1 1 0 0 2 0
If we type str(discoveries)
we see that the data set is stored as a time series rather than as a vector. For our purposes in this section, that will be an unimportant distinction, and we can simply think of the variable as a vector of numeric values.
The first 6 elements are:
## [1] 5 3 0 2 0 3
Here are a few more things you can do with a vector:
## discoveries
## 0 1 2 3 4 5 6 7 8 9 10 12
## 9 12 26 20 12 7 6 4 1 1 1 1
## [1] 12
## [1] 310
## [1] 6 7 12 10 9 7 7 6 6 8 6 6 6 7
## [1] 1868 1884 1885 1887 1888 1891 1892 1896 1911 1913 1915 1916 1922 1929
When table
is provided a vector, it provides a table of the number of occurrences of each value in the vector. It will not provide zeros for values that are not there, even if it seems “obvious” to a human that there might have been a place for that value. The function which
accepts a vector of TRUE
and FALSE
values, and returns the indices in the vector that are TRUE
. So, in the last line of the code above, adding 1859 to the indices gives the years that had more than 5 great discoveries.
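The code for the discoveries output above is not shown. The following reconstruction reproduces each printed result; the threshold of 5 and the offset 1859 come from the text:

```r
head(discoveries)                # first 6 elements
table(discoveries)               # how often each yearly count occurs
max(discoveries)                 # largest yearly count: 12
sum(discoveries)                 # total number of discoveries: 310
discoveries[discoveries > 5]     # the counts that exceed 5
which(discoveries > 5) + 1859    # the years in which those counts occurred
```

Index 1 corresponds to 1860, which is why 1859 is added to the indices.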
1.4 Data Types
There are several types of data that R understands. Data can be stored as numeric
, integer
, character
, factor
and logical
data types, among others. There are lots of issues and special cases that come up when dealing with these data types, and we do not plan to go over all of them here.
numeric
data is numerical data, including all real numbers. If you type x <- 2
, then x
will be stored as numeric
data. (You can test this by typing str(x)
.)
integer
data is data that is integers! If you type x <- 2L
, then x
will be stored as an integer. When reading data in from files, R will store data that is all integers as integer data, unlike when you type in a value such as x <- 2
. Again, str()
is your best friend here. For the purposes of this book, it will not be necessary to make a strong distinction between numerical data and integer data that is stored in R.
character
data is what many languages call strings. It is a collection of characters. If you type x <- "hello"
, then x
is a character
variable. Compare str("hello")
to str(c(1,2))
. Note that if you want to access the e
from hello
, you cannot use x[2]
. If you find yourself in the situation where you need to manipulate strings, we recommend using the stringr
package.
logical
data is TRUE
and FALSE
.
factor
data is common in statistics, but maybe not so commonly implemented and used in other programming languages. factor
data can take on values in a predefined set: the variable continent
could be set up to allow only entries of Africa
, Antarctica
, Asia
, Australia
, Europe
, North America
or South America
, for example.
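The data types described above can be inspected with str. This is a small sketch; the variable names are hypothetical, and substr is one base R way to reach a single character (the text recommends stringr for heavier string work):

```r
x_num <- 2        # numeric
x_int <- 2L       # integer
x_chr <- "hello"  # character
x_lgl <- TRUE     # logical

str(x_num)
str(x_int)
str(x_chr)
str(x_lgl)

x_chr[2]             # NA, because x_chr is a vector with only one element
substr(x_chr, 2, 2)  # "e", the second character of the string

# A factor whose allowed values are fixed in advance by levels.
continent <- factor(
  c("Asia", "Europe", "Asia"),
  levels = c("Africa", "Antarctica", "Asia", "Australia",
             "Europe", "North America", "South America")
)
str(continent)
levels(continent)
```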
My experience has been that students underestimate the importance of knowing what type of data they are working with. R works really well when the data types are assigned properly. However, some bizarre things can occur when you try to force R to do something with a data type that is different than what you think it is! My strong suggestion is, whenever you examine a new data set (especially one that you read in from a file!), your first move is to use str()
on it, followed by head()
. Make sure that the data is stored the way you want before you continue with anything else.
1.4.1 Missing data
Missing data is a problem that comes up frequently, and R uses the special value NA
to represent it. NA
isn’t a data type, but a value that can take on any data type. It stands for Not Available, and it means that there is no data collected for that value.
As an example, consider the vector airquality$Ozone
, which is part of base R:
## [1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6
## [19] 30 11 1 11 4 32 NA NA NA 23 45 115 37 NA NA NA NA NA
## [37] NA 29 NA 71 39 NA NA 23 NA NA 21 37 20 12 13 NA NA NA
## [55] NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
## [73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50
## [91] 64 59 39 9 16 78 35 66 122 89 110 NA NA 44 28 65 NA 22
## [109] 59 23 31 44 21 9 NA 45 168 73 NA 76 118 84 85 96 78 73
## [127] 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
## [145] 23 36 7 14 30 NA 14 18 20
This shows the daily ozone levels (ppb) in New York during the summer of 1973. On many days the ozone level was not recorded, and so this vector contains numerous NA
values. Most R functions will force you to decide what to do with missing values, rather than make assumptions. For example, to find the mean ozone level for the days with data, we must specify that the NA
values should be removed with the argument na.rm = TRUE
:
## [1] NA
## [1] 42.12931
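The two results above come from calling mean with and without na.rm = TRUE:

```r
mean(airquality$Ozone)                # NA, because some values are missing
mean(airquality$Ozone, na.rm = TRUE)  # mean of the recorded values, about 42.12931
```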
1.5 Data Frames
Returning to the built-in data set rivers
, it would be very useful if the rivers
data set also stored the names of the rivers.
That is, for each river, we would like to know both the name of the river and the length of the river. We might organize the data by having one column, titled river
, that gave the name of the rivers, and another column, titled length
, that gave the length of the rivers.
This leads us to one of the most common data types in R, the data frame.
A data frame consists of a number of observations of variables. Some examples would be:
- The name and length of major rivers.
- The height, weight and blood pressure of a sample of healthy, adult females.
- The high and low temperature in St Louis, MO, for each day of 2016.
As a specific example, let’s look at the data set mtcars
, which is a predefined data set in R.
Start with str(mtcars)
. You can see that mtcars
consists of 32 observations of 11 variables. The variable names are mpg, cyl, disp
and so on. You can also type ?mtcars
on the console to see information on the data set. Some data sets have more detailed help pages than others, but it is always a good idea to look at the help page.
You can see that the data is from the 1974(!) Motor Trend magazine. You might wonder why we use such an old data set. In the R community, there are standard data sets that get used as examples when people create new code. The fact that familiar data sets are usually used lets people focus on the new aspect of the code rather than on the data set itself. In this course, we will use a mix of data sets: some will be up-to-date and hopefully interesting, while others are included so that you begin to familiarize yourself with the common data sets that developeRs use.
There are two ways to access the data in mtcars
. You can use $
notation or [ ]
notation. To examine the weights of the cars, for example, we could do
## [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
## [13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
## [25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
Or, we could do mtcars[,"wt"]
or mtcars[,6]
. If we want to see what the third car’s weight is, we could use
## [1] 2.32
Or, we could use mtcars[3,6]
. If we want to form a new data frame, call it smallmtcars
,
that only contains the variables mpg
, cyl
and qsec
, we could use
smallmtcars <- mtcars[,c(1,2,7)]
or perhaps easier smallmtcars <- mtcars[,c("mpg", "cyl", "qsec")]
because we don’t have to figure out which columns correspond to those variables.
If we want to look at only the first 10 observations, we could use mtcars[1:10,]
.
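The column and row selections described above can be collected in one place. The expressions mtcars$wt and mtcars$wt[3] are our reconstruction of the two commands whose output is printed above:

```r
mtcars$wt        # weights of all cars, using $ notation
mtcars[, "wt"]   # the same column, selected by name
mtcars[, 6]      # the same column, selected by position
mtcars$wt[3]     # weight of the third car: 2.32
mtcars[3, 6]     # the same value, by row and column index

smallmtcars <- mtcars[, c("mpg", "cyl", "qsec")]  # keep three variables
str(smallmtcars)

mtcars[1:10, ]   # the first 10 observations, all variables
```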
We can also select observations of the data that satisfy certain properties.
For example, if we want to pull out all observations that get more than 25 miles per gallon, then we could use mtcars[mtcars$mpg > 25,]
.
In order to test equality of two values, you use ==
. For example, in order to see which cars have 2 carburetors, we can use mtcars[mtcars$carb == 2,]
.
Finally, to combine multiple conditions, you can use the vector logical operators &
for and and |
for or.
As an example, to see which cars either have 2 carburetors or 3 forward gears (or both), we would use mtcars[mtcars$carb == 2 | mtcars$gear == 3,]
.
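The filtering commands above can be run together as a short sketch. The names of the result variables are hypothetical, and the & example is added only to contrast with |:

```r
high_mpg <- mtcars[mtcars$mpg > 25, ]    # more than 25 miles per gallon
two_carb <- mtcars[mtcars$carb == 2, ]   # exactly 2 carburetors

# Either condition (or both), using |
carb2_or_gear3 <- mtcars[mtcars$carb == 2 | mtcars$gear == 3, ]

# Both conditions at once, using &
carb2_and_gear3 <- mtcars[mtcars$carb == 2 & mtcars$gear == 3, ]

nrow(carb2_or_gear3)    # number of cars meeting at least one condition
nrow(carb2_and_gear3)   # number of cars meeting both conditions
```

The comma after the condition matters: it tells R to filter rows and keep all columns.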
There are several exercises below which will allow you to practice manipulating data frames.
In Chapter 5, we will introduce dplyr
tools which we will use to do more advanced manipulations,
but it is good to be able to do basic things with [,]
and $
as well.
Example
The airquality
data frame is part of base R, and it gives air quality measurements for New York City in the summer of 1973.
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
From the structure, we see that airquality
has 153 observations of 6 variables. The Wind
variable is numeric and the others are integers.
We now find the hottest temperature recorded, the average temperature in June, and the day with the most wind:
## [1] 97
## [1] 79.1
## Ozone Solar.R Wind Temp Month Day
## 48 37 284 20.7 72 6 17
It got to \(97^\circ\) at some point, it averaged \(79.1^\circ\) in June, and June 17 was the windiest day that summer.
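The code for this example is not shown. One way to obtain the three results above, which may differ in detail from the original commands, is:

```r
max(airquality$Temp)                          # hottest temperature: 97
mean(airquality$Temp[airquality$Month == 6])  # average June temperature: 79.1
airquality[which.max(airquality$Wind), ]      # the row with the most wind
```

Here which.max returns the index of the largest value, which is then used to pull out the entire row.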
1.6 Reading data from files
Loading data into R is one of the most important things to be able to do. If you can’t get R to load your data, then it doesn’t matter what kinds of neat tricks you could have done. It can also be one of the most frustrating things - not just in R, but in general. Your data might be on a web page, in an Excel spreadsheet, or in any one of dozens of other formats each with its own idiosyncrasies. R has powerful packages that can deal with just about any format of data you are likely to encounter, but for now we will focus on just one format, the CSV file. Usually, data in a CSV file will have the extension “.csv” at the end of its name. CSV stands for “Comma Separated Values” and means that the data is stored in rows with commas separating the variables. For example, CSV formatted data might look like this:
"Gender","Body.Temp","Heart.Rate"
"Male",96.3,70
"Male",96.7,71
"Male",96.9,74
"Female",96.4,69
"Female",96.7,62
This would mean that there are three variables: Gender
, Body.Temp
and Heart.Rate
. There are 5 observations; 3 males and 2 females. The first male had a body temperature of 96.3 and a heart rate of 70.
The command to read a CSV file into R is read.csv
. Its main argument is a string giving the path to the file on your computer. R always has a working directory, which you can find with the getwd()
command, and you can see with the Files tab in RStudio. If your file is stored in that directory you can read it with the command read.csv("mydatafile.csv")
.
More advanced users may want to set up a file structure that has data stored in a separate folder, in which case they must specify the pathname to the file they want to load. The easiest way to find the full pathname to a file is with the command file.choose()
, which will open an interactive dialog where you can find and select the file. Try this from the R console and you will see the full path to the file, which you can then use as the argument to read.csv
. Using file.choose()
requires you, the programmer, to provide input. One of the main reasons to use R is that analysis with R is reproducible, and can be performed without user intervention, so using interactive functions means your analysis will not be reproducible.
If you’ve tried read.csv
you may have noticed that it printed the contents of your CSV file to the console. To actually use the data, you need to store it in a variable as a data frame. Try to choose a name that is descriptive of the actual contents of your data file. For example, to load the file normtemp.csv
, which contains the gender, body temperature and heart rate data mentioned above, you would type temp_data <- read.csv("normtemp.csv")
, or provide the full path name.
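To try read.csv without having normtemp.csv on hand, this sketch first writes the five sample rows shown earlier to a temporary file and then reads them back. The file name normtemp_sample.csv and the use of tempdir() are our choices for illustration:

```r
# The sample CSV text from earlier in this section.
csv_lines <- c(
  '"Gender","Body.Temp","Heart.Rate"',
  '"Male",96.3,70',
  '"Male",96.7,71',
  '"Male",96.9,74',
  '"Female",96.4,69',
  '"Female",96.7,62'
)

# Write it to a temporary location so that nothing in your working directory changes.
csv_path <- file.path(tempdir(), "normtemp_sample.csv")
writeLines(csv_lines, csv_path)

# Read the file and store the result as a data frame.
temp_data <- read.csv(csv_path)
str(temp_data)
head(temp_data)
```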
In other instances, the CSV file that you want to read is hosted on a web page. In this case, it is sometimes easier to read the file directly from the web page by using read.csv("http://website.csv")
. As an example, there is a csv hosted at http://stat.slu.edu/~speegle/_book_data/stlTempData.csv, which you can load by typing stlTemp <- read.csv("http://stat.slu.edu/~speegle/_book_data/stlTempData.csv")
.
I can’t emphasize enough the importance of looking at your data after you have loaded it. Start by using str()
, head()
and summary()
on your variable after reading it in. As often as not, there will be something you will need to change in the data frame.
Finally, you can also write R data frames to a CSV file, in order to share with other people. If you have a data frame that you wish to store as a .csv file, you use the write.csv()
command. If your row names are not meaningful, then often you will want to add row.names = FALSE
. The command write.csv(mtcars, "mtcars_file.csv", row.names = FALSE)
writes the variable mtcars
to the file mtcars_file.csv
, which is again stored in the directory specified by getwd()
by default.
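A round trip is a quick way to check what write.csv produced. This sketch writes to a temporary folder rather than the working directory, which is our choice to avoid leaving files behind:

```r
out_path <- file.path(tempdir(), "mtcars_file.csv")
write.csv(mtcars, out_path, row.names = FALSE)  # the car names (row names) are not written

mtcars_copy <- read.csv(out_path)  # read the file back in
dim(mtcars_copy)                   # same number of rows and columns as mtcars
head(mtcars_copy)                  # rows are now numbered instead of named
```

Because row.names = FALSE drops the car names, keep the default if you need them in the file.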
1.7 Packages
When you first start using R, the commands and data available to you are called “Base R”. The R language is extensible, which means that over the years people have added functionality. New functionality comes in the form of a package, which may be included in your R distribution or which you may need to install. For example, the HistData
package contains a few dozen data sets with historical significance.
Happily, installing packages is extremely simple: in RStudio you can click the Packages tab in the lower right panel, and then hit the Install button to install any package you need. Alternatively, you can use the install.packages
command, like so:
Installing packages does require an internet connection, and frequently when you install one package R will automatically install other packages, called dependencies, that the package you want must have to work.
Package installation is not a common operation. Once you have installed a package, you have it forever. However, each time you want to use the contents of a package, you need to tell R to load it. You do this with the library
command:
Once you have loaded the package, the contents of the package are available to use. HistData contains a data set DrinksWages
with Elderton and Pearson’s 1910 data on drinking and earned wages. After loading HistData you can inspect DrinksWages and learn that rivetters were paid well in 1910:
## class trade sober drinks wage n
## 1 A papercutter 1 1 24.00000 2
## 2 A cabmen 1 10 18.41667 11
## 3 A goldbeater 2 1 21.50000 3
## 4 A stablemen 1 5 21.16667 6
## 5 A millworker 2 0 19.00000 2
## 6 A porter 9 8 20.50000 17
## class trade sober drinks wage n
## 64 C rivetter 1 0 40 1
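The install, load and inspect steps described above can be sketched as follows. The install line is commented out so that the code runs without an internet connection, and the filter on the maximum wage is our guess at how the rivetter row was found:

```r
# install.packages("HistData")  # run once per computer; requires an internet connection

# Check whether the package is available before trying to load it.
has_histdata <- requireNamespace("HistData", quietly = TRUE)

if (has_histdata) {
  library(HistData)     # load the package for this session
  print(head(DrinksWages))
  # The row or rows with the highest wage
  print(DrinksWages[DrinksWages$wage == max(DrinksWages$wage), ])
} else {
  message("HistData is not installed; run install.packages(\"HistData\") first.")
}
```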
Some packages are large, and we only require one small part of them. In that case, we use the ::
double colon operator to refer to the object required without loading the entire package. For example, MASS::anorexia
can access the anorexia
data from the MASS
package without loading the large and messy MASS
package into your workspace:
## Treat Prewt Postwt
## 1 Cont 80.7 80.2
## 2 Cont 89.4 80.1
## 3 Cont 91.8 86.4
## 4 Cont 74.0 86.3
## 5 Cont 78.1 76.1
## 6 Cont 88.3 78.1
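The output above can be reproduced with the :: operator. This sketch checks that MASS is installed first, which is our addition:

```r
has_mass <- requireNamespace("MASS", quietly = TRUE)

if (has_mass) {
  # Access one data set from MASS without calling library(MASS).
  anorexia_head <- head(MASS::anorexia)
  print(anorexia_head)
}
```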
Learning R with this book will require you to use a variety of packages. Though you need only install each package one time, you will need to use the ::
operator or load it with library
each time you start a new R session. One of the more common errors you will encounter is: Error: object 'so-and-so' not found
, which may mean that so-and-so
was part of a package you forgot to load.
1.8 Errors and Warnings
R, like most programming languages, is very picky about the instructions you give it. It pays attention to uppercase and lowercase
characters, similar looking symbols like =
and ==
mean very different things, and every bit of punctuation is important.
When you make mistakes (called bugs) in your code, a few things may happen: errors, warnings, and incorrect results. Code that runs but runs incorrectly is usually the hardest problem to fix, since the computer sees nothing wrong with your code and debugging is left entirely to you. The simplest bug is when your code produces an error. Here are a few examples.
## Error in mean(primse): object 'primse' not found
## Error in `[.data.frame`(mtcars, , 100): undefined columns selected
## Error: <text>:1:29: unexpected '='
## 1: airquality[airquality$Month =
## ^
The first is a typical spelling error. In the second we asked for column 100 of mtcars
, which has only 11 columns. In the third,
we used =
instead of ==
. You will encounter these sorts of errors all the time and then quickly graduate to much more subtle bugs.
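The three commands that caused these errors are not shown. The sketch below reconstructs them from the error messages and captures each message with tryCatch so that the code keeps running; the helper error_message is hypothetical:

```r
# Hypothetical helper: evaluate an expression and return its error message, if any.
error_message <- function(expr) {
  tryCatch(
    {
      force(expr)
      "no error"
    },
    error = function(e) conditionMessage(e)
  )
}

msg_typo <- error_message(mean(primse))     # primse is a misspelling of primes
msg_column <- error_message(mtcars[, 100])  # mtcars has only 11 columns

# A syntax error cannot appear in runnable code, so we parse it from text instead.
msg_syntax <- error_message(parse(text = "airquality[airquality$Month = 6, ]"))

msg_typo
msg_column
msg_syntax
```

In the console you would simply type the faulty command and read the error; the helper only makes the messages easy to collect.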
Warnings occur when R detects a potential problem in your code but can continue working. For example, here we try to assign an entire vector to one element of a vector. R cannot do this, so it assigns the first element of the vector and prints a warning message.
## Warning in a[5] <- 100:200: number of items to replace is not a multiple of
## replacement length
## [1] 1 2 3 4 100 6 7 8 9 10
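The warning above can be reproduced with a short example. The starting vector 1:10 is inferred from the printed result:

```r
a <- 1:10
a[5] <- 100:200  # warning: only the first replacement value, 100, is used
a                # the fifth element is now 100; the rest are unchanged
```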
Complicated statistical operations such as hypothesis tests and regression analysis frequently produce warnings or messages
that the user might not care about. The output of R commands in this book will sometimes omit these messages to save space
and focus attention on the important part of the output. If you notice your command producing warnings not shown in the book,
either ignore them or dig deeper and learn a little more about R.
In your own code, you can use the commands suppressWarnings and suppressMessages to remove extraneous output for presentation-quality work.
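Here is a small sketch of both commands. The function noisy_sqrt is a hypothetical helper written only for this illustration; it prints a message, and sqrt itself warns when given a negative number.

```r
# Hypothetical helper: emits a message, then takes square roots.
noisy_sqrt <- function(x) {
  message("Taking square roots of ", length(x), " values")
  sqrt(x)  # warns "NaNs produced" if any value is negative
}

# Run as-is: prints both the message and the warning.
loud <- noisy_sqrt(c(4, -1, 9))

# Wrapped: the same result, with the message and warning hidden.
quiet <- suppressWarnings(suppressMessages(noisy_sqrt(c(4, -1, 9))))
print(quiet)
```

Suppressing output does not change the result, so it is worth reading the warnings at least once before hiding them.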
Another pitfall that traps many novice R users is the + prompt. Working interactively in the console, you hit return and are given a + instead of the friendly > prompt.
This means that the command you typed was incomplete, probably because you opened a parenthesis ‘(’ and failed to close it with ‘)’.
Sometimes this behavior is desirable, allowing a long command to extend over two lines. More often, the + is unexpected. You can escape from this situation with the escape key, ESC, hence its name.
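As an illustration of the desirable case, the command below is deliberately split across two lines. The numbers are arbitrary. Typed at the console, the second line would be entered at a + prompt, because the parenthesis opened on the first line is still waiting to be closed.

```r
# The first line ends with an open parenthesis, so R waits for more input.
total <- sum(1, 2, 3,
             4, 5, 6)
print(total)
```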
1.9 Useful Idioms
Here is a summary list of useful programming idioms that we will use throughout the textbook, for ease of future reference. We assume vec is a numeric or integer vector.
- sum(vec == 3) counts the number of times that the value 3 occurs in the vector vec.
- mean(vec == 3) gives the proportion of times that the value 3 occurs in the vector vec.
- table(vec) counts the number of times that each value occurs in vec.
- max(table(vec)) gives the number of times that the most frequent value occurs in vec. To see which value that is, use which.max(table(vec)).
- length(unique(vec)) counts the number of distinct values that occur in vec.
- vec[vec > 0] creates a new vector that only includes values in vec that are positive.
- vec[!is.na(vec)] creates a new vector that only includes the non-missing values in vec.
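The idioms above can be tried on a small example. The vectors vec and with_na below are made up for illustration.

```r
vec <- c(3, -1, 3, 0, 2, 3, 2, 5)   # made-up example data

n_threes    <- sum(vec == 3)         # how many entries equal 3
prop_threes <- mean(vec == 3)        # proportion of entries equal to 3
counts      <- table(vec)            # count of each distinct value
top_count   <- max(counts)           # count of the most frequent value
top_value   <- names(which.max(counts))  # the most frequent value, as text
n_distinct  <- length(unique(vec))   # number of distinct values
positives   <- vec[vec > 0]          # only the positive entries

with_na  <- c(4, NA, 7, NA)          # made-up vector with missing values
complete <- with_na[!is.na(with_na)] # only the non-missing entries

print(counts)
print(positives)
print(complete)
```

Note that sum and mean work here because R treats TRUE as 1 and FALSE as 0 when doing arithmetic on logical vectors.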
Vignette: Data science communities
If you are serious about learning data science, the best thing you can do is to practice with real data. Finding appropriate data to practice on can be a challenge for beginners, but happily the R world abounds with online communities that share interesting data.
Both beginners and experts post visualizations, example code, and discussions of data from these sources regularly. Look at other developeRs’ code and decide what you like, and what you don’t. Incorporate their ideas into your own work!
Kaggle https://kaggle.com
A website that requires free registration. The Datasets section of Kaggle allows users to explore, analyze, and share quality data. Most datasets are clearly licensed for use, available in .csv format, and come with a description that explains the data. Each dataset has a discussion page where users can provide commentary and analysis.
Beyond data, Kaggle hosts machine learning competitions for competitors at all levels of expertise. Kaggle also offers R notebooks for cloud-based computing and collaboration.
Tidy Tuesday Twitter: #TidyTuesday
A project that arose from the R4DS Learning Community.
The project posts a new dataset each Tuesday. Datasets are suggested by the community and curated by the Tidy Tuesday organizers. Tidy Tuesday datasets emphasize understanding how to summarize and arrange data to make meaningful visualizations with ggplot2, tidyr, dplyr, and other tools in the tidyverse ecosystem.
Data scientists post their visualizations and code on Twitter. Tidy Tuesday data is available through a GitHub repository or with the R package tidytuesdayR.
Data Is Plural https://tinyletter.com/data-is-plural
A weekly newsletter of useful and curious datasets, maintained by Jeremy Singer-Vine. Datasets are well curated and come with source links. There is a shared spreadsheet with an archive of past data.
Stack Overflow https://stackoverflow.com
A community Q&A forum for every computer language and a few other things besides.
It has over 300,000 questions tagged r. If you ask a search engine a question about R, you will likely be directed to Stack Overflow.
If you can’t find an answer already posted, create a free account and ask the question yourself.
It is common to get expert answers within hours.
R Specific Groups https://rladies.org/ and https://jumpingrivers.github.io/meetingsR/r-user-groups.html
Both of these groups support R users with educational opportunities. R Ladies is an organization that promotes gender diversity in the R community. They also hold meetups in various locations around the world to get people excited about using R. UseR groups primarily host meetups where they discuss various aspects of R, from beginning to advanced. If you find yourself wanting to get connected to the larger R community, these are good places to start.
Exercises
- Let x <- c(1,2,3) and y <- c(6,5,4). Predict what will happen when the following pieces of code are run. Check your answer.
  - x * 2
  - x * y
  - x[1] * y[2]
- Let x <- c(1,2,3) and y <- c(6,5,4). What is the value of x after each of the following commands? (Assume that each part starts with the values of x and y given above.)
  - x + x
  - x <- x + x
  - y <- x + x
  - x <- x + 1
- Determine the values of the vector vec after each of the following commands is run.
  - vec <- 1:10
  - vec <- 1:10 * 2
  - vec <- 1:10^2
  - vec <- 1:10 + 1
  - vec <- 1:(10 * 2)
  - vec <- rep(c(1,1,2), times = 2)
  - vec <- seq(from = 0, to = 10, length.out = 5)
- Use R to calculate the sum of the squares of all integers from 1 to 100: \(1^2 + 2^2 + \dotsb + 99^2 + 100^2\).
- Let x be the vector obtained by running the R command x <- seq(from = 10, to = 30, by = 2).
  - What is the length of x? (By length, we mean the number of elements in the vector. This can be obtained using the str function or the length function.)
  - What is x[2]?
  - What is x[1:5]?
  - What is x[1:3*2]?
  - What is x[1:(3*2)]?
  - What is x > 25?
  - What is x[x > 25]?
  - What is x[-1]?
  - What is x[-1:-3]?
- In this exercise, you will graph the function \(f(p) = p(1-p)\) for \(p \in [0,1]\).
  - Use seq to create a vector p of numbers from 0 to 1 spaced by 0.2.
  - Use plot to plot p in the x coordinate and p(1-p) in the y coordinate. Read the help page for plot and experiment with the type argument to find a good choice for this graph.
- Consider the built-in data frame airquality.
  - How many observations of how many variables are there?
  - What are the names of the variables?
  - What type of data is each variable?
  - Do you agree with the data type that has been given to each variable? What would have been some alternative choices?
- R has a built-in vector rivers which contains the lengths of major North American rivers.
  - Use ?rivers to learn about the data set.
  - Find the mean and sd of the rivers data.
  - Make a histogram (hist) of the rivers data.
  - Get the five number summary (summary) of rivers data.
  - Find the longest and shortest lengths of rivers in the set.
  - Make a list of all (the lengths of the) rivers longer than 1000 miles.
- There is a built-in data set state, which is really seven separate variables with names such as state.name, state.region, and state.area.
  - What are the possible regions a state can be in? How many states are in each region?
  - Which states have area less than 10000 square miles?
  - Which state’s geographic center is furthest south? (Hint: use which.min)
- Consider the mtcars data set.
  - Which cars have 4 forward gears?
  - What subset of mtcars does mtcars[mtcars$disp > 150 & mtcars$mpg > 20,] describe?
  - Which cars have 4 forward gears and manual transmission? (Note: manual transmission is 1 and automatic is 0.)
  - Which cars have 4 forward gears or manual transmission?
  - Find the mean mpg of the cars with 2 carburetors.
- In the text, we loaded the data at http://stat.slu.edu/~speegle/_book_data/stlTempData.csv by reading it directly from the web site. For large files, this can be time-consuming to do every time, and it also requires you to always have an internet connection when you want to use that data. Load the data set contained here by first downloading it onto your machine, putting it in the correct directory, and using read.csv.
- This problem uses the data set DrinksWages from the package HistData.
  - How many observations of how many variables are there? What types are the variables?
  - The variable wage contains the average wage for each profession. Which profession has the lowest wage?
  - The variable n contains the number of workers surveyed for each profession. Sum this to find the total number of workers surveyed.
  - Compute the mean wage for all workers surveyed by multiplying wage * n for each profession, summing, and dividing by the total number of workers surveyed.
- This problem uses the package Lahman, which you will probably need to install. Consider the data set Batting, which should now be available. It contains batting statistics of all major league players broken down by season since 1871. We will be using this data set some more in the data wrangling chapter of this book.
  - How many observations of how many variables are there?
  - Use the command head(Batting) to get a look at the first six lines of data.
  - What is the largest number of triples (X3B) that have been hit in a single season?
  - What is the playerID(s) of the person(s) who hit the most triples in a single season? In what year did it happen?
  - Which player hit the most triples in a single season since 1960?
This is not a completely uncontroversial statement. Many R users prefer to use =, and it is one of those things that you just can’t reason about. The Stack Overflow question “What are the differences between = and <- in R?” has over 191K views as of this writing.
Perhaps to 3.2, if you are Edward J. Goodwin trying to enact the “Indiana Pi Bill”.
Why L? We don’t know, but some have indicated that it comes from Long integers, a reference to the number of bits used to store R integers.
Well, at least until you update R to the newest version, which cleans out the packages that you had previously installed.