5 Data Manipulation
Data is organized in many different ways. In order to visualize and analyze it, we often need to transform data into a format that is more amenable to R. We encourage data to be organized in a “tidy” way. The data should be rectangular, where each row corresponds to one observation, and each column corresponds to one variable observed. Once data is in tidy form, it can require effort to answer relevant questions. Finally, it is often necessary to combine data from multiple sources.
In this section, we discuss the basics of data manipulation using the library dplyr
. We also introduce joining data sets and basic string operations. We will be using a data set that provides ratings of movies throughout this chapter. One challenge for beginners is finding appropriate data sets to practice on. We recommend the data sets at tidy tuesday as a source of many excellent data sets. It is also possible (and encouraged) to search for examples of how other people have approached the data sets provided on tidy tuesdays. One can find many examples from both beginners and experts on how to wrangle and visualize the data. You should look at other’s code with and decide what you like, and what you don’t. Incorporate their ideas into your own work! The exercises at the end of the chapter encourage you to explore some of the data sets in a more structured way.
5.1 Data manipulation
See also the introduction to dplyr vignette for more information. Let’s consider the movies
data set which we load in the following way.
movies <- read.csv("http://stat.slu.edu/~speegle/_book_data/movieLensData", as.is = TRUE)
The data set movies
consists of 100,000 observations of the variables MovieID, Title, Genres, UserID, Rating and Timestamp. This data is from a much larger data set, MovieLens, freely available from GroupLens Research. The movies
data set contains 746 consecutive users extracted from MovieLens, beginning at a random starting point.
Let’s begin by converting the data fame to a tibble
.
movies <- as_tibble(movies)
Now, we can just type movies
to get an overview of what is contained in the tibble movies. I like using tibbles because if I accidentally (or on purpose) type the name of a tibble, the output is presented in a usable form. I will use the word “data frame” below, even though I will typically be working with tibbles.
We will be interested in the following commands:
select()
forms a new data frame with selected columns.arrange()
forms a new data frame with row(s) arranged in a specified order.filter()
forms a new data frame consisting of rows that satisfy certain filtering conditions.mutate()
andtransmute()
allow you to create new columns out of a data frame. mutate adds to the data frame, and transmute creates a new data frame.summarize()
summarizes a data frame into a single row.distinct()
collapses the identical observations into a single one.group_by()
groups the data to perform tasks by groups.
Let’s see how to use some of these with our data frame movies
. You should try all of these commands to see what they do. (Be sure to use as_tibble
first so you can get a good look at the output!)
- Create a new data frame with the columns MovieID and Rating.
select(movies, Rating, MovieID)
. - Order the ratings by the time at which they were reviewed.
arrange(movies, Timestamp)
- Find all 5 star ratings.
filter(movies, Rating == 5)
- Find the mean of all Ratings in the data frame.
summarize(movies, mean(Rating))
- Form a data frame consisting of the unique User ID’s.
distinct(movies, UserID)
- We will save
mutate()
andtransmute()
for a different data set. group_by()
really only works well when combined with other commands.head
isn’t adplyr
command, but can be used to view a specified number of rows of a data frame.
The utility of these commands is more obvious when we start to combine them. The key tool for combining the commands is the pipe operator, %>%
. The pipe operator allows you to feed the result of one expression as input to a function. For example, x %>% f(y) %>% g(z)
becomes g(f(x,y),z)
.
Example Find the mean4 rating of Toy Story, which has MovieID 1.
filter(movies, MovieID == 1) %>%
summarize(mean(Rating))
## # A tibble: 1 x 1
## `mean(Rating)`
## <dbl>
## 1 3.88
The filter()
command creates a data frame that consists solely of the observations of Toy Story, and summarize()
computes the mean rating.
Example Create a data frame consisting of the observations associated with movies that have an average rating of 5 stars. That is, each rating of the movie was 5 stars.
In order to do this, we will use the group_by()
function to find the mean rating of each movie.
movies %>%
group_by(Title) %>%
summarize(Rating = mean(Rating))
## # A tibble: 6,200 x 2
## Title Rating
## <chr> <dbl>
## 1 ...All the Marbles (a.k.a. The California Dolls) (1981) 3.5
## 2 ...And God Spoke (1993) 4
## 3 ...And Justice for All (1979) 3.6
## 4 *batteries not included (1987) 3.12
## 5 10 (1979) 3.5
## 6 10 Items or Less (2006) 3
## 7 10 Rillington Place (1971) 4.25
## 8 10 Things I Hate About You (1999) 3.45
## 9 10 to Midnight (1983) 2.5
## 10 101 Dalmatians (1996) 3.19
## # … with 6,190 more rows
Now, we could sort those in decreasing order to see ones that have a mean rating of 5.
movies %>%
group_by(Title) %>%
summarize(Rating = mean(Rating)) %>%
arrange(desc(Rating))
## # A tibble: 6,200 x 2
## Title Rating
## <chr> <dbl>
## 1 49 Up (2005) 5
## 2 5,000 Fingers of Dr. T, The (1953) 5
## 3 Anastasia (1956) 5
## 4 Ay, Carmela! (¡Ay, Carmela!) (1990) 5
## 5 Ballad of Narayama, The (Narayama bushiko) (1983) 5
## 6 Band of the Hand (1986) 5
## 7 Beerfest (2006) 5
## 8 Big Clock, The (1948) 5
## 9 Boss of It All, The (Direktøren for det hele) (2006) 5
## 10 Boys (1996) 5
## # … with 6,190 more rows
Now, we can filter out those whose mean rating is 5. Note, it is not necessary to sort first.
movies %>%
group_by(Title) %>%
summarize(Rating = mean(Rating)) %>%
filter(Rating == 5)
## # A tibble: 113 x 2
## Title Rating
## <chr> <dbl>
## 1 49 Up (2005) 5
## 2 5,000 Fingers of Dr. T, The (1953) 5
## 3 Anastasia (1956) 5
## 4 Ay, Carmela! (¡Ay, Carmela!) (1990) 5
## 5 Ballad of Narayama, The (Narayama bushiko) (1983) 5
## 6 Band of the Hand (1986) 5
## 7 Beerfest (2006) 5
## 8 Big Clock, The (1948) 5
## 9 Boss of It All, The (Direktøren for det hele) (2006) 5
## 10 Boys (1996) 5
## # … with 103 more rows
And we see that there are 113 movies which have mean rating of 5.
Example Find the lowest rated movie.
movies %>%
group_by(Title) %>%
summarize(mr = mean(Rating)) %>%
arrange(mr)
## # A tibble: 6,200 x 2
## Title mr
## <chr> <dbl>
## 1 Avventura, L (a.k.a. Adventure, The) (1960) 0.5
## 2 Big Combo, The (1955) 0.5
## 3 BloodRayne (2005) 0.5
## 4 Bratz: The Movie (2007) 0.5
## 5 Care Bears Movie II: A New Generation (1986) 0.5
## 6 Chaos (2001) 0.5
## 7 Chinese Ghost Story, A (Sinnui yauwan) (1987) 0.5
## 8 Devil Doll (1964) 0.5
## 9 Diary of a Mad Black Woman (2005) 0.5
## 10 Gate II: Trespassers, The (1990) 0.5
## # … with 6,190 more rows
Yikes! 0.5 out of 5 stars!
To see which movie that received 5 stars has the most ratings, we would do:
movies %>%
group_by(Title) %>%
summarize(mr = mean(Rating), numRating = n()) %>%
arrange(desc(mr), desc(numRating))
## # A tibble: 6,200 x 3
## Title mr numRating
## <chr> <dbl> <int>
## 1 Lady Eve, The (1941) 5 4
## 2 Frances (1982) 5 3
## 3 Boys (1996) 5 2
## 4 Denise Calls Up (1995) 5 2
## 5 Manny & Lo (1996) 5 2
## 6 Pier, The (La Jetée) (1962) 5 2
## 7 Tin Drum, The (Blechtrommel, Die) (1979) 5 2
## 8 Tinker, Tailor, Soldier, Spy (1979) 5 2
## 9 Trouble in Paradise (1932) 5 2
## 10 49 Up (2005) 5 1
## # … with 6,190 more rows
And, we see that The Lady Eve had 4 ratings. Never heard of it. If you are interested, IMDB says “A trio of classy card sharps targets the socially awkward heir to brewery millions for his money, until one of them falls in love with him.” Out of curiosity, what is The Lady Eve’s movie ID.
movies %>%
filter(Title == "Lady Eve, The (1941)") %>%
select(MovieID) %>%
distinct()
## # A tibble: 1 x 1
## MovieID
## <int>
## 1 2935
Next, out of movies with a lot of ratings, which has the highest rating? Well, we need to decide what a lot of ratings means. Let’s see how many ratings some of the movies had.
movies %>%
group_by(Title) %>%
summarize(count = n()) %>%
arrange(desc(count))
## # A tibble: 6,200 x 2
## Title count
## <chr> <int>
## 1 Silence of the Lambs, The (1991) 368
## 2 Pulp Fiction (1994) 359
## 3 Forrest Gump (1994) 349
## 4 Shawshank Redemption, The (1994) 344
## 5 Jurassic Park (1993) 338
## 6 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) 323
## 7 Fugitive, The (1993) 305
## 8 Terminator 2: Judgment Day (1991) 296
## 9 Independence Day (a.k.a. ID4) (1996) 295
## 10 Apollo 13 (1995) 291
## # … with 6,190 more rows
If we want a list of the 5 most rated movies, we could do:
movies %>%
group_by(Title) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
head(n = 5)
## # A tibble: 5 x 2
## Title count
## <chr> <int>
## 1 Silence of the Lambs, The (1991) 368
## 2 Pulp Fiction (1994) 359
## 3 Forrest Gump (1994) 349
## 4 Shawshank Redemption, The (1994) 344
## 5 Jurassic Park (1993) 338
So, it seems like 200 ratings could classify as “a lot”. Let’s see which movie with at least 200 ratings has the highest mean rating.
movies %>%
group_by(Title) %>%
summarize(count = n(), meanRating = mean(Rating)) %>%
filter(count > 200) %>%
arrange(desc(meanRating))
## # A tibble: 40 x 3
## Title count meanRating
## <chr> <int> <dbl>
## 1 Shawshank Redemption, The (1994) 344 4.44
## 2 Godfather, The (1972) 228 4.43
## 3 Schindlers List (1993) 285 4.40
## 4 Usual Suspects, The (1995) 253 4.39
## 5 Silence of the Lambs, The (1991) 368 4.29
## 6 Matrix, The (1999) 246 4.26
## 7 American Beauty (1999) 258 4.25
## 8 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) … 323 4.25
## 9 Raiders of the Lost Ark (Indiana Jones and the Raiders… 228 4.23
## 10 Sixth Sense, The (1999) 204 4.20
## # … with 30 more rows
I have to say, that’s not a bad list of movies. If we want to see all of the movies in this list, we can pipe to print
, and specify the number of movies we wish to see. (If you definitely want to see all of the movies, you can pipe to print(n = nrow(.))
)
movies %>%
group_by(Title) %>%
summarize(count = n(), meanRating = mean(Rating)) %>%
filter(count > 200) %>%
arrange(desc(meanRating)) %>%
select(Title, meanRating) %>%
print(n = 40)
## # A tibble: 40 x 2
## Title meanRating
## <chr> <dbl>
## 1 Shawshank Redemption, The (1994) 4.44
## 2 Godfather, The (1972) 4.43
## 3 Schindlers List (1993) 4.40
## 4 Usual Suspects, The (1995) 4.39
## 5 Silence of the Lambs, The (1991) 4.29
## 6 Matrix, The (1999) 4.26
## 7 American Beauty (1999) 4.25
## 8 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) 4.25
## 9 Raiders of the Lost Ark (Indiana Jones and the Raiders of th… 4.23
## 10 Sixth Sense, The (1999) 4.20
## 11 Star Wars: Episode V - The Empire Strikes Back (1980) 4.20
## 12 Fargo (1996) 4.10
## 13 Fugitive, The (1993) 4.09
## 14 Pulp Fiction (1994) 4.08
## 15 Saving Private Ryan (1998) 4.05
## 16 Braveheart (1995) 4.04
## 17 Seven (a.k.a. Se7en) (1995) 4.04
## 18 Star Wars: Episode VI - Return of the Jedi (1983) 4.03
## 19 Forrest Gump (1994) 4.00
## 20 Apollo 13 (1995) 3.95
## 21 Terminator 2: Judgment Day (1991) 3.94
## 22 12 Monkeys (Twelve Monkeys) (1995) 3.93
## 23 Toy Story (1995) 3.88
## 24 Rock, The (1996) 3.81
## 25 Lion King, The (1994) 3.78
## 26 Back to the Future (1985) 3.77
## 27 Dances with Wolves (1990) 3.70
## 28 E.T. the Extra-Terrestrial (1982) 3.70
## 29 Aladdin (1992) 3.67
## 30 Beauty and the Beast (1991) 3.66
## 31 Jurassic Park (1993) 3.58
## 32 Die Hard: With a Vengeance (1995) 3.52
## 33 Speed (1994) 3.50
## 34 Independence Day (a.k.a. ID4) (1996) 3.48
## 35 True Lies (1994) 3.48
## 36 Mission: Impossible (1996) 3.43
## 37 Batman (1989) 3.42
## 38 Stargate (1994) 3.40
## 39 Twister (1996) 3.26
## 40 Ace Ventura: Pet Detective (1994) 3.08
You just got your summer movie watching list. You’re welcome.
Now, why don’t you try to figure out which genre had the highest rating, which movie was rated the most, and the 10 worst rated movies that were rated at least 100 times. The solutions are printed out below, so that you can check your answers.
## # A tibble: 644 x 2
## Genres GenreRating
## <chr> <dbl>
## 1 Comedy|Drama|Fantasy|Musical|Romance 5
## 2 Animation|Fantasy|Horror 4.67
## 3 Animation|Comedy|Fantasy|Musical 4.5
## 4 Animation|Drama 4.5
## 5 Comedy|Drama|Sci-Fi 4.5
## 6 Crime|Drama|Fantasy|Film-Noir|Horror 4.5
## 7 Crime|Drama|War 4.5
## 8 Documentary|Fantasy 4.5
## 9 Drama|Fantasy|Horror|Romance 4.5
## 10 Drama|Horror|Romance 4.5
## # … with 634 more rows
What kind of genre is Comedy|Drama|Fantasy|Musical|Romance? That is crazy. What movie could that be?
movies %>%
filter(Genres == "Comedy|Drama|Fantasy|Musical|Romance") %>%
select(Title) %>%
distinct()
## # A tibble: 1 x 1
## Title
## <chr>
## 1 Saddest Music in the World, The (2003)
The Saddest Music in the World. According to IMDB, “A musical of sorts set in Winnipeg during the Great Depression, where a beer baroness organizes a contest to find the saddest music in the world. Musicians from around the world descend on the city to try and win the $25,000 prize.”
## # A tibble: 6,200 x 2
## Title numRating
## <chr> <int>
## 1 Silence of the Lambs, The (1991) 368
## 2 Pulp Fiction (1994) 359
## 3 Forrest Gump (1994) 349
## 4 Shawshank Redemption, The (1994) 344
## 5 Jurassic Park (1993) 338
## 6 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) 323
## 7 Fugitive, The (1993) 305
## 8 Terminator 2: Judgment Day (1991) 296
## 9 Independence Day (a.k.a. ID4) (1996) 295
## 10 Apollo 13 (1995) 291
## # … with 6,190 more rows
## # A tibble: 195 x 3
## Title MeanRating numRating
## <chr> <dbl> <int>
## 1 Judge Dredd (1995) 2.42 107
## 2 Honey, I Shrunk the Kids (1989) 2.64 105
## 3 Waterworld (1995) 2.72 180
## 4 Ace Ventura: When Nature Calls (1995) 2.73 124
## 5 Blair Witch Project, The (1999) 2.73 125
## 6 Armageddon (1998) 2.77 129
## 7 Batman Forever (1995) 2.89 197
## 8 Addams Family Values (1993) 2.94 115
## 9 Charlies Angels (2000) 2.95 100
## 10 Demolition Man (1993) 3.02 126
## # … with 185 more rows
5.2 Data Manipulation using strings
In this section, we use the stringr
package to do some basic string manipulation on the movies
data set. As far as R is concerned, strings are char
variable types, for example:
my_name <- "Darrin Speegle"
str(my_name)
## chr "Darrin Speegle"
Recall that str
returns the str
ucture of the variable, together with the contents if it is not too long. Even finding out how long a string is can be challenging for newcomers, because length
gives the number of strings in the variable my_name
:
length(my_name)
## [1] 1
One might very reasonably be interested in the number of spaces (a proxy for the number of words), the capital letters (a person’s initials), or in sorting a vector of characters containing names by the last name of the people. All of these tasks (and more) are made easier using the stringr
package.
We will use the functions str_length
, str_count
, str_remove
and str_extract
, str_detect
and str_split
. The function str_length
accepts one argument, a string, which can either be a single string or a vector of strings. It returns the length(s) of the string(s). For example,
stringr::str_length(my_name)
## [1] 14
If we had a vector of two names, then it works like this:
our_names <- c("Darrin Speegle", "Bryan Clair")
str_length(our_names)
## [1] 14 11
Next, we consider the function str_count
, which takes two parameters, string
and pattern
. The first is a string, and the second is a regular expression that you use to indicate the pattern that you are counting. To count the number of spaces, we will pass a single character to the pattern, namely the space character.
stringr::str_count(our_names, " ")
## [1] 1 1
We see that there is one space in each of the two strings in the variable our_names
. Regular expressions are very useful and powerful. If you end up doing much work with text data, you will not regret learning more about them. For example, if you want to match lower case vowels, you could use
str_count(our_names, "[aeiou]")
## [1] 5 3
Note the use of the grouping symbols [
and ]
. If we want to match any lower case letter, we use [a-z]
, any upper case letter is [A-Z]
, and any digit is [0-9]
. So, to count the number of upper case letters in the strings, we would use
str_count(our_names, "[A-Z]")
## [1] 2 2
Next, we examine str_extract
and its sibling function str_extract_all
. Suppose we want to extract the initials of the people in the vector our_names
. Our first thought might be to use str_extract
, which takes two parameters, string
which contains the string(s) we wish to extract from, and pattern
which is the pattern we wish to match, in order to extract.
stringr::str_extract(our_names, "[A-Z]")
## [1] "D" "B"
Note that this only extracts the first occurrence of the pattern in each string. To get all of them, we use str_extract_all
stringr::str_extract_all(our_names, "[A-Z]")
## [[1]]
## [1] "D" "S"
##
## [[2]]
## [1] "B" "C"
Note the format of the output. It is a list of character vectors. Let’s check using str
:
initials <- stringr::str_extract_all(our_names, "[A-Z]")
str(initials)
## List of 2
## $ : chr [1:2] "D" "S"
## $ : chr [1:2] "B" "C"
This is because str_extract_all
doesn’t know how many matches there are going to be in each string, and because it doesn’t know that we are thinking of these as initials and want them all in the same string. That is, we would like a vector that looks like c("DS", "BC")
. After learning about lapply
we could apply str_flatten
to each element of the list, but that seems like a long road to walk for this purpose. Perhaps easier is to do the opposite and remove all of the things that don’t match capital letters! To do this, we use str_remove_all
, which again takes two arguments, string
and pattern
to remove. We need to know that regular expressions allow ^
to indicate a pattern is anything that does not match what is written.
stringr::str_remove_all(our_names, "[^A-Z]")
## [1] "DS" "BC"
Now might be a good time to mention that there is an excellent cheat sheet on the stringr
package which includes basics on regular expressions here.
The last two functions that we are looking at are str_detect
and str_split
. These are most useful within a data analysis flow. The function str_detect
accepts string
and pattern
and returns TRUE
if the pattern is detected, and FALSE
if the pattern is not detected. For example, we could look for the pattern an
in our_names
:
stringr::str_detect(our_names, "an")
## [1] FALSE TRUE
Note that if we had put "an"
inside a bracket like this "[an]"
, then str_detect
would have looked for a pattern consisting of either an a
or an n
, and we would have received two TRUE
values. Finally, suppose we want to split the strings into first name and last name. We use str_split
or its sibling function str_split_fixed
. The function str_split
takes two arguments, a string and a pattern on which to split the string. Every time the function sees the pattern, it splits off a piece of the string. For example,
stringr::str_split(our_names, " ")
## [[1]]
## [1] "Darrin" "Speegle"
##
## [[2]]
## [1] "Bryan" "Clair"
Note that once again, the function returns a list of vectors of strings. If we know that we only want to split into at most 3 groups, say, then we can use str_split_fixed
to indicate that. The first two arguments are the same, but str_split_fixed
has a third argument, n
which is the number of strings that we want to split each string into.
stringr::str_split_fixed(our_names, " ", n = 3)
## [,1] [,2] [,3]
## [1,] "Darrin" "Speegle" ""
## [2,] "Bryan" "Clair" ""
Now, let’s apply this to the movies
data set. In particular, let’s consider the Genres
variable. What are the highest rated movies that have been rated at least 50 times and include “Comedy” in its list of Genres?
movies %>%
filter(str_detect(Genres, "Comedy")) %>%
group_by(Title) %>%
summarize(rating = mean(Rating), count = n()) %>%
filter(count >= 50) %>%
arrange(desc(rating)) %>%
print(n = 10)
## # A tibble: 207 x 3
## Title rating count
## <chr> <dbl> <int>
## 1 Wallace & Gromit: A Close Shave (1995) 4.42 83
## 2 Sting, The (1973) 4.34 89
## 3 Amelie (Fabuleux destin dAmélie Poulain, Le) (2001) 4.30 98
## 4 Wallace & Gromit: The Wrong Trousers (1993) 4.28 99
## 5 Wallace & Gromit: The Best of Aardman Animation (1996) 4.28 57
## 6 Dr. Strangelove or: How I Learned to Stop Worrying and Lov… 4.26 132
## 7 Postman, The (Postino, Il) (1994) 4.23 75
## 8 Life Is Beautiful (La Vita è bella) (1997) 4.23 111
## 9 One Flew Over the Cuckoos Nest (1975) 4.23 142
## 10 Cinema Paradiso (Nuovo cinema Paradiso) (1989) 4.23 59
## # … with 197 more rows
If we look at the movies
data set, we can see a couple of things that could be improved. The organization of the Genres
variable is not great, and makes it hard to work with. Also, the year that the movie was released is hidden inside of the title of the movie. Let’s pull out the year of release and move it into its own variable. Looking at the stringr cheat sheet, I see that we can anchor a pattern at the end of a string. It seems like the years are all at the end of a string, so let’s extract 4 digits, followed by a right parenthesis, at the end of a string. The pattern for this is “[0-9]{4}\)$”. The {4} indicates that we are matching the digit pattern four times, we have to excape the right parenthesis using \\
because )
is a reserved character in regular expressions, and the $
indicates that we are anchoring this to the end of the string. Let’s test it out:
str_extract(movies$Title, "[0-9]{4}\\)$") %>%
head()
## [1] "1995)" "1995)" "1995)" "1995)" "1995)" "1995)"
Hmmm, even better would be to get rid of the right parenthesis. We can use the look-ahead to only pull out 4 digits if they are followed by a right parenthesis and then the end of string.
str_extract(movies$Title, "[0-9]{4}(?=\\)$)") %>%
head()
## [1] "1995" "1995" "1995" "1995" "1995" "1995"
Sweet! Let’s convert to numeric and do a histogram, just to make sure no weird stuff snuck it.
str_extract(movies$Title, "[0-9]{4}(?=\\)$)") %>%
as.numeric() %>%
hist()
Looks pretty good. Let’s create a new variable called Year
that contains the year of release of the movie.
movies$Year <- str_extract(movies$Title, "[0-9]{4}(?=\\)$)") %>%
as.numeric()
summary(movies$Year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1915 1987 1994 1990 1998 2008
Now, what were the five highest rated movies of 1999 that had at least 50 ratings?
movies %>%
filter(Year == 1999) %>%
group_by(Title) %>%
summarize(mean = mean(Rating), count = n()) %>%
filter(count >= 50) %>%
arrange(desc(mean)) %>%
print(n = 5)
## # A tibble: 36 x 3
## Title mean count
## <chr> <dbl> <int>
## 1 Fight Club (1999) 4.30 177
## 2 Matrix, The (1999) 4.26 246
## 3 American Beauty (1999) 4.25 258
## 4 Sixth Sense, The (1999) 4.20 204
## 5 Green Mile, The (1999) 4.01 101
## # … with 31 more rows
5.3 (Optional) Using join to merge data frames
In this section, we briefly introduce the join
family of functions, by focusing on left_join
. The function left_join
joins two data frames together into one. The syntax is left_join(x, y, by = NULL)
, where x
and y
are data frames and by
is a list of columns that you want to join the data frames by (there are additional arguments that we won’t be using). The way I remember what left_join
does is that it adds the columns of y
to the columns of x
by matching on columns with the same names. Let’s look at some easy examples, to see what it is doing:
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
left_join(band_members, band_instruments)
## Joining, by = "name"
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
Now, what happens if Paul can learns how to play the drums as well?
band_instruments
## # A tibble: 4 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
## 4 Paul drums
left_join(band_members, band_instruments)
## Joining, by = "name"
## # A tibble: 4 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Paul Beatles drums
We see that if there are multiple matches to name
, then left_join
makes multiple rows in the new data frame. Good, now, when would you use this? Well, many times you have data from multiple sources that you want to combine into one data frame. For example, consider the data sets 2012_alabama_election
, which you can download via and unemp
el <- read.csv("http://stat.slu.edu/~speegle/_book_data/2012_alabama_election.csv", stringsAsFactors = FALSE)
unemp <- read.csv("http://stat.slu.edu/~speegle/_book_data/unemp.csv", stringsAsFactors = FALSE)
(Note: unemp
is from the mapproj
package. You can also install that package and use data("unemp"))
.) These data sets give the by county election results from 2012 in Alabama, and the population and unemployment rate of all counties in the US. Suppose we want to create a new data frame that contains both the election results and the unemployment data. That is a job for left_join
.
combined_data <- left_join(el, unemp, by = c("FIPS" = "fips"))
Note that the two data frames didn’t have any column names in common, so I had to specify the columns that I wanted to join by.
head(combined_data)
## State.Postal County.Name FIPS Obama.vote X. Romney.vote X..1 pop
## 1 AL Alabama 0 793620 38.4 1252453 60.7 NA
## 2 AL Autauga 1001 6354 26.6 17366 72.6 23288
## 3 AL Baldwin 1003 18329 21.6 65772 77.4 81706
## 4 AL Barbour 1005 5873 51.3 5539 48.3 9703
## 5 AL Bibb 1007 2200 26.2 6131 73.1 8475
## 6 AL Blount 1009 2961 12.3 20741 86.5 25306
## unemp
## 1 NA
## 2 9.7
## 3 9.1
## 4 13.4
## 5 12.1
## 6 9.9
Word to the wise: when dealing with spatial data that is organized by county, it is much easier to use the fips code than the county name, because there are variances in capitalization and other issues that come up when trying to use county names. We will return to this combined data set in order to visualize it in Section @ref(#examples-ggplot)
5.4 Return to Batting exercise from Chapter Two
When we first considered the Batting
data set in the Lahman
library, we ignored the stint
variable. When a player played for more than one team in a season, the statistics of the player for each team is listed on separate lines. So, in chapter two, when you were asked to find the player that hit the most triples in one season, you should have added up all of the triples that each player hit for the various teams. We didn’t have the tools to do that at the time, but now we can.
Let’s begin by creating a data frame that has players’ ID, year ID and the number of triples that they hit in that year.
library(Lahman)
Batting %>%
group_by(playerID, yearID) %>%
summarize(X3B = sum(X3B))
## # A tibble: 97,992 x 3
## # Groups: playerID [19,428]
## playerID yearID X3B
## <chr> <int> <int>
## 1 aardsda01 2004 0
## 2 aardsda01 2006 0
## 3 aardsda01 2007 0
## 4 aardsda01 2008 0
## 5 aardsda01 2009 0
## 6 aardsda01 2010 0
## 7 aardsda01 2012 0
## 8 aardsda01 2013 0
## 9 aardsda01 2015 0
## 10 aaronha01 1954 6
## # … with 97,982 more rows
Now, we just need to find the largest number of triples.
Batting %>%
group_by(playerID, yearID) %>%
summarize(X3B = sum(X3B)) %>%
arrange(desc(X3B))
## # A tibble: 97,992 x 3
## # Groups: playerID [19,428]
## playerID yearID X3B
## <chr> <int> <int>
## 1 wilsoch01 1912 36
## 2 orrda01 1886 31
## 3 reitzhe01 1894 31
## 4 werdepe01 1893 29
## 5 davisha01 1897 28
## 6 thompsa01 1894 28
## 7 davisge01 1893 27
## 8 williji01 1899 27
## 9 crawfsa01 1914 26
## 10 cuyleki01 1925 26
## # … with 97,982 more rows
And, we see that the answer is still John Owen Wilson from 1912, just as it would have been without taking into account stints.
5.5 The apply family (Optional)
The focus of this book is on using dplyr
tools when manipulating data sets. However, there will be times when it will be useful to know the base R tools for doing some of the same tasks. At a minimum, it can be useful to know these tools when reading other people’s code! The apply
family consists of quite a few functions, including apply
, sapply
, lapply
, vapply
, tapply
, mapply
and even replicate
! In this section, we will focus on apply
and sapply
.
All of the functions in the apply
family are implicit loops, like replicate
. The typical usage of apply
is apply(X, MARGIN, FUN)
. The function apply
applies the function FUN
to either the rows or columns of X
, depending on whether MARGIN
is 1 (rows) or 2 (columns). So, the argument X
is typically a matrix or data frame, MARGIN
is either 1 or 2, and FUN
is a function, such as sum
or mean
.
As an example, consider the data set USJudgeRatings
, which contains the ratings of US judges by lawyers on various facets. Suppose you wanted to compute the mean rating of each judge. That is easy, using the apply
function.
head(USJudgeRatings)
## CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
## AARONSON,L.H. 5.7 7.9 7.7 7.3 7.1 7.4 7.1 7.1 7.1 7.0 8.3 7.8
## ALEXANDER,J.M. 6.8 8.9 8.8 8.5 7.8 8.1 8.0 8.0 7.8 7.9 8.5 8.7
## ARMENTANO,A.J. 7.2 8.1 7.8 7.8 7.5 7.6 7.5 7.5 7.3 7.4 7.9 7.8
## BERDON,R.I. 6.8 8.8 8.5 8.8 8.3 8.5 8.7 8.7 8.4 8.5 8.8 8.7
## BRACKEN,J.J. 7.3 6.4 4.3 6.5 6.0 6.2 5.7 5.7 5.1 5.3 5.5 4.8
## BURNS,E.B. 6.2 8.8 8.7 8.5 7.9 8.0 8.1 8.0 8.0 8.0 8.6 8.6
apply(USJudgeRatings, 1, mean)
## AARONSON,L.H. ALEXANDER,J.M. ARMENTANO,A.J. BERDON,R.I.
## 7.291667 8.150000 7.616667 8.458333
## BRACKEN,J.J. BURNS,E.B. CALLAHAN,R.J. COHEN,S.S.
## 5.733333 8.116667 8.858333 5.458333
## DALY,J.J. DANNEHY,J.F. DEAN,H.H. DEVITA,H.J.
## 8.516667 7.891667 7.458333 7.125000
## DRISCOLL,P.J. GRILLO,A.E. HADDEN,W.L.JR. HAMILL,E.C.
## 7.366667 6.683333 7.850000 7.450000
## HEALEY.A.H. HULL,T.C. LEVINE,I. LEVISTER,R.L.
## 6.866667 7.400000 7.808333 6.608333
## MARTIN,L.F. MCGRATH,J.F. MIGNONE,A.F. MISSAL,H.M.
## 7.091667 6.783333 5.841667 7.458333
## MULVEY,H.M. NARUK,H.J. O'BRIEN,F.J. O'SULLIVAN,T.J.
## 8.450000 8.783333 7.941667 8.483333
## PASKEY,L. RUBINOW,J.E. SADEN.G.A. SATANIELLO,A.G.
## 8.066667 8.791667 7.775000 7.800000
## SHEA,D.M. SHEA,J.F.JR. SIDOR,W.J. SPEZIALE,J.A.
## 8.191667 8.500000 5.808333 8.183333
## SPONZO,M.J. STAPLETON,J.F. TESTO,R.J. TIERNEY,W.L.JR.
## 7.841667 7.683333 7.108333 7.983333
## WALL,R.A. WRIGHT,D.B. ZARRILLI,K.J.
## 7.016667 7.941667 7.425000
Then, to pull the highest mean rating:
max(apply(USJudgeRatings, 1, mean))
## [1] 8.858333
which.max(apply(USJudgeRatings,1,mean))
## CALLAHAN,R.J.
## 7
We see that RJ Callahan had the highest rating of 8.858333.
Similarly, it would be easy to compute the mean rating of all of the judges in each category.
apply(USJudgeRatings, 2, mean)
## CONT INTG DMNR DILG CFMG DECI PREP FAMI
## 7.437209 8.020930 7.516279 7.693023 7.479070 7.565116 7.467442 7.488372
## ORAL WRIT PHYS RTEN
## 7.293023 7.383721 7.934884 7.602326
And we see that judges scored highes on the Integrity scale.
The sapply
function has as its typical usage sapply(X, FUN)
, where X
is a vector, and FUN
is the function that we wish to act on each element of X
. The function sapply
then stores the results of the function FUN
in as simplified a version as possible. (This leads sapply
to get a bad name when writing source code, but is very useful when you are doing exploratory analysis.) As a simple example, let’s create a vector containing the numbers 1 through 10, squared.
sapply(X = 1:10, FUN = function(x) x^2)
## [1] 1 4 9 16 25 36 49 64 81 100
Note that sapply
takes each element of the vector X
, squares it, and stores the squared value in a vector. The symbol x
in function(x) x^2
is a dummy variable, and can be replaced by any legal R variable name.
A common usage of sapply
in interactive coding is to take advantage of a known structure of a data frame. For example, suppose we wanted to scale all of the numeric variables in iris
so that they have mean 0 and standard deviation 1. It would be simple enough to just look at iris
to determine which are numeric, but we can automate it for use with other data frames.
is_numeric_col <- sapply(1:ncol(iris), function(x) is.numeric(iris[1,x]))
numeric_col <- which(is_numeric_col)
iris[,numeric_col] <- apply(iris[,numeric_col], 2, function(x) (x - mean(x))/sd(x))
Finally, we note that replicate
is a “convenience wrapper” for a common usage of sapply
, so that replicate(100, {expr})
is equivalent to sapply(1:100, function(x) {expr})
, where in this case, the function inside of sapply
will have no dependence on x
.
5.6 Exercises
- Consider the
mpg
data set in theggplot2
package.- Which car(s) had the highest highway gas mileage? (For the purposes of this question, consider each observation a different car.)
- Compute the mean city mileage for compact cars.
- Compute the mean city mileage for each class of cars, and arrange in decreasing order.
- Which cars have the smallest absolute difference between highway mileage and city mileage? (For the purposes of this question, consider each observation a different “car”.)
- Compute the mean highway mileage for each year, and arrange in decreasing order.
- Consider the
movieLensData
.- What is the movie with the highest mean rating that has been rated at least 30 times?
- Which genre has been rated the most? (For the purpose of this, consider Comedy and Comedy|Romance completely different genres, for example.)
- Which movie in the genre Comedy|Romance that has been rated at least 75 times has the lowest mean rating? Which has the highest mean rating?
- Which movie that has a mean rating of 4 or higher has been rated the most times?
- Which user gave the highest mean ratings?
- Consider the
Batting
data set in theLahman
library. This gives the batting statistics of 101,332 players who have played baseball from 1871 through 2015. Answer the following questions.- Which player has been hit-by-pitch the most number of times?
- How many doubles were hit in 1871?
- Which team has the most number of home runs?
- Which player who has played in at least 500 games has scored the most number of runs per game?
- Which player has the most lifetime at bats without ever having hit a home run?
- Which active player has the most lifetime at bats without ever having hit a home run? (An active player is someone with an entry in the most recent year of the data set).
- More questions on the
Batting
data set. Make sure to take into accountstint
in these problems.- Verify that Curtis Granderson hit the most triples in a single season since 1960.
- In which season did the major league leader in triples have the fewest triples?
- In which season was there the biggest difference between the major league leader in stolen bases (SB) and the player with the second most stolen bases?
- Consider the
Pitching
data in theLahman
data set.- Which pitcher has won (W) the most number of games?
- Which pitcher has lost (L) the most number of games?
- Which pitcher has hit the most opponents with a pitch (HBP)?
- Which year had the most number of complete games (CG)?
- Among pitchers who have won at least 100 games, which has the highest winning percentage? (Winning percentage is wins divided by wins + losses.)
- Among pitchers who have struck out at least 500 batters, which has the highest strikeout to walk ratio? (Strikeout to walk ratio is SO/BB.)
- List the pitchers for the St Louis Cardinals (SLN) in 2006 with at least 30 recorded outs (IPouts), sorted by ERA from lowest to highest.
- This question uses the
DrinksWages
from theHistData
library. This data, gathered by Karl Pearson in 1910, was a survey of people working in various trades (bakers, plumbers, goldbeaters, etc.). The trades are assigned class values of A, B, or C based on required skill. For each trade, he counted the number of workers who drink (drinks
), number of sober workers (sober
), and recorded wage information (wage
). There is also a columnn = drinks + sober
which is the total number of workers surveyed for each trade.- Compute the mean wages for each class, A, B, and C.
- Find the three trades with the highest proportion of drinkers. Consider only trades with 10 or more workers in the survey.
- Consider the dataset
oly12
from theVGAMdata
package (that you will probably need to install). It has individual competitor information from the Summer 2012 London Olympic Games.- According to this data, which country won the most medals? How many did that country win? (You need to sum Gold, Silver, and Bronze)
- Which countries were the heaviest? Compute the mean weight of male athletes for all countries with at least 10 competitors, and report the top three.
- There is a built in data set
sentences
in thestringr
package. All questions below refer to that data set.- What percentage of sentences contain the string “the”?
- What percentage of sentences contain the word “the” (so, either “the” or “The”)?
- What percentage of sentences start with the word “The”?
- Which words are the most common words that a sentence ends with?
Complete the Data Manipulation in R with dplyr datacamp tutorial. Note that only the first set of exercises is free.
Complete the Cleaning Data in R datacamp tutorial. This course uses the
tidyr
package, which is not covered in the textbook. However, it is a good complement to thedplyr
material in this chapter.
5.7 Solutions to Two Data Wrangling Exercises
1(a).
movies %>%
group_by(Title) %>%
summarize(meanRating = mean(Rating), numRating = n()) %>%
filter(numRating >= 30) %>%
arrange(desc(meanRating))
## # A tibble: 874 x 3
## Title meanRating numRating
## <chr> <dbl> <int>
## 1 Shawshank Redemption, The (1994) 4.44 344
## 2 Godfather, The (1972) 4.43 228
## 3 Wallace & Gromit: A Close Shave (1995) 4.42 83
## 4 Schindlers List (1993) 4.40 285
## 5 Usual Suspects, The (1995) 4.39 253
## 6 Godfather: Part II, The (1974) 4.39 148
## 7 Lawrence of Arabia (1962) 4.37 74
## 8 Big Night (1996) 4.36 33
## 9 City of God (Cidade de Deus) (2002) 4.34 44
## 10 Sting, The (1973) 4.34 89
## # … with 864 more rows
1(b).
movies %>%
group_by(Genres) %>%
summarize(numRatings = n()) %>%
arrange(desc(numRatings))
## # A tibble: 644 x 2
## Genres numRatings
## <chr> <int>
## 1 Drama 8141
## 2 Comedy 7801
## 3 Comedy|Romance 3983
## 4 Comedy|Drama 3550
## 5 Comedy|Drama|Romance 2942
## 6 Drama|Romance 2775
## 7 Action|Adventure|Sci-Fi 2604
## 8 Action|Adventure|Thriller 1763
## 9 Drama|Thriller 1627
## 10 Crime|Drama 1566
## # … with 634 more rows
Rating should almost certainly have been reclassified as an ordered factor, and it is not clear that taking the mean is really a valid thing to do.↩