Chapter 5 Data Manipulation

In this chapter we introduce the tidyverse. The tidyverse consists of about a dozen R packages designed to work together, organized around common principles. The central organizing principle is that data should be tidy: Data should be rectangular, each row should correspond to one observation, and each column should corresponds to one observed variable. Data should not be stored in the names of variables.

The primary tidyverse tool we will use is the dplyr library (pronounced “dee - ply - er”). We will introduce stringr for basic string operations. We also introduce data tidying with pivot, and show how to combine multiple data sources with join.

To use these tools, you will need to install them. The simplest method is to install the entire tidyverse package, with

install.packages("tidyverse")

Then you can load all the tidyverse tools at once with

library(tidyverse)

Alternately, you can choose to be more selective and install and load each package as you need them. This chapter requires dplyr:

library(dplyr)

The advantage to using individual packages is that the tidyverse package is large and has many effects you may not understand. Using dplyr by itself keeps change to a minimum, and also helps you to learn the correct location of the tools you learn in this chapter.

5.1 Data frames and tibbles

Data in R is usually stored in data frames, with one row per observation and one column per variable. The tidyverse tools work naturally with data frames, but prefer a new data format called a tibble. Sometimes tidyverse tools will automatically convert data frames into tibbles. Or, you can make the conversion yourself using the as_tibble function:

as_tibble(mtcars)

## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows

As usual, this does not change mtcars into a tibble unless we store the result back into the variable via mtcars <- as_tibble(mtcars). Most of the time, you don’t need to worry about the difference between a data frame and a tibble, but tibbles have a couple of advantages over data frames:

Printing. By now you have probably noticed that if you print a data frame with thousands of rows, you get thousands of rows of output. This is rarely desirable. Printing a tibble will only show you the first 10 rows of data, and will shorten or remove columns if they do not fit in your output window. Printing a tibble also shows the size of the data and they types of the variables automatically, as you can see in the example above.
Type consistency. If you select two columns of a tibble, you get a tibble. If you select one column of a tibble, you get a tibble. This is type consistency. Data frames do not have this behavior, since selecting two columns gives a data frame but selecting one column gives a vector:

mtcars[1:5,c('am','gear','carb')]  # data frame

##                   am gear carb
## Mazda RX4          1    4    4
## Mazda RX4 Wag      1    4    4
## Datsun 710         1    4    1
## Hornet 4 Drive     0    3    1
## Hornet Sportabout  0    3    2

mtcars[1:5,'carb']  # vector

## [1] 4 4 1 1 2

In this chapter, we will primarily use a data set consisting of user generated numerical ratings of movies. This data comes from the web site MovieLens (movielens.org), a non-commercial site for movie recommendations. The data is collected and distributed by GroupLens research of the University of Minnesota and made available for research and educational use.

The MovieLens data contains the variables MovieID, Title, Genres, UserID, Rating and Timestamp. The data set we use in this text is a small piece of the 25 megabyte MovieLens data. This smaller set consists of 100,000 observations taken at a random starting point, and includes the reviews of 746 consecutively numbered users.

Let’s load this data set into a variable we will call movies, and then convert it into a tibble:

movies <- read.csv("http://stat.slu.edu/~speegle/_book_data/movieLensData", as.is = TRUE)
movies <- as_tibble(movies)

Here is how it looks:

movies

## # A tibble: 100,000 x 6
##    MovieID Title         Genres                          UserID Rating Timestamp
##      <int> <chr>         <chr>                            <int>  <dbl>     <int>
##  1       1 Toy Story (1… Adventure|Animation|Children|C…   2303    4      8.58e8
##  2       1 Toy Story (1… Adventure|Animation|Children|C…   2765    5      9.82e8
##  3       1 Toy Story (1… Adventure|Animation|Children|C…   2272    5      8.35e8
##  4       1 Toy Story (1… Adventure|Animation|Children|C…   2736    3      1.19e9
##  5       1 Toy Story (1… Adventure|Animation|Children|C…   2662    2      1.11e9
##  6       1 Toy Story (1… Adventure|Animation|Children|C…   2161    4      9.61e8
##  7       1 Toy Story (1… Adventure|Animation|Children|C…   2494    4      8.50e8
##  8       1 Toy Story (1… Adventure|Animation|Children|C…   2385    4.5    1.08e9
##  9       1 Toy Story (1… Adventure|Animation|Children|C…   2236    4      8.35e8
## 10       1 Toy Story (1… Adventure|Animation|Children|C…   2747    5      9.82e8
## # … with 99,990 more rows

5.2 dplyr verbs

The dplyr library is organized around commands called verbs. Each verb takes a tibble (or data frame) as its first argument and possibly additional arguments.

The first verb we will meet is filter, which forms a new data frame consisting of rows that satisfy certain filtering conditions:

Here we create a new data frame with all 177 reviews of the 1999 film “Fight Club”:

filter(movies, Title == "Fight Club (1999)")

## # A tibble: 177 x 6
##    MovieID Title             Genres                      UserID Rating Timestamp
##      <int> <chr>             <chr>                        <int>  <dbl>     <int>
##  1    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2429    3.5    1.16e9
##  2    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2723    4.5    1.09e9
##  3    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2558    4      1.19e9
##  4    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2414    5      1.19e9
##  5    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2756    5      1.09e9
##  6    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2281    5      1.09e9
##  7    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2186    4      1.22e9
##  8    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2829    4.5    1.13e9
##  9    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2822    4      1.02e9
## 10    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2470    5      1.11e9
## # … with 167 more rows

Observe the use of == for comparison. In R (and most modern programming languages), there are some fundamental comparison operators:

== equal to
!= not equal to
> greater than
< less than
>= greater than or equal to
<= less than or equal to

R has others (such as %in% to tell if a value is in a vector), but these six are the most useful.

Here are two more examples of the filter command. Find all user Ratings of 1 or less:

filter(movies, Rating <= 1)

## # A tibble: 4,815 x 6
##    MovieID Title            Genres                       UserID Rating Timestamp
##      <int> <chr>            <chr>                         <int>  <dbl>     <int>
##  1       1 Toy Story (1995) Adventure|Animation|Childre…   2364    1      9.49e8
##  2       1 Toy Story (1995) Adventure|Animation|Childre…   2821    1      9.46e8
##  3       1 Toy Story (1995) Adventure|Animation|Childre…   2549    1      8.35e8
##  4       2 Jumanji (1995)   Adventure|Children|Fantasy     2335    1      9.61e8
##  5       2 Jumanji (1995)   Adventure|Children|Fantasy     2304    1      9.13e8
##  6       2 Jumanji (1995)   Adventure|Children|Fantasy     2879    1      1.11e9
##  7       2 Jumanji (1995)   Adventure|Children|Fantasy     2425    1      9.62e8
##  8       2 Jumanji (1995)   Adventure|Children|Fantasy     2342    1      1.11e9
##  9       3 Grumpier Old Me… Comedy|Romance                 2672    0.5    1.11e9
## 10       3 Grumpier Old Me… Comedy|Romance                 2596    1      1.00e9
## # … with 4,805 more rows

All reviews of 1 or less for Fight Club:

filter(movies, Title == "Fight Club (1999)", Rating <= 1)

## # A tibble: 1 x 6
##   MovieID Title             Genres                      UserID Rating Timestamp
##     <int> <chr>             <chr>                        <int>  <dbl>     <int>
## 1    2959 Fight Club (1999) Action|Crime|Drama|Thriller   2468      1 982046191

It turns out there is only one user in this data set who really disliked Fight Club!

We will use the following verbs regularly when working with data:

filter() forms a new data frame consisting of rows that satisfy certain filtering conditions.
select() forms a new data frame with selected columns.
arrange() forms a new data frame with row(s) arranged in a specified order.
top_n() filters the top n rows according to some ranking.
summarize() summarizes a data frame into a single row.
distinct() collapses identical data to produces a single row for each distinct value
mutate() creates new variables by computation.

Example Applying dplyr verbs to the MovieLens data.

We do not reproduce the output of these commands, so you should try them yourself to see exactly what they do:

Select the columns Rating and MovieID (in that order):
select(movies, Rating, MovieID)
Arrange the ratings by the date they were reviewed:
arrange(movies, Timestamp)
Arrange the ratings in descending order of Rating:
arrange(movies, desc(Rating))
Find the last five reviews (by Timestamp) in the data:
top_n(movies, n=5, Timestamp)
Filter to find all the half-star ratings:
filter(movies, Rating != round(Rating))
Summarize by finding the mean of all Ratings in the data set:
summarize(movies, mean(Rating))
Form a data frame consisting of the unique User ID’s.
distinct(movies, UserID)
Mutate the timestamp to a human readable date and time in the new variable called when:
mutate(movies, when = lubridate::as_datetime(Timestamp))
Note that this uses a function from lubridate library, part of the tidyverse.

5.3 dplyr pipelines

Verb commands are simple, but are designed to be used together to perform more complicated operations. The pipe operator is the dplyr method for combining verbs. Pipe is the three-character symbol %>%, which you can also type using the three key combination ctrl-shift-m. Pipe works by taking the value produced on its left and feeding it as the first argument of the function on its right.

For example, we can sort ratings of the 1958 movie “Vertigo” by Timestamp using arrange and filter together.

movies %>%
  filter(Title == "Vertigo (1958)") %>%
  arrange(Timestamp)

## # A tibble: 75 x 6
##    MovieID Title          Genres                         UserID Rating Timestamp
##      <int> <chr>          <chr>                           <int>  <dbl>     <int>
##  1     903 Vertigo (1958) Drama|Mystery|Romance|Thriller   2581      5 849795136
##  2     903 Vertigo (1958) Drama|Mystery|Romance|Thriller   2332      4 854533817
##  3     903 Vertigo (1958) Drama|Mystery|Romance|Thriller   2739      3 858226200
##  4     903 Vertigo (1958) Drama|Mystery|Romance|Thriller   2274      4 913332229
##  5     903 Vertigo (1958) Drama|Mystery|Romance|Thriller   2373      4 913371154
##  6     903 Vertigo (1958) Drama|Mystery|Romance|Thriller   2424      5 913384905
##  7     903 Vertigo (1958) Drama|Mystery|Romance|Thriller   2703      5 913430752
##  8     903 Vertigo (1958) Drama|Mystery|Romance|Thriller   2526      4 941832384
##  9     903 Vertigo (1958) Drama|Mystery|Romance|Thriller   2466      5 941832691
## 10     903 Vertigo (1958) Drama|Mystery|Romance|Thriller   2534      4 941835354
## # … with 65 more rows

This has the same effect as the harder to read command:

arrange(filter(movies,Title == "Vertigo (1958)"),Timestamp)

With pipelines, we imagine the data flowing into the pipe (movies) then passing through the verbs in sequence, first being filtered and then being arranged. Pipelines also make it natural to break up long commands into multiple lines after each pipe operator, although this is not required.

Example Find the mean¹⁰ rating of Toy Story, which has MovieID==1.

movies %>%
  filter(MovieID == 1) %>%
  summarize(mean(Rating))

## # A tibble: 1 x 1
##   `mean(Rating)`
##            <dbl>
## 1           3.88

The filter() command creates a data frame that consists solely of the observations of Toy Story, and summarize() computes the mean rating.

Pipelines can be used with any R function, not just dplyr verbs. One handy trick, especially if your data is not a tibble, is to pipe into head:

movies %>%
  filter(Rating == 5) %>%
  select(Title) %>%
  distinct() %>%
  head(n=5)

## # A tibble: 5 x 1
##   Title                             
##   <chr>                             
## 1 Toy Story (1995)                  
## 2 Jumanji (1995)                    
## 3 Grumpier Old Men (1995)           
## 4 Waiting to Exhale (1995)          
## 5 Father of the Bride Part II (1995)

This pipeline found all movies which someone rated 5, selected only the Title variable, and then used head to show only the first five distinct titles.

5.3.1 Group by and summarize

The MovieLens data has one observation for each user review. However, we are often interested in working with the movies themselves. The tool to do this is the dplyr verb group_by, which groups data to perform tasks by groups. By itself, group_by has little effect - in fact, the only visual indication that it has done anything is a line at the top of the tibble output noting that there are Groups.

Here is a simple example with the chickwts data set, which gives the weights of chickens eating different types of feed supplement:

chickwts %>% group_by(feed)

## # A tibble: 71 x 2
## # Groups:   feed [6]
##    weight feed     
##     <dbl> <fct>    
##  1    179 horsebean
##  2    160 horsebean
##  3    136 horsebean
##  4    227 horsebean
##  5    217 horsebean
##  6    168 horsebean
##  7    108 horsebean
##  8    124 horsebean
##  9    143 horsebean
## 10    140 horsebean
## # … with 61 more rows

In essentially all uses, we follow group_by with another operation on the groups, most commonly summarize:

chickwts %>%
  group_by(feed) %>%
  summarize(mean(weight))

## # A tibble: 6 x 2
##   feed      `mean(weight)`
##   <fct>              <dbl>
## 1 casein              324.
## 2 horsebean           160.
## 3 linseed             219.
## 4 meatmeal            277.
## 5 soybean             246.
## 6 sunflower           329.

This produced a new tibble with one row for each group, and created the new variable with the awkward name mean(weight) which records the mean weight for each group. We could give that variable an easier name to type by assigning it as part of the summarize operation: summarize(mw = mean(weight))

Here is another example, showing that we can produce multiple new summary variables. It also illustrates the special function n(), which gives the number of rows in each group. The chickwts experiment had different numbers of chickens in each group:

chickwts %>%
  group_by(feed) %>%
  summarize(mw = mean(weight), count = n())

## # A tibble: 6 x 3
##   feed         mw count
##   <fct>     <dbl> <int>
## 1 casein     324.    12
## 2 horsebean  160.    10
## 3 linseed    219.    12
## 4 meatmeal   277.    11
## 5 soybean    246.    14
## 6 sunflower  329.    12

5.4 The power of dplyr

With a small collection of verbs and the pipe operator, you are well equipped to perform complicated data analysis. This section shows some of the techniques you can use to learn answers from data.

Example Create a data frame consisting of the observations associated with movies that have an average rating of 5 stars. That is, each rating of the movie was 5 stars.

In order to do this, we will use the group_by() function to find the mean rating of each movie.

movies %>% 
  group_by(Title) %>%
  summarize(Rating = mean(Rating))

## # A tibble: 6,200 x 2
##    Title                                                   Rating
##    <chr>                                                    <dbl>
##  1 ...All the Marbles (a.k.a. The California Dolls) (1981)   3.5 
##  2 ...And God Spoke (1993)                                   4   
##  3 ...And Justice for All (1979)                             3.6 
##  4 *batteries not included (1987)                            3.12
##  5 10 (1979)                                                 3.5 
##  6 10 Items or Less (2006)                                   3   
##  7 10 Rillington Place (1971)                                4.25
##  8 10 Things I Hate About You (1999)                         3.45
##  9 10 to Midnight (1983)                                     2.5 
## 10 101 Dalmatians (1996)                                     3.19
## # … with 6,190 more rows

Now, we could sort those in decreasing order to see ones that have a mean rating of 5.

movies %>% 
  group_by(Title) %>%
  summarize(Rating = mean(Rating)) %>%
  arrange(desc(Rating))

## # A tibble: 6,200 x 2
##    Title                                                Rating
##    <chr>                                                 <dbl>
##  1 49 Up (2005)                                              5
##  2 5,000 Fingers of Dr. T, The (1953)                        5
##  3 Anastasia (1956)                                          5
##  4 Ay, Carmela! (¡Ay, Carmela!) (1990)                       5
##  5 Ballad of Narayama, The (Narayama bushiko) (1983)         5
##  6 Band of the Hand (1986)                                   5
##  7 Beerfest (2006)                                           5
##  8 Big Clock, The (1948)                                     5
##  9 Boss of It All, The (Direktøren for det hele) (2006)      5
## 10 Boys (1996)                                               5
## # … with 6,190 more rows

Now, we can filter out those whose mean rating is 5. Note, it is not necessary to sort first.

movies %>% 
  group_by(Title) %>%
  summarize(Rating = mean(Rating)) %>%
  filter(Rating == 5)

## # A tibble: 113 x 2
##    Title                                                Rating
##    <chr>                                                 <dbl>
##  1 49 Up (2005)                                              5
##  2 5,000 Fingers of Dr. T, The (1953)                        5
##  3 Anastasia (1956)                                          5
##  4 Ay, Carmela! (¡Ay, Carmela!) (1990)                       5
##  5 Ballad of Narayama, The (Narayama bushiko) (1983)         5
##  6 Band of the Hand (1986)                                   5
##  7 Beerfest (2006)                                           5
##  8 Big Clock, The (1948)                                     5
##  9 Boss of It All, The (Direktøren for det hele) (2006)      5
## 10 Boys (1996)                                               5
## # … with 103 more rows

And we see that there are 113 movies which have mean rating of 5.

Example Find the lowest rated movie.

movies %>%
  group_by(Title) %>%
  summarize(mr = mean(Rating)) %>%
  arrange(mr)

## # A tibble: 6,200 x 2
##    Title                                            mr
##    <chr>                                         <dbl>
##  1 Avventura, L (a.k.a. Adventure, The) (1960)     0.5
##  2 Big Combo, The (1955)                           0.5
##  3 BloodRayne (2005)                               0.5
##  4 Bratz: The Movie (2007)                         0.5
##  5 Care Bears Movie II: A New Generation (1986)    0.5
##  6 Chaos (2001)                                    0.5
##  7 Chinese Ghost Story, A (Sinnui yauwan) (1987)   0.5
##  8 Devil Doll (1964)                               0.5
##  9 Diary of a Mad Black Woman (2005)               0.5
## 10 Gate II: Trespassers, The (1990)                0.5
## # … with 6,190 more rows

Yikes! 0.5 out of 5 stars!

To see which movie that received 5 stars has the most ratings, we would do:

movies %>%
  group_by(Title) %>%
  summarize(mr = mean(Rating), numRating = n()) %>%
  arrange(desc(mr), desc(numRating))

## # A tibble: 6,200 x 3
##    Title                                       mr numRating
##    <chr>                                    <dbl>     <int>
##  1 Lady Eve, The (1941)                         5         4
##  2 Frances (1982)                               5         3
##  3 Boys (1996)                                  5         2
##  4 Denise Calls Up (1995)                       5         2
##  5 Manny & Lo (1996)                            5         2
##  6 Pier, The (La Jetée) (1962)                  5         2
##  7 Tin Drum, The (Blechtrommel, Die) (1979)     5         2
##  8 Tinker, Tailor, Soldier, Spy (1979)          5         2
##  9 Trouble in Paradise (1932)                   5         2
## 10 49 Up (2005)                                 5         1
## # … with 6,190 more rows

And, we see that The Lady Eve had 4 ratings. Never heard of it. If you are interested, IMDB says “A trio of classy card sharps targets the socially awkward heir to brewery millions for his money, until one of them falls in love with him.” Out of curiosity, what is The Lady Eve’s movie ID?

movies %>%
  filter(Title == "Lady Eve, The (1941)") %>%
  select(MovieID) %>%
  distinct()

## # A tibble: 1 x 1
##   MovieID
##     <int>
## 1    2935

Next, out of movies with a lot of ratings, which has the highest rating? Well, we need to decide what a lot of ratings means. Let’s see how many ratings some of the movies had.

movies %>%
  group_by(Title) %>%
  summarize(count = n()) %>%
  arrange(desc(count))

## # A tibble: 6,200 x 2
##    Title                                                        count
##    <chr>                                                        <int>
##  1 Silence of the Lambs, The (1991)                               368
##  2 Pulp Fiction (1994)                                            359
##  3 Forrest Gump (1994)                                            349
##  4 Shawshank Redemption, The (1994)                               344
##  5 Jurassic Park (1993)                                           338
##  6 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)   323
##  7 Fugitive, The (1993)                                           305
##  8 Terminator 2: Judgment Day (1991)                              296
##  9 Independence Day (a.k.a. ID4) (1996)                           295
## 10 Apollo 13 (1995)                                               291
## # … with 6,190 more rows

If we want a list of the 5 most rated movies, we could do:

movies %>%
  group_by(Title) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  head(n = 5)

## # A tibble: 5 x 2
##   Title                            count
##   <chr>                            <int>
## 1 Silence of the Lambs, The (1991)   368
## 2 Pulp Fiction (1994)                359
## 3 Forrest Gump (1994)                349
## 4 Shawshank Redemption, The (1994)   344
## 5 Jurassic Park (1993)               338

So, it seems like 200 ratings could classify as “a lot”. Let’s see which movie with at least 200 ratings has the highest mean rating.

movies %>%
  group_by(Title) %>%
  summarize(count = n(), meanRating = mean(Rating)) %>%
  filter(count > 200) %>%
  arrange(desc(meanRating))

## # A tibble: 40 x 3
##    Title                                                        count meanRating
##    <chr>                                                        <int>      <dbl>
##  1 Shawshank Redemption, The (1994)                               344       4.44
##  2 Godfather, The (1972)                                          228       4.43
##  3 Schindlers List (1993)                                         285       4.40
##  4 Usual Suspects, The (1995)                                     253       4.39
##  5 Silence of the Lambs, The (1991)                               368       4.29
##  6 Matrix, The (1999)                                             246       4.26
##  7 American Beauty (1999)                                         258       4.25
##  8 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)   323       4.25
##  9 Raiders of the Lost Ark (Indiana Jones and the Raiders of t…   228       4.23
## 10 Sixth Sense, The (1999)                                        204       4.20
## # … with 30 more rows

I have to say, that’s not a bad list of movies. If we want to see all of the movies in this list, we can pipe to print, and specify the number of movies we wish to see. (If you definitely want to see all of the movies, you can pipe to print(n = nrow(.)))

movies %>%
  group_by(Title) %>%
  summarize(count = n(), meanRating = mean(Rating)) %>%
  filter(count > 200) %>%
  arrange(desc(meanRating)) %>%
  select(Title, meanRating) %>% 
  print(n = 40)

## # A tibble: 40 x 2
##    Title                                                              meanRating
##    <chr>                                                                   <dbl>
##  1 Shawshank Redemption, The (1994)                                         4.44
##  2 Godfather, The (1972)                                                    4.43
##  3 Schindlers List (1993)                                                   4.40
##  4 Usual Suspects, The (1995)                                               4.39
##  5 Silence of the Lambs, The (1991)                                         4.29
##  6 Matrix, The (1999)                                                       4.26
##  7 American Beauty (1999)                                                   4.25
##  8 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)             4.25
##  9 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Los…       4.23
## 10 Sixth Sense, The (1999)                                                  4.20
## 11 Star Wars: Episode V - The Empire Strikes Back (1980)                    4.20
## 12 Fargo (1996)                                                             4.10
## 13 Fugitive, The (1993)                                                     4.09
## 14 Pulp Fiction (1994)                                                      4.08
## 15 Saving Private Ryan (1998)                                               4.05
## 16 Braveheart (1995)                                                        4.04
## 17 Seven (a.k.a. Se7en) (1995)                                              4.04
## 18 Star Wars: Episode VI - Return of the Jedi (1983)                        4.03
## 19 Forrest Gump (1994)                                                      4.00
## 20 Apollo 13 (1995)                                                         3.95
## 21 Terminator 2: Judgment Day (1991)                                        3.94
## 22 12 Monkeys (Twelve Monkeys) (1995)                                       3.93
## 23 Toy Story (1995)                                                         3.88
## 24 Rock, The (1996)                                                         3.81
## 25 Lion King, The (1994)                                                    3.78
## 26 Back to the Future (1985)                                                3.77
## 27 Dances with Wolves (1990)                                                3.70
## 28 E.T. the Extra-Terrestrial (1982)                                        3.70
## 29 Aladdin (1992)                                                           3.67
## 30 Beauty and the Beast (1991)                                              3.66
## 31 Jurassic Park (1993)                                                     3.58
## 32 Die Hard: With a Vengeance (1995)                                        3.52
## 33 Speed (1994)                                                             3.50
## 34 Independence Day (a.k.a. ID4) (1996)                                     3.48
## 35 True Lies (1994)                                                         3.48
## 36 Mission: Impossible (1996)                                               3.43
## 37 Batman (1989)                                                            3.42
## 38 Stargate (1994)                                                          3.40
## 39 Twister (1996)                                                           3.26
## 40 Ace Ventura: Pet Detective (1994)                                        3.08

You just got your summer movie watching list. You’re welcome.

Let’s figure out which genre had the highest rating, which movie was rated the most, and the 10 worst rated movies that were rated at least 100 times, Try it yourself and then continue reading to see how.

The highest rated Genre:

## # A tibble: 644 x 2
##    Genres                               GenreRating
##    <chr>                                      <dbl>
##  1 Comedy|Drama|Fantasy|Musical|Romance        5   
##  2 Animation|Fantasy|Horror                    4.67
##  3 Animation|Comedy|Fantasy|Musical            4.5 
##  4 Animation|Drama                             4.5 
##  5 Comedy|Drama|Sci-Fi                         4.5 
##  6 Crime|Drama|Fantasy|Film-Noir|Horror        4.5 
##  7 Crime|Drama|War                             4.5 
##  8 Documentary|Fantasy                         4.5 
##  9 Drama|Fantasy|Horror|Romance                4.5 
## 10 Drama|Horror|Romance                        4.5 
## # … with 634 more rows

What kind of genre is Comedy|Drama|Fantasy|Musical|Romance? That is crazy. What movie could that be?

movies %>%
  filter(Genres == "Comedy|Drama|Fantasy|Musical|Romance") %>%
  select(Title) %>%
  distinct()

## # A tibble: 1 x 1
##   Title                                 
##   <chr>                                 
## 1 Saddest Music in the World, The (2003)

The Saddest Music in the World. According to IMDB, “A musical of sorts set in Winnipeg during the Great Depression, where a beer baroness organizes a contest to find the saddest music in the world. Musicians from around the world descend on the city to try and win the $25,000 prize.”

The movie that was rated the most:

## # A tibble: 6,200 x 2
##    Title                                                        numRating
##    <chr>                                                            <int>
##  1 Silence of the Lambs, The (1991)                                   368
##  2 Pulp Fiction (1994)                                                359
##  3 Forrest Gump (1994)                                                349
##  4 Shawshank Redemption, The (1994)                                   344
##  5 Jurassic Park (1993)                                               338
##  6 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)       323
##  7 Fugitive, The (1993)                                               305
##  8 Terminator 2: Judgment Day (1991)                                  296
##  9 Independence Day (a.k.a. ID4) (1996)                               295
## 10 Apollo 13 (1995)                                                   291
## # … with 6,190 more rows

The 10 worst rated movies that were rated at least 100 times:

## # A tibble: 195 x 3
##    Title                                 MeanRating numRating
##    <chr>                                      <dbl>     <int>
##  1 Judge Dredd (1995)                          2.42       107
##  2 Honey, I Shrunk the Kids (1989)             2.64       105
##  3 Waterworld (1995)                           2.72       180
##  4 Ace Ventura: When Nature Calls (1995)       2.73       124
##  5 Blair Witch Project, The (1999)             2.73       125
##  6 Armageddon (1998)                           2.77       129
##  7 Batman Forever (1995)                       2.89       197
##  8 Addams Family Values (1993)                 2.94       115
##  9 Charlies Angels (2000)                      2.95       100
## 10 Demolition Man (1993)                       3.02       126
## # … with 185 more rows

Example

When we first considered the Batting data set in the Lahman library, we ignored the stint variable. When a player played for more than one team in a season, the statistics of the player for each team is listed on separate lines. So, in chapter 2, when you were asked to find the player that hit the most triples in one season, you should have added up all of the triples that each player hit for the various teams. We didn’t have the tools to do that at the time, but now we can.

Let’s begin by creating a data frame that has players’ ID, year ID and the number of triples that they hit in that year.

library(Lahman)
Batting %>%
  group_by(playerID, yearID) %>%
  summarize(X3B = sum(X3B))

## # A tibble: 97,992 x 3
## # Groups:   playerID [19,428]
##    playerID  yearID   X3B
##    <chr>      <int> <int>
##  1 aardsda01   2004     0
##  2 aardsda01   2006     0
##  3 aardsda01   2007     0
##  4 aardsda01   2008     0
##  5 aardsda01   2009     0
##  6 aardsda01   2010     0
##  7 aardsda01   2012     0
##  8 aardsda01   2013     0
##  9 aardsda01   2015     0
## 10 aaronha01   1954     6
## # … with 97,982 more rows

Now, we just need to find the largest number of triples.

Batting %>%
  group_by(playerID, yearID) %>%
  summarize(X3B = sum(X3B)) %>%
  arrange(desc(X3B))

## # A tibble: 97,992 x 3
## # Groups:   playerID [19,428]
##    playerID  yearID   X3B
##    <chr>      <int> <int>
##  1 wilsoch01   1912    36
##  2 orrda01     1886    31
##  3 reitzhe01   1894    31
##  4 werdepe01   1893    29
##  5 davisha01   1897    28
##  6 thompsa01   1894    28
##  7 davisge01   1893    27
##  8 williji01   1899    27
##  9 crawfsa01   1914    26
## 10 cuyleki01   1925    26
## # … with 97,982 more rows

And, we see that the answer is still John Owen Wilson from 1912, just as it would have been without taking into account stints.

To work with career records in the Batting data set, we group by the playerID variable. Here is a list of the top five career home run leaders:

Batting %>%
  group_by(playerID) %>%
  summarize(HR = sum(HR)) %>%
  top_n(n=3, HR) %>%
  arrange(desc(HR))

## # A tibble: 3 x 2
##   playerID     HR
##   <chr>     <int>
## 1 bondsba01   762
## 2 aaronha01   755
## 3 ruthba01    714

That’s Barry Bonds, Hank Aaron, and Babe Ruth, if you can’t guess from their IDs.

5.5 Working with character strings

In this section, we use the stringr package to do some basic string manipulation on the movies data set. As far as R is concerned, strings are char variable types, for example:

my_name <- "Darrin Speegle"
str(my_name)

##  chr "Darrin Speegle"

Recall that str returns the structure of the variable, together with the contents if it is not too long. Even finding out how long a string is can be challenging for newcomers, because length gives the number of strings in the variable my_name:

length(my_name)

## [1] 1

One might very reasonably be interested in the number of spaces (a proxy for the number of words), the capital letters (a person’s initials), or in sorting a vector of characters containing names by the last name of the people. All of these tasks (and more) are made easier using the stringr package.

We will use the functions str_length, str_count, str_remove and str_extract, str_detect and str_split. The function str_length accepts one argument, a string, which can either be a single string or a vector of strings. It returns the length(s) of the string(s). For example,

stringr::str_length(my_name)

## [1] 14

If we had a vector of two names, then it works like this:

our_names <- c("Darrin Speegle", "Bryan Clair")
str_length(our_names)

## [1] 14 11

Next, we consider the function str_count, which takes two parameters, string and pattern. The first is a string, and the second is a regular expression that you use to indicate the pattern that you are counting. To count the number of spaces, we will pass a single character to the pattern, namely the space character.

stringr::str_count(our_names, " ")

## [1] 1 1

We see that there is one space in each of the two strings in the variable our_names. Regular expressions are very useful and powerful. If you end up doing much work with text data, you will not regret learning more about them. For example, if you want to match lower case vowels, you could use

str_count(our_names, "[aeiou]")

## [1] 5 3

Note the use of the grouping symbols [ and ]. If we want to match any lower case letter, we use [a-z], any upper case letter is [A-Z], and any digit is [0-9]. So, to count the number of upper case letters in the strings, we would use

str_count(our_names, "[A-Z]")

## [1] 2 2

Next, we examine str_extract and its sibling function str_extract_all. Suppose we want to extract the initials of the people in the vector our_names. Our first thought might be to use str_extract, which takes two parameters, string which contains the string(s) we wish to extract from, and pattern which is the pattern we wish to match, in order to extract.

stringr::str_extract(our_names, "[A-Z]")

## [1] "D" "B"

Note that this only extracts the first occurrence of the pattern in each string. To get all of them, we use str_extract_all

stringr::str_extract_all(our_names, "[A-Z]")

## [[1]]
## [1] "D" "S"
## 
## [[2]]
## [1] "B" "C"

Note the format of the output. It is a list of character vectors. Let’s check using str:

initials <- stringr::str_extract_all(our_names, "[A-Z]")
str(initials)

## List of 2
##  $ : chr [1:2] "D" "S"
##  $ : chr [1:2] "B" "C"

This is because str_extract_all doesn’t know how many matches there are going to be in each string, and because it doesn’t know that we are thinking of these as initials and want them all in the same string. That is, we would like a vector that looks like c("DS", "BC"). After learning about lapply we could apply str_flatten to each element of the list, but that seems like a long road to walk for this purpose. Perhaps easier is to do the opposite and remove all of the things that don’t match capital letters! To do this, we use str_remove_all, which again takes two arguments, string and pattern to remove. We need to know that regular expressions allow ^ to indicate a pattern is anything that does not match what is written.

stringr::str_remove_all(our_names, "[^A-Z]")

## [1] "DS" "BC"

Now might be a good time to mention that there is an excellent cheat sheet on the stringr package which includes basics on regular expressions here.

The last two functions that we are looking at are str_detect and str_split. These are most useful within a data analysis flow. The function str_detect accepts string and pattern and returns TRUE if the pattern is detected, and FALSE if the pattern is not detected. For example, we could look for the pattern an in our_names:

stringr::str_detect(our_names, "an")

## [1] FALSE  TRUE

Note that if we had put "an" inside a bracket like this "[an]", then str_detect would have looked for a pattern consisting of either an a or an n, and we would have received two TRUE values. Finally, suppose we want to split the strings into first name and last name. We use str_split or its sibling function str_split_fixed. The function str_split takes two arguments, a string and a pattern on which to split the string. Every time the function sees the pattern, it splits off a piece of the string. For example,

stringr::str_split(our_names, " ")

## [[1]]
## [1] "Darrin"  "Speegle"
## 
## [[2]]
## [1] "Bryan" "Clair"

Note that once again, the function returns a list of vectors of strings. If we know that we only want to split into at most 3 groups, say, then we can use str_split_fixed to indicate that. The first two arguments are the same, but str_split_fixed has a third argument, n which is the number of strings that we want to split each string into.

stringr::str_split_fixed(our_names, " ", n = 3)

##      [,1]     [,2]      [,3]
## [1,] "Darrin" "Speegle" ""  
## [2,] "Bryan"  "Clair"   ""

Now, let’s apply this to the movies data set. In particular, let’s consider the Genres variable. What are the highest rated movies that have been rated at least 50 times and include “Comedy” in its list of Genres?

movies %>% 
  filter(str_detect(Genres, "Comedy")) %>% 
  group_by(Title) %>% 
  summarize(rating = mean(Rating), count = n()) %>%
  filter(count >= 50) %>% 
  arrange(desc(rating)) %>% 
  print(n = 10)

## # A tibble: 207 x 3
##    Title                                                            rating count
##    <chr>                                                             <dbl> <int>
##  1 Wallace & Gromit: A Close Shave (1995)                             4.42    83
##  2 Sting, The (1973)                                                  4.34    89
##  3 Amelie (Fabuleux destin dAmélie Poulain, Le) (2001)                4.30    98
##  4 Wallace & Gromit: The Wrong Trousers (1993)                        4.28    99
##  5 Wallace & Gromit: The Best of Aardman Animation (1996)             4.28    57
##  6 Dr. Strangelove or: How I Learned to Stop Worrying and Love the…   4.26   132
##  7 Postman, The (Postino, Il) (1994)                                  4.23    75
##  8 Life Is Beautiful (La Vita è bella) (1997)                         4.23   111
##  9 One Flew Over the Cuckoos Nest (1975)                              4.23   142
## 10 Cinema Paradiso (Nuovo cinema Paradiso) (1989)                     4.23    59
## # … with 197 more rows

If we look at the movies data set, we can see a couple of things that could be improved. The organization of the Genres variable is not great, and makes it hard to work with. Also, the year that the movie was released is hidden inside of the title of the movie. Let’s pull out the year of release and move it into its own variable. Looking at the stringr cheat sheet, I see that we can anchor a pattern at the end of a string. It seems like the years are all at the end of a string, so let’s extract 4 digits, followed by a right parenthesis, at the end of a string. The pattern for this is “[0-9]{4}\)$”. The {4} indicates that we are matching the digit pattern four times, we have to escape the right parenthesis using \\ because ) is a reserved character in regular expressions, and the $ indicates that we are anchoring this to the end of the string. Let’s test it out:

str_extract(movies$Title, "[0-9]{4}\\)$") %>% 
  head()

## [1] "1995)" "1995)" "1995)" "1995)" "1995)" "1995)"

Hmmm, even better would be to get rid of the right parenthesis. We can use the look-ahead to only pull out 4 digits if they are followed by a right parenthesis and then the end of string.

str_extract(movies$Title, "[0-9]{4}(?=\\)$)") %>% 
  head()

## [1] "1995" "1995" "1995" "1995" "1995" "1995"

Sweet! Let’s convert to numeric and do a histogram, just to make sure no weird stuff snuck it.

str_extract(movies$Title, "[0-9]{4}(?=\\)$)") %>% 
  as.numeric() %>% 
  hist()

Looks pretty good. Let’s create a new variable called Year that contains the year of release of the movie.

movies$Year <- str_extract(movies$Title, "[0-9]{4}(?=\\)$)") %>% 
  as.numeric()
summary(movies$Year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1915    1987    1994    1990    1998    2008

Now, what were the five highest rated movies of 1999 that had at least 50 ratings?

movies %>% 
  filter(Year == 1999) %>% 
  group_by(Title) %>% 
  summarize(mean = mean(Rating), count = n()) %>% 
  filter(count >= 50) %>% 
  arrange(desc(mean)) %>% 
  print(n = 5)

## # A tibble: 36 x 3
##   Title                    mean count
##   <chr>                   <dbl> <int>
## 1 Fight Club (1999)        4.30   177
## 2 Matrix, The (1999)       4.26   246
## 3 American Beauty (1999)   4.25   258
## 4 Sixth Sense, The (1999)  4.20   204
## 5 Green Mile, The (1999)   4.01   101
## # … with 31 more rows

5.6 The structure of data

5.6.1 Tidy data: pivoting

The tidyverse is designed for tidy data, and a key feature of tidy data is that all data should be stored in a rectangular array, with each row an observation and each column a variable. In particular, no information should be stored in variable names. As an example, consider the built in data set WorldPhones which gives the number of telephones in each region of the world, back in the day when telephones were still gaining popularity as a device for making phone calls.

The WorldPhones data is stored as a matrix with row names, so we convert it to a tibble called phones and preserve the row names in a new variable called Year:

phones <- as_tibble(WorldPhones, rownames = "Year")
phones

## # A tibble: 7 x 8
##   Year  N.Amer Europe  Asia S.Amer Oceania Africa Mid.Amer
##   <chr>  <dbl>  <dbl> <dbl>  <dbl>   <dbl>  <dbl>    <dbl>
## 1 1951   45939  21574  2876   1815    1646     89      555
## 2 1956   60423  29990  4708   2568    2366   1411      733
## 3 1957   64721  32510  5230   2695    2526   1546      773
## 4 1958   68484  35218  6662   2845    2691   1663      836
## 5 1959   71799  37598  6856   3000    2868   1769      911
## 6 1960   76036  40341  8220   3145    3054   1905     1008
## 7 1961   79831  43173  9053   3338    3224   2005     1076

Notice that the column names are regions of the world, so that there is useful information stored in those names.

Every entry in this data set gives the value for a year and a region, so the tidy format should instead have three variables: Year, Region, and Telephones. Making this change will cause this data set to become much longer. Instead of 7 rows and 7 columns, we will have 49 rows, one for each of the $7\times7$ Year and Region combinations. The tool to make this change is the pivot_longer function from the tidyr package (part of the tidyverse):

library(tidyr)
phones %>% pivot_longer(cols = -Year, names_to="Region", values_to="Telephones")

## # A tibble: 49 x 3
##    Year  Region   Telephones
##    <chr> <chr>         <dbl>
##  1 1951  N.Amer        45939
##  2 1951  Europe        21574
##  3 1951  Asia           2876
##  4 1951  S.Amer         1815
##  5 1951  Oceania        1646
##  6 1951  Africa           89
##  7 1951  Mid.Amer        555
##  8 1956  N.Amer        60423
##  9 1956  Europe        29990
## 10 1956  Asia           4708
## # … with 39 more rows

This function used four arguments. The first was the data frame we wanted to pivot, in this case phones. Second, we specified which columns to use, with the expression cols = -Year, which meant “all columns except Year”. Finally, we told the function that the names of the columns should become a new variable called “Region”, and the values in those columns should become a new variable called “Telephones”.

The tidyr library also provides a function pivot_wider which performs the opposite transformation to pivot_longer. We will not have occasion to use pivot_wider in this text.

Example As a more complex example, let’s look at the billboard data set provided with the tidyr package. This contains the weekly Billboard rankings for each song that entered the top 100 during the year 2000. Observe that each row is a song (track) and the track’s place on the charts is stored in up to 76 columns named wk1 through wk76. There are many NA in the data since none of these tracks were ranked for 76 consecutive weeks.

billboard %>% head(n=5)

## # A tibble: 5 x 79
##   artist track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##   <chr>  <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac  Baby… 2000-02-26      87    82    72    77    87    94    99    NA
## 2 2Ge+h… The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
## 3 3 Doo… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
## 4 3 Doo… Loser 2000-10-21      76    76    72    69    67    65    55    59
## 5 504 B… Wobb… 2000-04-15      57    34    25    17    17    31    36    49
## # … with 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>,
## #   wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>, wk53 <dbl>, wk54 <dbl>,
## #   wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>, wk59 <dbl>, wk60 <dbl>,
## #   wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>, wk65 <dbl>, wk66 <lgl>,
## #   wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>, wk71 <lgl>, wk72 <lgl>,
## #   wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>

This data is not tidy, since the week column names contain information. To do any sort of analysis on this data, the week needs to be a variable. We use pivot_longer to replace all 76 week columns with two columns, one called “week” and another called “rank”:

long.bill <- billboard %>%
  pivot_longer(cols = wk1:wk76,
               names_to = "week", values_to = "rank",
               values_drop_na = TRUE)
long.bill

## # A tibble: 5,307 x 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
## # … with 5,297 more rows

The values_drop_na = TRUE argument removes the NA values from the rank column while pivoting.

Which 2000 song spent the most weeks on the charts? This question would have been near impossible to answer without tidying the data. Now we can group by track and count the number of times it was ranked:

long.bill %>%
  group_by(track) %>%
  summarize(weeks_on_chart = n()) %>%
  top_n(n=1, wt=weeks_on_chart)

## # A tibble: 1 x 2
##   track  weeks_on_chart
##   <chr>           <int>
## 1 Higher             57

“Higher”, by Creed, entered the charts in 2000 and spent 57 weeks in the top 100 (though not consecutively!)

5.6.2 Using join to merge data frames

In this section, we briefly introduce the join family of functions, by focusing on left_join. The function left_join joins two data frames together into one. The syntax is left_join(x, y, by = NULL), where x and y are data frames and by is a list of columns that you want to join the data frames by (there are additional arguments that we won’t be using). The way I remember what left_join does is that it adds the columns of y to the columns of x by matching on columns with the same names.

As an easy example, consider the band_members and band_instruments data sets, both of which are included in the dplyr package.

band_members

## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles

band_instruments

## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar

left_join(band_members, band_instruments)

## Joining, by = "name"

## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

Now, what happens if Paul learns how to play the drums as well?

band_instruments

## # A tibble: 4 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
## 4 Paul  drums

left_join(band_members, band_instruments)

## Joining, by = "name"

## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Paul  Beatles drums

We see that if there are multiple matches to name, then left_join makes multiple rows in the new data frame.

Good, now, when would you use this? Well, many times you have data from multiple sources that you want to combine into one data frame. Consider the data sets 2012_alabama_election and unemp, which are available on this book’s website. These data sets give the by county election results from 2012 in Alabama, and the population and unemployment rate of all counties in the US. (Note: unemp is from the mapproj package. You may also install that package and use data("unemp")).)

el <- read.csv("http://stat.slu.edu/~speegle/_book_data/2012_alabama_election.csv", stringsAsFactors = FALSE)
unemp <- read.csv("http://stat.slu.edu/~speegle/_book_data/unemp.csv", stringsAsFactors = FALSE)

Suppose we want to create a new data frame that contains both the election results and the unemployment data. That is a job for left_join.

combined_data <- left_join(el, unemp, by = c("FIPS" = "fips"))

Note that the two data frames didn’t have any column names in common, so I had to specify the columns that I wanted to join by.

head(combined_data)

##   State.Postal County.Name FIPS Obama.vote   X. Romney.vote X..1   pop unemp
## 1           AL     Alabama    0     793620 38.4     1252453 60.7    NA    NA
## 2           AL     Autauga 1001       6354 26.6       17366 72.6 23288   9.7
## 3           AL     Baldwin 1003      18329 21.6       65772 77.4 81706   9.1
## 4           AL     Barbour 1005       5873 51.3        5539 48.3  9703  13.4
## 5           AL        Bibb 1007       2200 26.2        6131 73.1  8475  12.1
## 6           AL      Blount 1009       2961 12.3       20741 86.5 25306   9.9

Word to the wise: when dealing with spatial data that is organized by county, it is much easier to use the fips code than the county name, because there are variances in capitalization and other issues that come up when trying to use county names. We will return to this combined data set in order to visualize it in Section 6.9

5.7 The apply family

The focus of this book is on using dplyr tools when manipulating data sets. However, there will be times when it will be useful to know the base R tools for doing some of the same tasks. At a minimum, it can be useful to know these tools when reading other people’s code! The apply family consists of quite a few functions, including apply, sapply, lapply, vapply, tapply, mapply and even replicate! In this section, we will focus on apply and sapply.

All of the functions in the apply family are implicit loops, like replicate. The typical usage of apply is apply(X, MARGIN, FUN). The function apply applies the function FUN to either the rows or columns of X, depending on whether MARGIN is 1 (rows) or 2 (columns). So, the argument X is typically a matrix or data frame, MARGIN is either 1 or 2, and FUN is a function, such as sum or mean.

As an example, consider the data set USJudgeRatings, which contains the ratings of US judges by lawyers on various facets. Suppose you wanted to compute the mean rating of each judge. That is easy, using the apply function.

head(USJudgeRatings)

##                CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
## AARONSON,L.H.   5.7  7.9  7.7  7.3  7.1  7.4  7.1  7.1  7.1  7.0  8.3  7.8
## ALEXANDER,J.M.  6.8  8.9  8.8  8.5  7.8  8.1  8.0  8.0  7.8  7.9  8.5  8.7
## ARMENTANO,A.J.  7.2  8.1  7.8  7.8  7.5  7.6  7.5  7.5  7.3  7.4  7.9  7.8
## BERDON,R.I.     6.8  8.8  8.5  8.8  8.3  8.5  8.7  8.7  8.4  8.5  8.8  8.7
## BRACKEN,J.J.    7.3  6.4  4.3  6.5  6.0  6.2  5.7  5.7  5.1  5.3  5.5  4.8
## BURNS,E.B.      6.2  8.8  8.7  8.5  7.9  8.0  8.1  8.0  8.0  8.0  8.6  8.6

apply(USJudgeRatings, 1, mean)

##   AARONSON,L.H.  ALEXANDER,J.M.  ARMENTANO,A.J.     BERDON,R.I.    BRACKEN,J.J. 
##        7.291667        8.150000        7.616667        8.458333        5.733333 
##      BURNS,E.B.   CALLAHAN,R.J.      COHEN,S.S.       DALY,J.J.    DANNEHY,J.F. 
##        8.116667        8.858333        5.458333        8.516667        7.891667 
##       DEAN,H.H.     DEVITA,H.J.   DRISCOLL,P.J.     GRILLO,A.E.  HADDEN,W.L.JR. 
##        7.458333        7.125000        7.366667        6.683333        7.850000 
##     HAMILL,E.C.     HEALEY.A.H.       HULL,T.C.       LEVINE,I.   LEVISTER,R.L. 
##        7.450000        6.866667        7.400000        7.808333        6.608333 
##     MARTIN,L.F.    MCGRATH,J.F.    MIGNONE,A.F.     MISSAL,H.M.     MULVEY,H.M. 
##        7.091667        6.783333        5.841667        7.458333        8.450000 
##      NARUK,H.J.    O'BRIEN,F.J. O'SULLIVAN,T.J.       PASKEY,L.    RUBINOW,J.E. 
##        8.783333        7.941667        8.483333        8.066667        8.791667 
##      SADEN.G.A. SATANIELLO,A.G.       SHEA,D.M.    SHEA,J.F.JR.      SIDOR,W.J. 
##        7.775000        7.800000        8.191667        8.500000        5.808333 
##   SPEZIALE,J.A.     SPONZO,M.J.  STAPLETON,J.F.      TESTO,R.J. TIERNEY,W.L.JR. 
##        8.183333        7.841667        7.683333        7.108333        7.983333 
##       WALL,R.A.     WRIGHT,D.B.   ZARRILLI,K.J. 
##        7.016667        7.941667        7.425000

Then, to pull the highest mean rating:

max(apply(USJudgeRatings, 1, mean))

## [1] 8.858333

which.max(apply(USJudgeRatings,1,mean))

## CALLAHAN,R.J. 
##             7

We see that RJ Callahan had the highest rating of 8.858333.

Similarly, it would be easy to compute the mean rating of all of the judges in each category.

apply(USJudgeRatings, 2, mean)

##     CONT     INTG     DMNR     DILG     CFMG     DECI     PREP     FAMI 
## 7.437209 8.020930 7.516279 7.693023 7.479070 7.565116 7.467442 7.488372 
##     ORAL     WRIT     PHYS     RTEN 
## 7.293023 7.383721 7.934884 7.602326

And we see that judges scored highest on the Integrity scale.

The sapply function has as its typical usage sapply(X, FUN), where X is a vector, and FUN is the function that we wish to act on each element of X. The function sapply then stores the results of the function FUN in as simplified a version as possible. (This leads sapply to get a bad name when writing source code, but is very useful when you are doing exploratory analysis.) As a simple example, let’s create a vector containing the numbers 1 through 10, squared.

sapply(X = 1:10, FUN = function(x) x^2)

##  [1]   1   4   9  16  25  36  49  64  81 100

Note that sapply takes each element of the vector X, squares it, and stores the squared value in a vector. The symbol x in function(x) x^2 is a dummy variable, and can be replaced by any legal R variable name.

A common usage of sapply in interactive coding is to take advantage of a known structure of a data frame. For example, suppose we wanted to scale all of the numeric variables in iris so that they have mean 0 and standard deviation 1. It would be simple enough to just look at iris to determine which are numeric, but we can automate it for use with other data frames.

is_numeric_col <- sapply(1:ncol(iris), function(x) is.numeric(iris[1,x]))
numeric_col <- which(is_numeric_col)
iris[,numeric_col] <- apply(iris[,numeric_col], 2, function(x) (x - mean(x))/sd(x))

Finally, we note that replicate is a “convenience wrapper” for a common usage of sapply, so that replicate(100, {expr}) is equivalent to sapply(1:100, function(x) {expr}), where in this case, the function inside of sapply will have no dependence on x.

5.8 Data science communities

If you are serious about learning data science, the best thing you can do is to practice with real data. Finding appropriate data to practice on can be a challenge for beginners, but happily the R world abounds with online communities that share interesting data. Here are a few:

Kaggle (kaggle.com)
Tidy Tuesday (Twitter: #TidyTuesday)
Data is plural (https://tinyletter.com/data-is-plural)

Both beginners and experts post visualizations, example code, and discussions of data from these sources regularly. Look at other developeRs code and decide what you like, and what you don’t. Incorporate their ideas into your own work!

5.9 Exercises

Consider the mpg data set in the ggplot2 package.
1. Which car(s) had the highest highway gas mileage? (For the purposes of this question, consider each observation a different car.)
2. Compute the mean city mileage for compact cars.
3. Compute the mean city mileage for each class of cars, and arrange in decreasing order.
4. Which cars have the smallest absolute difference between highway mileage and city mileage? (For the purposes of this question, consider each observation a different “car”.)
5. Compute the mean highway mileage for each year, and arrange in decreasing order.
This question uses the DrinksWages from the HistData library. This data, gathered by Karl Pearson in 1910, was a survey of people working in various trades (bakers, plumbers, goldbeaters, etc.). The trades are assigned class values of A, B, or C based on required skill. For each trade, he counted the number of workers who drink (drinks), number of sober workers (sober), and recorded wage information (wage). There is also a column n = drinks + sober which is the total number of workers surveyed for each trade.
1. Compute the mean wages for each class, A, B, and C.
2. Find the three trades with the highest proportion of drinkers. Consider only trades with 10 or more workers in the survey.
Consider the dataset oly12 from the VGAMdata package (that you will probably need to install). It has individual competitor information from the Summer 2012 London Olympic Games.
1. According to this data, which country won the most medals? How many did that country win? (You need to sum Gold, Silver, and Bronze)
2. Which countries were the heaviest? Compute the mean weight of male athletes for all countries with at least 10 competitors, and report the top three.
This exercise uses the billboard data from the tidyr library.
1. Which artist had the most tracks on the charts in 2000?
2. Which track from 2000 spent the most weeks at #1? (This requires tidying the data as described in section ??)

Exercises 5 - 9 all use the movieLensData from this textbook’s data collection.

What is the movie with the highest mean rating that has been rated at least 30 times?
Which genre has been rated the most? (For the purpose of this, consider Comedy and Comedy|Romance completely different genres, for example.)
Which movie in the genre Comedy|Romance that has been rated at least 75 times has the lowest mean rating? Which has the highest mean rating?
Which movie that has a mean rating of 4 or higher has been rated the most times?
Which user gave the highest mean ratings?

Exercises 10 - 15 all use the Batting data set from the Lahman library. This gives the batting statistics of every player who has played baseball from 1871 through the present day.

Which player has been hit-by-pitch the most number of times?
How many doubles were hit in 1871?
Which team has the most number of home runs?
Which player who has played in at least 500 games has scored the most number of runs per game?
1. Which player has the most lifetime at bats without ever having hit a home run?
2. Which active player has the most lifetime at bats without ever having hit a home run? (An active player is someone with an entry in the most recent year of the data set).
Make sure to take into account stint in these problems.
1. Verify that Curtis Granderson hit the most triples in a single season since 1960.
2. In which season did the major league leader in triples have the fewest triples?
3. In which season was there the biggest difference between the major league leader in stolen bases (SB) and the player with the second most stolen bases?

Exercises 16 - 23 all use the Pitching data set from the Lahman library. This gives the pitching statistics of every pitcher who has played baseball from 1871 through the present day.

1. Which pitcher has won (W) the most number of games?
2. Which pitcher has lost (L) the most number of games?
Which pitcher has hit the most opponents with a pitch (HBP)?
Which year had the most number of complete games (CG)?
Among pitchers who have won at least 100 games, which has the highest winning percentage? (Winning percentage is wins divided by wins + losses.)
Among pitchers who have struck out at least 500 batters, which has the highest strikeout to walk ratio? (Strikeout to walk ratio is SO/BB.)
List the pitchers for the St Louis Cardinals (SLN) in 2006 with at least 30 recorded outs (IPouts), sorted by ERA from lowest to highest.
A balk (BK) is a subtle technical mistake that a pitcher can make when throwing. What were the top five years with the most balks in major league baseball history? Why was 1988 known as “the year of the balk”?
Which pitcher has the most outs pitched (IPouts) of all the pitchers who have more bases on balls (BB) than hits allowed (H)? Who has the second most?

Exercises 24 - 27 require the stringr library.

The data set words is built in to the stringr package.
1. How many words in this data set contain “ff”
2. What percentage of these words start with “s”?
The data set sentences is built in to the stringr package.
1. What percentage of sentences contain the string “the”?
2. What percentage of sentences contain the word “the” (so, either “the” or “The”)?
3. What percentage of sentences start with the word “The”?
4. Find the one sentence that has both an “x” and a “q” in it.
5. Which words are the most common words that a sentence ends with?
The data set fruit is built in to the stringr package.
1. How many fruits have the word “berry” in their name?
2. Some of these fruit have the word “fruit” in their name. Find these fruit and remove the word “fruit” to create a list of words that can be made into fruit. (Hint: use str_remove)
This problem uses the babynames data frame from the babynames library. Say that a name is popular if it was given to 1000 or more babies of a single sex. How many popular female names were there in 2015? What percentage of these popular female names ended in the letter ‘a’?

Rating should almost certainly have been reclassified as an ordered factor, and it is not clear that taking the mean is really a valid thing to do.↩