As usual, all code below should follow the style guidelines from the lecture slides.

Part 1. Read in Text Data

For this short lab, we will use Project Gutenberg’s The Complete Works of William Shakespeare. Use the read_lines() function from the readr package to read the text available at “https://www.gutenberg.org/files/100/100-0.txt”, and store the result in a variable. Use the skip argument to discard the first 23 lines, which contain extra header information rather than the text itself.
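To see how the skip argument behaves before pointing read_lines() at the real URL, here is a minimal sketch on a small temporary file. The file contents and the names tmp and toy_text are made up for illustration; only read_lines() and its skip argument come from the instructions above.

```r
library(readr)

# Hypothetical file: two header lines, then the content we actually want.
tmp <- tempfile(fileext = ".txt")
writeLines(
  c("header line 1", "header line 2", "first real line", "", "second real line"),
  tmp
)

# skip = 2 drops the first two lines; each remaining line becomes one element.
toy_text <- read_lines(tmp, skip = 2)

head(toy_text, 2)   # peek at the first two remaining lines
length(toy_text)    # how many lines were read
```

The same pattern should carry over to the Gutenberg URL, with skip set to the number of header lines given above.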

1a. Print the first 5 lines.

1b. Print the total number of lines.

1c. Remove all empty lines, then print the total number of lines.

(Hint: to remove empty elements from a string vector x, you could use x <- x[x != ""])
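As a quick illustration of the hint, the sketch below applies the same subsetting to a toy vector. The vector toy_lines is a hypothetical stand-in, not the real text.

```r
# Toy stand-in for the vector returned by read_lines().
toy_lines <- c("THE SONNETS", "", "by William Shakespeare", "", "")

n_before <- length(toy_lines)             # count including empty strings

# x != "" gives a logical vector; subsetting keeps only the TRUE positions.
toy_lines <- toy_lines[toy_lines != ""]

n_after <- length(toy_lines)              # count after dropping empty strings
c(before = n_before, after = n_after)
```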

Part 2. String Manipulation

2a. Use str_c() to collapse the Shakespeare string vector into one large string. (Don’t try to print it!)

2b. Use str_split() to separate your string into words.

(Hint: you might get a list of length 1 that you have to convert to a vector. You could do this by using something like x <- unlist(x) or x <- x[[1]])
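The collapse-then-split workflow in 2a and 2b can be sketched on a tiny example. The input toy_lines and the choice of a single space as the separator are assumptions for illustration; with real text you may want a different pattern, since splitting on a space leaves punctuation attached to words.

```r
library(stringr)

toy_lines <- c("to be or not", "to be")

# collapse = " " joins all elements into one string, separated by spaces.
toy_string <- str_c(toy_lines, collapse = " ")

# str_split() returns a list with one element per input string,
# so a single input string gives a list of length 1.
toy_split <- str_split(toy_string, " ")

# Pull out the character vector of words (unlist(toy_split) also works).
toy_words <- toy_split[[1]]
toy_words
```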

2c. Use table() together with sort(..., decreasing = TRUE) to count how often each unique word appears in Shakespeare’s complete works, then print the 10 most common words.
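A minimal sketch of counting and ranking with table() and sort(), using a made-up word vector (toy_words is hypothetical):

```r
toy_words <- c("to", "be", "or", "not", "to", "be", "to")

# table() counts occurrences of each unique value;
# sort(..., decreasing = TRUE) puts the most frequent values first.
word_counts <- sort(table(toy_words), decreasing = TRUE)

# head() keeps the top n entries; here the two most common words.
head(word_counts, 2)
```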

Part 3. Factors

3a. Use the code below to load the movies data, courtesy of the-numbers.com. Turn the genre and mpaa_rating variables into factors.

movies <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-10-23/movie_profit.csv")

3b. Collapse the Drama and Horror levels of genre into one Drama_Horror level.

3c. Create a new factor variable in the movies tibble, audience, that takes the value "all ages" for G and PG movies, "Teens and adults" for PG-13 movies, and "Adults only" for R movies.
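One possible approach to 3a through 3c uses fct_collapse() from the forcats package, sketched here on a small made-up data frame. The data in toy_movies are hypothetical, and fct_collapse() is only one option; other recoding functions would work as well.

```r
library(forcats)

# Hypothetical miniature version of the movies data.
toy_movies <- data.frame(
  genre = c("Drama", "Horror", "Comedy", "Drama"),
  mpaa_rating = c("R", "PG-13", "G", "PG"),
  stringsAsFactors = FALSE
)

# Convert the character columns to factors.
toy_movies$genre <- factor(toy_movies$genre)
toy_movies$mpaa_rating <- factor(toy_movies$mpaa_rating)

# Merge two existing levels into one new level: new_name = c(old levels).
toy_movies$genre <- fct_collapse(
  toy_movies$genre,
  Drama_Horror = c("Drama", "Horror")
)

# Build a new factor by grouping rating levels under new labels.
toy_movies$audience <- fct_collapse(
  toy_movies$mpaa_rating,
  "all ages" = c("G", "PG"),
  "Teens and adults" = "PG-13",
  "Adults only" = "R"
)

levels(toy_movies$genre)
toy_movies$audience
```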

Part 4. Dates

4a. Convert the release_date variable into a column of Date objects using an appropriate function.

4b. Create a new column, year, that contains the year of release for each movie.
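The sketch below shows one way to parse dates and extract the year with the lubridate package. It assumes the dates are stored as month/day/year strings; inspect release_date yourself to confirm the format before choosing a parsing function. The vector toy_dates is made up for illustration.

```r
library(lubridate)

# Hypothetical date strings in month/day/year order.
toy_dates <- c("6/22/2007", "12/25/1999")

# mdy() parses month-day-year strings into Date objects.
toy_parsed <- mdy(toy_dates)
class(toy_parsed)

# year() extracts the year component from each Date.
toy_years <- year(toy_parsed)
toy_years
```

If the strings turn out to be in a different order, lubridate provides matching parsers such as ymd() and dmy().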