As usual, all code below should follow the style guidelines from the lecture slides.
For this short lab, we will be using Project Gutenberg’s The Complete Works of William Shakespeare. Use the command read_lines()
from the readr
package to read the text available at “https://www.gutenberg.org/files/100/100-0.txt”. Make sure to store the text as a variable. Use the skip
argument to discard the first 23 lines of extra info.
1a. Print the first 5 lines.
1b. Print the total number of lines.
1c. Remove all empty lines, then print the total number of lines.
(Hint: to remove empty elements from a string vector x, you could use x <- x[x != ""]
)
2a. Use str_c()
to collapse the Shakespeare string vector into one large string. (Don’t try to print it!)
2b. Use str_split()
to separate your string into words.
(Hint: you might get a list of length 1 that you have to convert to a vector. You could do this by using something like x <- unlist(x)
or x <- x[[1]]
)
2c. Use a combination of table()
and sort(..., decreasing = TRUE)
argument to get a count of the unique words in Shakespeare’s complete works and print out the 10 most common words.
3a. Use the code below to load the movies
data, courtesy of the-numbers.com
. Turn the genre
and mpaa_rating
variables into factors.
movies <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-10-23/movie_profit.csv")
3b. Collapse the Drama
and Horror
levels of genre
into one Drama_Horror
level.
3c. Create a new factor variable in the movies tibble, audience
, that takes the value "all ages"
for G and PG movies, "Teens and adults"
for PG-13 movies, and "Adults only"
for R movies.
4a. Convert the release_date
variable into a column of Date
objects using an appropriate function.
4b. Create a new column for year
that extracts the year of release for each movie.