class: center, title-slide # STAT 302, Lecture Slides 2 ## Programming Fundamentals ### Peter Gao (adapted from slides by Bryan Martin) --- # Before class Discuss with your neighbor: ```r to_do_list <- list(saturday = c("sleep", "cook noodles", "eat noodles"), sunday = c("wake up", "homework", "course prep")) # what are each of the following? to_do_list[1] to_do_list[1] to_do_list[1][1] to_do_list[[1]] to_do_list[1][[1]] to_do_list[[1]][2] ``` -- ``` ## $saturday ## [1] "sleep" "cook noodles" "eat noodles" ``` ``` ## $saturday ## [1] "sleep" "cook noodles" "eat noodles" ``` ``` ## $saturday ## [1] "sleep" "cook noodles" "eat noodles" ``` --- # Before class ```r to_do_list <- list(saturday = c("sleep", "cook noodles", "eat noodles"), sunday = c("wake up", "homework", "course prep")) # what are each of the following? to_do_list[[1]] ``` ``` ## [1] "sleep" "cook noodles" "eat noodles" ``` ```r to_do_list[1][[1]] ``` ``` ## [1] "sleep" "cook noodles" "eat noodles" ``` ```r to_do_list[[1]][2] ``` ``` ## [1] "cook noodles" ``` --- # Before class How could I write a function that takes in a whole number `x` and outputs all of its factors? ```r get_factors <- function(x) { ... } ``` --- # Before class How could I write a function that takes in a whole number `x` and outputs all of its factors? ```r get_factors <- function(x) { factors <- numeric() i <- 1 while(i <= x) { if (x %% i == 0) { factors <- c(factors, i) } i <- i + 1 } return(factors) } ``` --- # Before class How could I write a function that takes in a whole number `x` and outputs all of its factors? ```r get_factors(10) ``` ``` ## [1] 1 2 5 10 ``` --- # Outline 1. Control flow: `if`, `else`, `while` 2. Loops: `for` and `while` 3. Functions 4. Packages 5. Data 6. Managing Data .middler[**Goal:** Use R's programming capabilities to build efficient functions and workflows.] 
---
class: inverse

.sectionhead[Part 1: Control Flow]

---
layout: true

# Control Flow

---

## `if` statements

* `if` statements give our computer conditions for chunks of code
* If our condition is `TRUE`, then the chunk evaluates
* If our condition is `FALSE`, then it does not
* We must give our condition as a single Boolean (`TRUE` or `FALSE`)

---

## `if` statements

```r
x <- 1
# Conditions go in parentheses after if
if (x > 0) {
  # code chunks get surrounded by curly brackets
  print(paste0("x is equal to ", x, ", a positive number!"))
}
```

--

```
## [1] "x is equal to 1, a positive number!"
```

---

## `if` statements

```r
x <- -1
# Conditions go in parentheses after if
if (x > 0) {
  # code chunks get surrounded by curly brackets
  print(paste0("x is equal to ", x, ", a positive number!"))
}
```

---

## `else` statements

* We can use `else` to specify what we want to happen when our condition is `FALSE`

```r
x <- 1
if (x > 0) {
  print(paste0("x is equal to ", x, ", a positive number!"))
} else {
  print(paste0("x is equal to ", x, ", a negative number!"))
}
```

--

```
## [1] "x is equal to 1, a positive number!"
```

---

## `else` statements

* We can use `else` to specify what we want to happen when our condition is `FALSE`

```r
x <- -1
if (x > 0) {
  print(paste0("x is equal to ", x, ", a positive number!"))
} else {
  print(paste0("x is equal to ", x, ", a negative number!"))
}
```

```
## [1] "x is equal to -1, a negative number!"
```

---

## `else if`

* Use `else if` to set a sequence of conditions
* The final `else` handles every case the earlier conditions did not catch

```r
x <- 1
if (x > 0) {
  paste0("x is equal to ", x, ", a positive number!")
} else if (x < 0) {
  paste0("x is equal to ", x, ", a negative number!")
} else {
  paste0("x is equal to ", x, "!")
}
```

```
## [1] "x is equal to 1, a positive number!"
``` --- ## `else if` ```r x <- -1 if (x > 0) { paste0("x is equal to ", x, ", a positive number!") } else if (x < 0) { paste0("x is equal to ", x, ", a negative number!") } else { paste0("x is equal to ", x, "!") } ``` ``` ## [1] "x is equal to -1, a negative number!" ``` --- ## `else if` ```r x <- 0 if (x > 0) { paste0("x is equal to ", x, ", a positive number!") } else if (x < 0) { paste0("x is equal to ", x, ", a negative number!") } else { paste0("x is equal to ", x, "!") } ``` ``` ## [1] "x is equal to 0!" ``` --- layout: false layout: true # Control Flow: Examples --- ## Divisibility Suppose we want to check if `x` is divisible by 5 and print out the answer. What should `CONDITION` be? ```r x <- 5 if (CONDITION) { IF TRUE, DO THIS } else { ELSE, DO THIS } ``` --- ## Divisibility Suppose we want to check if `x` is divisible by 5 and print out the answer. What should `CONDITION` be? ```r x <- 5 # modulo operator if (x %% 5 == 0) { print("divisible by 5") } else { print("not divisible by 5") } ``` -- ``` ## [1] "divisible by 5" ``` --- ## Check length of strings Note: We will need the `stringr` package for this ```r # Run this if you have never installed stringr before! # install.packages("stringr") library(stringr) ``` ```r x <- "cat" if (str_length(x) <= 10) { cat("x is a pretty short string!") } else { cat("x is a pretty long string!") } ``` ``` ## x is a pretty short string! ``` --- ```r x <- "A big fluffy cat with orange fur and stripes" if (str_length(x) <= 10) { cat("x is a pretty short string!") } else { cat("x is a pretty long string!") } ``` ``` ## x is a pretty long string! ``` --- ## Check class ```r x <- 5 if (is.numeric(x)) { cat("x is a numeric!") } else if (is.character(x)) { cat("x is a character!") } else { cat("x is some class I didn't check for in my code!") } ``` ``` ## x is a numeric! 
```

---

## Check class

```r
x <- list()
if (is.numeric(x)) {
  cat("x is a numeric!")
} else if (is.character(x)) {
  cat("x is a character!")
} else {
  cat("x is some class I didn't check for in my code!")
}
```

```
## x is some class I didn't check for in my code!
```

---
layout: false
class: inverse

.sectionhead[Part 2: Loops]

---
layout: true

# Loops

---

## `for` loops

`for` loops iterate along an input vector, store the current value of the vector as a variable, and repeatedly evaluate a code chunk until the vector is exhausted

```r
for (i in 1:8) {
  print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
```

---

## `while` loops

`while` loops repeatedly evaluate the inner code chunk until the condition is `FALSE`.

Be careful here! If the condition never becomes `FALSE`, you will get stuck in an infinite loop!

```r
x <- 0
while (x < 5) {
  cat("x is currently", x, ". Let's increase it by 1!")
  x <- x + 1
}
```

```
## x is currently 0 . Let's increase it by 1!x is currently 1 . Let's increase it by 1!x is currently 2 . Let's increase it by 1!x is currently 3 . Let's increase it by 1!x is currently 4 . Let's increase it by 1!
```

---

## `while` loops

Let's see if we can clean up that output. Add `"\n"` to a string to force a line break.

```r
x <- 0
while (x < 5) {
  cat("x is currently ", x, ". Let's increase it by 1! \n", sep = "")
  x <- x + 1
}
```

```
## x is currently 0. Let's increase it by 1! 
## x is currently 1. Let's increase it by 1! 
## x is currently 2. Let's increase it by 1! 
## x is currently 3. Let's increase it by 1! 
## x is currently 4. Let's increase it by 1! 
``` --- layout: false layout: true # Loops: Examples --- ## String Input ```r string_vector <- c("a", "b", "c", "d", "e") for (mystring in string_vector) { print(mystring) } ``` ``` ## [1] "a" ## [1] "b" ## [1] "c" ## [1] "d" ## [1] "e" ``` --- ## Nested Loops ```r counter <- 0 for (i in 1:3) { for (j in 1:2) { counter <- counter + 1 cat("i = ", i, ", j = ", j, ", counter = ", counter, "\n", sep = "") } } ``` ``` ## i = 1, j = 1, counter = 1 ## i = 1, j = 2, counter = 2 ## i = 2, j = 1, counter = 3 ## i = 2, j = 2, counter = 4 ## i = 3, j = 1, counter = 5 ## i = 3, j = 2, counter = 6 ``` --- ## Nested Loops ```r for (i in 1:3) { for (j in 1:2) { print(i * j) } } ``` ``` ## [1] 1 ## [1] 2 ## [1] 2 ## [1] 4 ## [1] 3 ## [1] 6 ``` --- ## Filling in a vector Note: Usually, this is an inefficient way to do this! Try to vectorize code wherever possible! ```r # Inefficient x <- rep(NA, 5) for (i in 1:5) { x[i] <- i * 2 } x ``` ``` ## [1] 2 4 6 8 10 ``` ```r # Much better x <- seq(2, 10, by = 2) x ``` ``` ## [1] 2 4 6 8 10 ``` --- ## Filling in a vector ```r library(stringr) x <- rep(NA, 5) my_strings <- c("a", "a ", "a c", "a ca", "a cat") for (i in 1:5) { x[i] <- str_length(my_strings[i]) print(x) } ``` ``` ## [1] 1 NA NA NA NA ## [1] 1 2 NA NA NA ## [1] 1 2 3 NA NA ## [1] 1 2 3 4 NA ## [1] 1 2 3 4 5 ``` --- ## Filling in a matrix Note: Usually, this is an inefficient way to do this! Try to vectorize code wherever possible! 
```r x <- matrix(NA, nrow = 4, ncol = 3) for (i in 1:4) { for (j in 1:3) { x[i, j] <- i * j } } x ``` ``` ## [,1] [,2] [,3] ## [1,] 1 2 3 ## [2,] 2 4 6 ## [3,] 3 6 9 ## [4,] 4 8 12 ``` --- ## Continue until positive sample ```r set.seed(3) x <- -1 while (x < 0) { x <- rnorm(1) print(x) } ``` ``` ## [1] -0.9619334 ## [1] -0.2925257 ## [1] 0.2587882 ``` ```r x ``` ``` ## [1] 0.2587882 ``` --- layout: false class: inverse .sectionhead[Part 3: Functions] --- layout: true # Functions --- We've already seen and used several functions, but you can also create your own! This is incredibly useful when: * You use the same code chunk repeatedly * You want to generalize your workflow to multiple inputs * You want others to be able to use your code * You want to complete your assignments for STAT 302 --- ## Anatomy of a function ```r function_name <- function(param1, param2 = "default") { # Body of the function return(output) } ``` * `function_name`: the name you want to give your function, what you will use to call it * `function()`: call this to define a function * `param1`, `param2`: function parameters, what the user inputs. You can assign default values by setting them equal to something in the function definition * **Body**: the actual code that is executed * `return()`: is what your function will return to the user --- layout: false layout: true # Functions: Examples --- ## Square a number, add 2 ```r square_plus_2 <- function(x) { y <- x^2 + 2 return(y) } square_plus_2(4) ``` ``` ## [1] 18 ``` ```r square_plus_2(10) ``` ``` ## [1] 102 ``` ```r square_plus_2(1:5) ``` ``` ## [1] 3 6 11 18 27 ``` --- ```r square_plus_2("some string") ``` ``` ## Error in x^2: non-numeric argument to binary operator ``` What happened here? We wrote a function for numerics only but didn't check the input! 
--- Let's try making our function more robust by adding a `stop` ```r square_plus_2 <- function(x) { if (!is.numeric(x)) { stop("x must be numeric!") } else { y <- x^2 + 2 return(y) } } square_plus_2("some string") ``` ``` ## Error in square_plus_2("some string"): x must be numeric! ``` --- ## Check if the input is positive ```r check_pos <- function(x) { if (x > 0) { return(TRUE) } else if (x < 0) { return(FALSE) } else { return(paste0("x is equal to ", x, "!")) } } check_pos(-3) ``` ``` ## [1] FALSE ``` ```r store_output <- check_pos(0) store_output ``` ``` ## [1] "x is equal to 0!" ``` --- ## Make a table We'll use `str_c` from the `stringr` package for this function. ```r library(stringr) my_summary <- function(input, percentiles = c(.05, .5, .95)) { if (!is.numeric(input) | !is.numeric(percentiles)) { stop("The input and percentiles must be numeric!") } if (max(percentiles) > 1 | min(percentiles) < 0) { stop("Percentiles must all be in [0, 1]") } # Convert percentiles to character percent, append " Percentile" to each labels <- str_c(percentiles * 100, " Percentile") output <- quantile(input, probs = percentiles) names(output) <- labels return(output) } ``` --- ## Make a table ```r x <- rnorm(100) my_summary(x) ``` ``` ## 5 Percentile 50 Percentile 95 Percentile ## -1.22236488 0.06183487 1.22655423 ``` ```r my_summary(x, percentiles = c(.07, .5, .63, .91)) ``` ``` ## 7 Percentile 50 Percentile 63 Percentile 91 Percentile ## -1.13785677 0.06183487 0.36358152 1.16185072 ``` --- ## Make a table ```r my_summary(c("string1", "string2")) ``` ``` ## Error in my_summary(c("string1", "string2")): The input and percentiles must be numeric! 
```

```r
my_summary(x, percentiles = c(-7, .5, 1.3))
```

```
## Error in my_summary(x, percentiles = c(-7, 0.5, 1.3)): Percentiles must all be in [0, 1]
```

---

## Function with iteration

```r
my_sum <- function(x) {
  total <- 0
  # seq_along(x) is safe even when x is empty (1:length(x) would give 1, 0)
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  return(total)
}
my_sum(1:5)
```

```
## [1] 15
```

---
layout: false
class: inverse

.sectionhead[Style guide!]

---
layout: true

# Style guide!

---

.middler[Once again, we will be using a mix of the [Tidyverse style guide](https://style.tidyverse.org/) and the [Google style guide](https://google.github.io/styleguide/Rguide.html).]

---

## Function Names

Strive to have function names based on verbs. Otherwise, standard variable name style guidelines apply!

```r
# Good
add_row()
permute()

# Bad
row_adder()
permutation()
```

---

## Spacing

Place a space before and after `()` when used with `if`, `for`, or `while`.

```r
# Good
if (condition) {
  x + 2
}

# Bad
if(condition){
  x + 2
}
```

---

## Spacing

Place a space after `()` used for function arguments.

```r
# Good
if (debug) {
  show(x)
}

# Bad
if(debug){
  show(x)
}
```

---

## Code Blocks

* `{` should be the last character on the line. Related code (e.g., an `if` clause, a function declaration, a trailing comma, ...) must be on the same line as the opening brace. It should be preceded by a single space.
* The contents of a code block should be indented by two spaces from where the block started
* `}` should be the first character on the line.

---

## Code Blocks

```r
# Good
if (y < 0) {
  message("y is negative")
}

if (y == 0) {
  if (x > 0) {
    log(x)
  } else {
    message("x is negative or zero")
  }
} else {
  y^x
}
```

---

## Code Blocks

```r
# Bad
if (y<0){
message("Y is negative")
}

if (y == 0)
{
    if (x > 0) {
      log(x)
    } else {
  message("x is negative or zero")
    }
} else { y ^ x }
```

---

## In-line Statements

In general, it's ok to drop the curly braces for very simple statements that fit on one line. However, function calls that affect control flow (`return`, `stop`, etc.)
should always go in their own `{}` block:

```r
# Good
y <- 10
x <- if (y < 20) "Too low" else "Too high"

if (y < 0) {
  stop("Y is negative")
}

find_abs <- function(x) {
  if (x > 0) {
    return(x)
  }
  x * -1
}
```

---

## In-line Statements

In general, it's ok to drop the curly braces for very simple statements that fit on one line. However, function calls that affect control flow (`return`, `stop`, etc.) should always go in their own `{}` block:

```r
# Bad
if (y < 0) stop("Y is negative")

if (y < 0)
  stop("Y is negative")

find_abs <- function(x) {
  if (x > 0) return(x)
  x * -1
}
```

---

## Long lines in functions

If a function definition runs over multiple lines, indent the second line to where the definition starts.

```r
# Good
long_function_name <- function(a = "a long argument",
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}

# Bad
long_function_name <- function(a = "a long argument",
  b = "another argument",
  c = "another long argument") {
  # Here it's hard to spot where the definition ends and the
  # code begins
}
```

---

## `return`

Strictly speaking, `return` is not necessary in a function definition. Without it, a function returns the value of the last expression it evaluates. The following function definitions will output the same results!

```r
Add_Values <- function(x, y) {
  return(x + y)
}

Add_Values <- function(x, y) {
  x + y
}
```

Note that our two guides disagree on which of these is preferable. Personally, I always make my `return` statements explicit, so I prefer the former.

---

## Commenting functions

For now, when commenting functions, include (at least) 3 lines of comments:

* a comment describing the purpose of a function
* a comment describing each input
* a comment describing the output

The function body should be commented as usual!
---

```r
# Good ----
# Function: square_plus_2, squares a number and then adds 2
# Input: x, must be numeric
# Output: numeric equal to x^2 + 2
square_plus_2 <- function(x) {
  # check that x is numeric
  if (!is.numeric(x)) {
    stop("x must be numeric!")
  } else {
    # if numeric, then square and add 2
    y <- x^2 + 2
    return(y)
  }
}
```

---

```r
# Bad ----
# Function for problem 2c
square_plus_2 <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be numeric!")
  } else {
    y <- x^2 + 2
    return(y)
  }
}
```

---
layout: false

# Summary

* Use `if` and `else` to set conditions
* Use `for` and `while` to write loops
* Functions include input parameters, a body of code, and an output
* Functions are essential for a good workflow!

---
class: inverse

.sectionhead[Part 4: Packages]

---
layout: true

# Packages

---

## What is an R package?

* Packages bundle together code, data, and documentation in an easy-to-share way.
* They come with functions that others have written for you to make your life easier, and greatly improve the power of R!
* Packages are the reason we are learning about R in this course.
* Packages range from graphical software and web scraping tools to statistical models for spatio-temporal data, microbial data analysis tools, and more!

---

## Where are packages?

* The most popular package repository is the Comprehensive R Archive Network, or [CRAN](https://cran.r-project.org/)
  * As of making this slide, it includes over 16,000 packages
* Other popular repositories include [Bioconductor](https://www.bioconductor.org/) and [GitHub](https://github.com/)

---

## How do I install packages?

If a package is available on CRAN, like most packages we will use for this course, you can install it using `install.packages()`:

```r
install.packages("PACKAGE_NAME_IN_QUOTES")
```

You can also install by clicking *Install* in the *Packages* tab through RStudio.
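To check from the console whether a package is already installed, one option is base R's `requireNamespace()`, which returns `TRUE` or `FALSE` without attaching the package. A minimal sketch; the helper name `is_installed` is made up for illustration:

```r
# Hypothetical helper (not part of base R): can this package be loaded?
is_installed <- function(pkg) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")               # ships with R, so this should be TRUE
is_installed("notARealPackage302")  # a made-up name, so this should be FALSE
```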
For the most part, after you install a package, it is saved on your computer until you update R, and you will not need to re-install it. Thus, you should **never** include a call to `install.packages()` in any `.R` or `.Rmd` file! --- ## How do I use a package? After a package is installed, you can load it into your current R session using `library()`: ```r library(PACKAGE_NAME) # or library("PACKAGE_NAME") ``` Note that unlike `install.packages()`, you do not need to include the package name in quotes. --- ## How do I use a package? Loading a package must be done with each new R session, so you should put calls to `library()` in your `.R` and `.Rmd` files. Usually, I do that in the opening code chunk. If it is a `.Rmd`, I set the parameter `include = FALSE` to hide the messages and code, because they are usually unnecessary to the reader of my HTML. ```{r, include = FALSE} library(ggplot2) ``` --- layout: false class: inverse .sectionhead[Part 5: Data] --- # Tibbles `tibbles` are a special Tidyverse data frame from the `tibble` package. You can convert data frames to tibbles using `as_tibble()`, or you can create them similarly to data frames using `tibble()`. The biggest benefit of tibbles is that they display nicer in your R console, automatically truncating output and including variable type to print nicely. Tidyverse has (rightfully) decided rownames are obsolete, and so they do not include rownames by default. However, we can include our rownames as a variable using the parameter `rownames` in `as_tibble()`. 
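As a quick illustration of creating a tibble directly, here is a minimal sketch using `tibble()`. It assumes the `tibble` package is installed; the object name `my_tibble2` and its values are made up for this example:

```r
library(tibble)

# Build a tibble column by column, much like data.frame()
my_tibble2 <- tibble(var1 = 1:3,
                     var2 = c("a", "b", "c"),
                     var3 = c(TRUE, FALSE, TRUE))

# Tibbles are still data frames, so the usual tools work on them
is_tibble(my_tibble2)
is.data.frame(my_tibble2)
dim(my_tibble2)
```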
--- # Tibbles ```r library(tibble) my_data <- data.frame("var1" = 1:3, "var2" = c("a", "b", "c"), "var3" = c(TRUE, FALSE, TRUE)) my_tibble <- as_tibble(my_data, rownames = "Observation") my_tibble ``` ``` ## # A tibble: 3 x 4 ## Observation var1 var2 var3 ## <chr> <int> <chr> <lgl> ## 1 1 1 a TRUE ## 2 2 2 b FALSE ## 3 3 3 c TRUE ``` --- layout: true # Tidy Data Principles --- There are three rules required for data to be considered tidy * Each variable must have its own column * Each observation must have its own row * Each value must have its own cell --- Seems simple, but can sometimes be tricky! What's untidy about the following data? <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Hospital </th> <th style="text-align:right;"> Diseased </th> <th style="text-align:right;"> Healthy </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 18 </td> </tr> <tr> <td style="text-align:left;"> C </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> D </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 16 </td> </tr> </tbody> </table> -- * **Observations:** the number of individuals at a given hospital and of a given disease status * **Variables:** the hospital, the disease status, the counts * **Values:** Hospital A, Hospital B, Hospital C, Hospital D, individual count values, *Disease Status Healthy*, *Disease Status Diseased* --- Problem: column headers are values, not variables! How can we tidy it up? 
-- <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Hospital </th> <th style="text-align:left;"> Status </th> <th style="text-align:right;"> Count </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> Diseased </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> Healthy </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:left;"> Diseased </td> <td style="text-align:right;"> 15 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:left;"> Healthy </td> <td style="text-align:right;"> 18 </td> </tr> <tr> <td style="text-align:left;"> C </td> <td style="text-align:left;"> Diseased </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> C </td> <td style="text-align:left;"> Healthy </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> D </td> <td style="text-align:left;"> Diseased </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> D </td> <td style="text-align:left;"> Healthy </td> <td style="text-align:right;"> 16 </td> </tr> </tbody> </table> --- Another example: <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Country </th> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> m1624 </th> <th style="text-align:right;"> m2534 </th> <th style="text-align:right;"> f1624 </th> <th style="text-align:right;"> f2534 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:right;"> 49 </td> <td style="text-align:right;"> 55 </td> <td style="text-align:right;"> 47 </td> <td style="text-align:right;"> 41 </td> </tr> <tr> <td style="text-align:left;"> B 
</td> <td style="text-align:right;"> 2018 </td> <td style="text-align:right;"> 34 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 43 </td> </tr> </tbody> </table> -- * **Observations:** the number of individuals in a given country, in a given year, of a given gender, and in a given age group * **Variables:** Country, year, gender, age group, counts * **Values:** Country A, Country B, Year 2018, Gender "m", Gender "f", Age Group "1624", Age Group "2534", individual counts --- <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Country </th> <th style="text-align:right;"> Year </th> <th style="text-align:left;"> Gender </th> <th style="text-align:left;"> Age_Group </th> <th style="text-align:right;"> Counts </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 16-24 </td> <td style="text-align:right;"> 49 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 25-34 </td> <td style="text-align:right;"> 55 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 16-24 </td> <td style="text-align:right;"> 47 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 25-34 </td> <td style="text-align:right;"> 41 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 16-24 </td> <td style="text-align:right;"> 34 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td 
style="text-align:right;"> 2018 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 25-34 </td> <td style="text-align:right;"> 33 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 16-24 </td> <td style="text-align:right;"> 50 </td> </tr> <tr> <td style="text-align:left;"> B </td> <td style="text-align:right;"> 2018 </td> <td style="text-align:left;"> f </td> <td style="text-align:left;"> 25-34 </td> <td style="text-align:right;"> 43 </td> </tr> </tbody> </table> --- ## How to tidy data? 1. Identify the observations, variables, and values 2. Ensure that each observation has its own row * Be careful for individual observations spread over multiple tables/Excel files/etc, or multiple types of observations within a single table (this would result in many empty cells) 3. Ensure that each variable has its own column * Be careful for variables spread over two columns, multiple variables within a single column, variables as rows 4. Ensure that each value has its own cell * Be careful for values as column headers --- ## Why tidy data? 
* Easier to read data * Easier to analyze and plot using standard software (required for `ggplot2`) * Easier to understand what the data represents * Fewer issues with missing values --- ## Using R to tidy data ``` ## ## Attaching package: 'dplyr' ``` ``` ## The following object is masked from 'package:kableExtra': ## ## group_rows ``` ``` ## The following objects are masked from 'package:stats': ## ## filter, lag ``` ``` ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ``` ``` ## # A tibble: 18 x 11 ## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Agnostic 27 34 60 81 76 137 122 ## 2 Atheist 12 27 37 52 35 70 73 ## 3 Buddhist 27 21 30 34 33 58 62 ## 4 Catholic 418 617 732 670 638 1116 949 ## 5 Don’t k… 15 14 15 11 10 35 21 ## 6 Evangel… 575 869 1064 982 881 1486 949 ## 7 Hindu 1 9 7 9 11 34 47 ## 8 Histori… 228 244 236 238 197 223 131 ## 9 Jehovah… 20 27 24 24 21 30 15 ## 10 Jewish 19 19 25 25 30 95 69 ## 11 Mainlin… 289 495 619 655 651 1107 939 ## 12 Mormon 29 40 48 51 56 112 85 ## 13 Muslim 6 7 9 10 9 23 16 ## 14 Orthodox 13 17 23 32 32 47 38 ## 15 Other C… 9 7 11 13 13 14 18 ## 16 Other F… 20 33 40 46 49 63 46 ## 17 Other W… 5 2 3 4 2 7 3 ## 18 Unaffil… 217 299 374 365 341 528 407 ## # … with 3 more variables: $100-150k <dbl>, >150k <dbl>, ## # Don't know/refused <dbl> ``` --- ## Using R to tidy data ``` ## # A tibble: 180 x 3 ## religion income frequency ## <chr> <chr> <dbl> ## 1 Agnostic <$10k 27 ## 2 Agnostic $10-20k 34 ## 3 Agnostic $20-30k 60 ## 4 Agnostic $30-40k 81 ## 5 Agnostic $40-50k 76 ## 6 Agnostic $50-75k 137 ## 7 Agnostic $75-100k 122 ## 8 Agnostic $100-150k 109 ## 9 Agnostic >150k 84 ## 10 Agnostic Don't know/refused 96 ## # … with 170 more rows ``` --- ## A final reference Hadley Wickham is the ultimate resource on tidy data principles. 
[Here is a fantastic reference going through all these principles in more detail and with more examples.](https://vita.had.co.nz/papers/tidy-data.pdf)

---
layout: false
class: inverse

.sectionhead[Part 6: Managing Data]

---
layout: true

# Working Directory

---

## Seeing your working directory

A **working directory** is the folder R uses by default when it saves and looks for files. You can check your current working directory using `getwd()`

```r
getwd()
```

```
## [1] "/Users/pgao/Dropbox/teaching/STAT302-AUT2021/files/slides"
```

This location is where R will look by default!

---

## Changing your working directory

You can change your working directory using `setwd()`.

```r
setwd("/Users/Peter/Desktop/STAT302")
```

You can use the shorthand `..` to reference a parent directory relative to where you are now.

```r
setwd("..")
getwd()
```

```
## [1] "/Users/pgao/Dropbox/teaching/STAT302-AUT2021/files"
```

---

## Changing your working directory

We can also reference the current directory using the shorthand `.`.

```r
setwd("./STAT302/Slides")
```

```r
getwd()
```

```
## [1] "/Users/pgao/Dropbox/teaching/STAT302-AUT2021/files/slides"
```

---

## Working directories and R Markdown

Do not change your working directory inside R Markdown files! By default, an R Markdown file uses the folder it is saved in as its working directory. Changing this can (will) mess up your analysis and make your work less reproducible.

---

## Saving Data

You can save single R objects as `.rds` files using `saveRDS()`, multiple R objects as `.RData` or `.rda` files using `save()`, and your entire workspace as `.RData` using `save.image()`.
```r
object1 <- 1:5
object2 <- c("a", "b", "c")
# save only object1
saveRDS(object1, file = "object1_only.rds")
# save object1 and object2
save(object1, object2, file = "both_objects.RData")
# save my entire workspace
save.image(file = "entire_workspace.RData")
```

---

## Saving Data

In general, I recommend using `.RData` for multiple objects, and I recommend against using `save.image()`, basically ever.

`save.image()` should never be a part of your workflow. Personally, I only use it if I need to quickly close R and want to come back to exactly where I was later (for example, when the coffee shop I was working at closed). I will always delete the file later so it does not mess with my workflow.

---

## Loading Data

You can load `.rds` files using `readRDS()` and `.RData` and `.rda` files using `load()`.

```r
# load only object1; readRDS() returns the object, so assign it to a name
object1 <- readRDS("object1_only.rds")
# load object1 and object2
load("both_objects.RData")
# load my entire workspace
load("entire_workspace.RData")
```

---

## Notes on Saving and Loading R Data

The values in quotes are all filepaths, and by default, R will search for these files in your current working directory.

You can change where R searches for files by adjusting this filepath. For example, if you save your data in a `Data` subfolder within your working directory, you might try

```r
load("./Data/my_data.RData")
```

---

## Other types of data

Often, you will read and write files as **c**omma **s**eparated **v**alues, or `.csv`. You can do this by navigating *File > Import Dataset* in the menu bar, but generally I recommend doing it in code using the `readr` package. You will need to read the data in code whenever loading data is part of your workflow, such as when it is required for an R Markdown writeup.

```r
library(readr)
# read a .csv file in a "Data" subfolder and store it as my_data
my_data <- read_csv("./Data/file.csv")
# save a .csv file in a "Data" subfolder
# (write_csv() needs the data frame to write as its first argument)
write_csv(my_data, "./Data/data_output.csv")
```

`readr` can also handle many more types of data!
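To try reading and writing without touching your own files, here is a minimal round-trip sketch. It assumes `readr` is installed; the data frame `scores` and the temporary file are made up for illustration:

```r
library(readr)

# A small, made-up data frame to write out
scores <- data.frame(name = c("a", "b", "c"), score = c(90, 85, 77))

# tempfile() gives a throwaway path, so nothing in your project is overwritten
csv_path <- tempfile(fileext = ".csv")

# write_csv() takes the data first, then the file path
write_csv(scores, csv_path)

# read_csv() returns a tibble; assign it to a name to keep it
scores_back <- read_csv(csv_path)
scores_back$score
```

Note that `read_csv()` also prints a short message describing the column types it guessed, which is a handy check that the file was read the way you expected.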
See more details about `readr` using the fantastic cheat sheet available [here.](https://rstudio.com/resources/cheatsheets/)

---

## Working Directories Summary

* Working directories are the default filepaths R uses to save and load files
* When working in a `.Rmd`, your default filepath is wherever the `.Rmd` is stored, and you should leave it there
* You can change your working directory with `setwd()`.
* You can reference your current working directory using `.` and the parent directory of your current working directory using `..`

For larger analysis projects, I recommend using R projects to automatically manage your working directory for you!

---
layout: false
layout: true

# Projects

---

Good file organization requires you to keep all your input data, R scripts, output data and results, and figures together. You can do this using **Projects**.

You can create a project by going to *File > New Project*. If you want your project in a folder you have already created, select *Existing Directory*. If you want RStudio to automatically make you a new folder with a project, select *New Directory*. Then select *Empty Project* to create a standard project. This will create a `.Rproj` file on your computer.

When working with a project, save and manage your work as usual. When you close and re-open R, *do so by double-clicking on your `.Rproj` file!* This will automatically open everything as you left it, except your environment will be fresh, helping with reproducibility.

---

## Benefits of Projects

* Automatically manages your working directory, even if you move the project file
* Remembers your working directory and command history, and keeps all the files you were working on open.
* Helps with reproducibility. You can share R project files and the project will load on other computers exactly as it does on yours.
* Helps keep your separate analyses separate.
For example, you won't need to worry if you defined a variable `x` in multiple different analyses * Easy to integrate with version control such as git (more on this later!)
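
Because a project keeps the working directory at the project folder, file paths in your scripts can stay relative. A minimal sketch, using a temporary folder as a stand-in for a project folder (the folder and file names here are made up):

```r
# Stand-in for a project folder; with a real project, opening the .Rproj file
# sets the working directory for you, so you would not call setwd() yourself
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "Data"), recursive = TRUE, showWarnings = FALSE)

old_wd <- setwd(project_dir)  # setwd() returns the previous working directory

# Relative paths built with file.path() work wherever the project folder lives
rds_path <- file.path("Data", "object1_only.rds")
saveRDS(1:5, file = rds_path)
object1 <- readRDS(rds_path)

setwd(old_wd)  # restore the original working directory
object1
```

Since nothing in the script mentions an absolute path like `/Users/...`, the same code should run unchanged after the project folder is moved or shared.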