Math Camp, Lab 3

Introduction
- More practice
Today’s (and future) objectives
Reading and reorganizing the data
Exploring and summarizing data with dplyr
Converting datasets/tables between long and wide format
- More practice
A plot
Perhaps a nicer plot
- More practice
Your turn (breakout rooms)
A sample report (and inline code)

*This lab draws from materials developed by Rebecca Ferrell, Max Schneider and Jess Kunke for CSSS 592 (Longitudinal Data Analysis).

Here are all the libraries we’ll use today. You can go ahead and install them (if you don’t have them) and load them.

# if you need to install, e.g., knitr: install.packages("knitr")
library(knitr) # for making tables
library(dplyr) # for the pipe operator %>% and the group_by and summarize functions
library(tidyr) # for the spread function
library(ggplot2) # for plotting

Introduction

Today we’ll work with a new data set, milk.csv, from a study in which cows were given different diets and the amount of protein in their milk was measured each week. Yesterday we worked with data from an R package, but we didn’t load data from a file, so here’s one of the many ways to do that:

milk <- read.table("milk.csv", header=TRUE, sep=",")

Reviewing some of the programming/R terms we used yesterday, here we’ve provided three arguments to the function read.table, two of which are named:

We supplied the first argument as an unnamed argument; it’s the filename of the data file we want to read in.
sep is the separator that delineates one column from another; in a csv (comma-separated value) file, values are separated by commas, so we used sep=",". Other common separators include tabs, a single space, or a semicolon.
If you take a look at the csv file in a text editor or in Excel, you’ll see that it has a header; it has the names of the columns across the top of the file. We use the header=TRUE argument in the read.table function to tell R to read that first line as variable names instead of as data.
- To test this, try running the line of code with header=FALSE instead.)
- What is the default option for the header argument? (Hint: try ?read.table.)

So now let’s review some skills from yesterday as part of exploring this dataset.

What are the variables/columns in this data set? What types do they have?
How many diets were tested in this study?
Over how many weeks did this study measure protein levels (at least as far as is reflected in this dataset)?

# you can type your code here for 1-3 above!

More practice

What does milk$Cow[milk$Diet=="Barley"] do?
Based on 1, how can you find out for what weeks Cow 3 has data in this dataset?
Can you find another cow that doesn’t have data for all 19 weeks?

Today’s (and future) objectives

Here are some other questions we might want to ask or things we might want to do that we haven’t yet discussed how to do; we’ll talk about some of these today and in the coming days depending on interest/time.

Is there any missing data? If there is, how can we handle that? (Note: there is a whole field of statistics studying how best to handle missing data in different situations.)
Compute the average protein level at each week across just the cows on a given diet.
Plot cows’ milk protein levels over time, grouped by diet, so that we can visually examine whether diet seems to make any difference.
How much variation do we see across cows on a given diet?

(What are some other questions you can think of to ask about this data set? What else might you want to learn how to do?)

Reading and reorganizing the data

This data already had a header, and the names are pretty intuitive. But sometimes one of those is not true for the dataset you’re working with, so here is how to rename your variables. First, let’s pretend we don’t have the header by skipping the first row and setting header=FALSE:

milk <- read.table("milk.csv", header=FALSE, skip=1, sep=",")
# check the variable names:
head(milk)

##       V1 V2 V3   V4
## 1 Barley  1  1 3.63
## 2 Barley  1  2 3.57
## 3 Barley  1  3 3.47
## 4 Barley  1  4 3.65
## 5 Barley  1  5 3.89
## 6 Barley  1  6 3.73

colnames(milk)

## [1] "V1" "V2" "V3" "V4"

# change the variable names
# note: both colnames() and names() here do the same thing
colnames(milk) <- c("Diet", "Cow", "Week", "Protein")
colnames(milk)

## [1] "Diet"    "Cow"     "Week"    "Protein"

Exploring and summarizing data with `dplyr`

Suppose we want to see how many observations we have for each diet- is it fairly balanced? Let’s see a way to do this using the dplyr library.

dplyr is an R package for manipulating data frames. It makes use of a special operator called a “pipe” (%>%) which means “take the object on the left and do the command on the right to it”. dplyr speeds up common operations, especially in data processing on particular variables in a data set. For example, with piping, you don’t need to keep retyping the name of a data frame when referencing its columns. This package also contains several useful commands for common data processing tasks.

# first let's see how this pipe works, using the example of the filter function
result1 <- filter(milk, Diet=="Barley")
# take a look at result1 here first, then run the next line
result2 <- milk %>% filter(Diet=="Barley")
sum(result1 != result2) # = 0, so the two results are identical

## [1] 0

# can also do identical(result1, result2)

Note that if your line ends with an operator like + or %>%, or with a comma, then R will expect your line to continue onto the next line; this is nice for splitting long or multi-step commands into multiple lines as you’ll see in the next (and other) examples.

So now let’s address our question: how many observations we have for each diet- is it fairly balanced?

n_obs_by_diet <- milk %>%
  group_by(Diet) %>%
  summarize(n=n())

The group_by() function groups the observations by the variable or variables you pass to it. This is like select() in that it doesn’t alter the data frame, just determines how the data are passed to the next function.
The summarize() function creates a new data frame with the results of evaluating the functions you pass as arguments to summarize() for each of the groups you formed previously using group_by(). (The documentation at ?summarize() will show you several of the functions you can pass to it.)
Here, the function passed to summarize() is n(), which computes the number of elements in each of the groups.

Since we group by diet and then summarize using the count function n(), the end result is a data frame whose rows are the diets, and each row of which contains the name of the diet and the number of observations for that diet (this includes all weeks, all cows).

We can display this in a table using the knitr package (or many other packages). Note: this table is formatted differently in PDF vs HTML knitted documents.

kable(n_obs_by_diet, caption="Number of observations by diet")

Number of observations by diet
Diet	n
Barley	425
Lupins	453
Mixed	459

These tools also allow us to tackle one of the problems we posed at the beginning: let’s compute the average protein level at each week for each diet.

avg_protein <- milk %>%
  group_by(Diet, Week) %>%
  summarize(avg_protein_value=mean(Protein))

## `summarise()` has grouped output by 'Diet'. You can override using the `.groups` argument.

kable(avg_protein, caption="Average protein level by diet and by week")

Average protein level by diet and by week
Diet	Week	avg_protein_value
Barley	1	3.886800
Barley	2	3.642500
Barley	3	3.498000
Barley	4	3.376400
Barley	5	3.484400
Barley	6	3.386400
Barley	7	3.468800
Barley	8	3.503200
Barley	9	3.511739
Barley	10	3.518800
Barley	11	3.455000
Barley	12	3.429200
Barley	13	3.512400
Barley	14	3.506800
Barley	15	3.541579
Barley	16	3.601765
Barley	17	3.682000
Barley	18	3.640667
Barley	19	3.640000
Lupins	1	3.758148
Lupins	2	3.427778
Lupins	3	3.372963
Lupins	4	3.294074
Lupins	5	3.238077
Lupins	6	3.280370
Lupins	7	3.187200
Lupins	8	3.310000
Lupins	9	3.347037
Lupins	10	3.268846
Lupins	11	3.232593
Lupins	12	3.213704
Lupins	13	3.333704
Lupins	14	3.254074
Lupins	15	3.263500
Lupins	16	3.265000
Lupins	17	3.252000
Lupins	18	3.302000
Lupins	19	3.205714
Mixed	1	3.861111
Mixed	2	3.540000
Mixed	3	3.345556
Mixed	4	3.277778
Mixed	5	3.337778
Mixed	6	3.392593
Mixed	7	3.333333
Mixed	8	3.397692
Mixed	9	3.435185
Mixed	10	3.437037
Mixed	11	3.354815
Mixed	12	3.374074
Mixed	13	3.410769
Mixed	14	3.372222
Mixed	15	3.446000
Mixed	16	3.570588
Mixed	17	3.511250
Mixed	18	3.450625
Mixed	19	3.395714

Converting datasets/tables between long and wide format

Great! But maybe a better format for the previous table would be a 19 x 3 table, where each row is a week, each column is a diet, and the values in the table are the average protein values for that week-diet combination. For this, we want to turn this long-format table (every row is a week-diet combination, so the average protein values are all in one long column) into a wide-format table (we break that column into multiple columns according to what diet they’re from, making the table wider). We’ll use the spread function from the tidyr package. Note that I also display fewer digits here.

avg_protein_wide <- avg_protein %>%
    # spread takes these arguments:
    # key=the variable to use for the names of the new columns in wide format,
    # value=the current variable/column whose values will be the values in those new columns
    spread(key=Diet, value=avg_protein_value)

kable(avg_protein_wide, digits=2, caption="Average protein level by diet and by week")

Average protein level by diet and by week
Week	Barley	Lupins	Mixed
1	3.89	3.76	3.86
2	3.64	3.43	3.54
3	3.50	3.37	3.35
4	3.38	3.29	3.28
5	3.48	3.24	3.34
6	3.39	3.28	3.39
7	3.47	3.19	3.33
8	3.50	3.31	3.40
9	3.51	3.35	3.44
10	3.52	3.27	3.44
11	3.46	3.23	3.35
12	3.43	3.21	3.37
13	3.51	3.33	3.41
14	3.51	3.25	3.37
15	3.54	3.26	3.45
16	3.60	3.27	3.57
17	3.68	3.25	3.51
18	3.64	3.30	3.45
19	3.64	3.21	3.40

Alternatively, the developers of tidyr have also created a replacement for spread called pivot_wider. You may find the pivot_wider function easier to understand.

avg_protein_wide_2 <- avg_protein %>%
  # pivot_wider takes these arguments:
  # names_from=the variable to use for the names of the new columns in wide format,
  # values_from=the current variable/column whose values will be the values in those new columns
  pivot_wider(names_from=Diet, values_from=avg_protein_value)

What if we have a wide-format dataset and we want to transform it to a long-format dataset? That means we want to gather multiple columns and transform them into one long column. Notice that right now we have four columns: Week, Barley, Lupins, and Mixed. The second through fourth columns are the average protein values for different diets. To make the long format data, we want to replace those three columns with two columns: a Diet column equal to “Barley”, “Lupins”, or “Mixed” (the names of the three columns we currently have), and an avg_protein_value column with the values that are currently in the three columns. (Note: here I’m using the same names as in the original long-format table avg_protein, but we could instead call them diet_type and avg_prot or some other intuitive names).

avg_protein_long <- avg_protein_wide %>%
  # gather takes these arguments:
  # key = name of the variable to be created
  # value = name of the variable measured in each column that we're combining (this is going to be the name of the new long column we're making)
  # the third argument is a vector of columns; see ?select (linked from ?gather) for more details
  # note that any variable not mentioned will be unchanged
  gather(key=Diet, value=avg_protein_value, Barley:Mixed) %>%
  # optional: sort in order by certain columns
  # here this line is not necessary, but often this is useful
  arrange(Diet, Week)
head(avg_protein_long)

## # A tibble: 6 x 3
##    Week Diet   avg_protein_value
##   <int> <chr>              <dbl>
## 1     1 Barley              3.89
## 2     2 Barley              3.64
## 3     3 Barley              3.50
## 4     4 Barley              3.38
## 5     5 Barley              3.48
## 6     6 Barley              3.39

Again, you could also use the alternative function called pivot_longer:

avg_protein_long_2 <- avg_protein_wide %>%
  # gather takes these arguments:
  # names_to = name of the variable to be created
  # values_to = name of the variable measured in each column that we're combining (this is going to be the name of the new long column we're making)
  # the cols argument is a vector of columns; see ?select (linked from ?gather) for more details
  # note that any variable not mentioned will be unchanged
  pivot_longer(cols = Barley:Mixed, names_to="Diet", values_to="avg_protein_value") %>%
  # optional: sort in order by certain columns
  # here this line is not necessary, but often this is useful
  arrange(Diet, Week)
head(avg_protein_long_2)

## # A tibble: 6 x 3
##    Week Diet   avg_protein_value
##   <int> <chr>              <dbl>
## 1     1 Barley              3.89
## 2     2 Barley              3.64
## 3     3 Barley              3.50
## 4     4 Barley              3.38
## 5     5 Barley              3.48
## 6     6 Barley              3.39

More practice

What happens if you change arrange(Diet, Week) to arrange(Week, Diet)?
What if you replace gather(key=Diet, value=avg_protein_value, Barley:Mixed) with gather(key=Diet, value=avg_protein_value, !Week)?
What if you replace gather(key=Diet, value=avg_protein_value, Barley:Mixed) with gather(key=Diet, value=avg_protein_value, c(Barley, Mixed))? Why does this no longer make sense as a table?

A plot

Let’s plot the protein levels (y-axis) by week (x-axis) for each cow (one line per cow), and we can color code the lines based on the cow’s diet.

library(ggplot2)
ggplot(data=milk, aes(x=Week, y=Protein, group=Cow, color=Diet)) +
    geom_point() +
    geom_line() +
    ggtitle("Protein levels by week for each cow")

Let’s pick this apart a bit. Notice that if you just plot the first line with the ggplot() command, it doesn’t plot anything! It just sets up the plotting area.

ggplot(data=milk, aes(x=Week, y=Protein, group=Cow, color=Diet))

So let’s store this as a plotting object called milk_graph and then add things to it one by one to see what’s going on. Run these lines yourself to see the results.

# Store the graph so far
milk_graph <- ggplot(data=milk, aes(x=Week, y=Protein, group=Cow, color=Diet))

# View what it looks like so far- should look the same as the blank plotting region above
milk_graph

# Let's add geom_point() - what does it do?
milk_graph +
  geom_point()

# What does geom_line() do?
milk_graph +
    geom_point() +
    geom_line()

# What does ggtitle() do?
milk_graph +
    geom_point() +
    geom_line() +
    ggtitle("Protein levels by week for each cow")

In this example, we are using the following:

Aesthetics (aes):
- x: $x$-axis, our time variable
- y: $y$-axis, our outcome variable
- group: our unit variable
- color: variable we want to use to segment by color
Layers (separated by +), each of which can take arguments:
- ggplot: base layer containing information about the data and overall aesthetics
- geom_point: scatterplot points at $x$ and $y$ values
- geom_line: line connecting values within each group
- ggtitle: title for the plot. xlab and ylab work similarly for labeling the axes if our variable names are not descriptive enough.

Additional aesthetics we can use are shape (symbols used for points), linetype, size (dot size/line thickness), alpha (transparency level, 0 for fully transparent and 1 for fully opaque), and fill (color for areas like bar charts).

We can change the general appearance of parts of the graph that aren’t related to specific variables (e.g. use large sized symbols for the geom_point layer). These are passed as non-aesthetic options to the layers; that is, in the specific layer and not enclosed in an aes() group in the ggplot() main layer.

For more information on modifying colors, legends, etc., see http://www.cookbook-r.com/Graphs/index.html. For specific R color names, see http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf. Shapes and linetypes can be modified in a similar way, with more information at http://www.cookbook-r.com/Graphs/Shapes_and_line_types.

Perhaps a nicer plot

ggplot2 gives us tons of options. For example, we can get rid of the points (no geom_point), get rid of the gray background (add theme_bw layer), and change the colors used (set scale_color_manual values). This is probably the best (cleanest and most useful) of the three plots so far:

# storing the graph to an object -- I can add more layers later!
milk_graph <- ggplot(data=milk, aes(x=Week, y=Protein, group=Cow, color=Diet)) +
    geom_line(alpha=0.5) +
    ggtitle("Protein levels by week for each cow") +
    theme_bw() +
    scale_color_manual(values=c("mediumorchid4", "darkgoldenrod1", "deepskyblue"))

milk_graph

More practice

We already computed the average protein level at each week across just the cows on a given diet. Plot this using the tools you learned above. This should give you a plot of average protein level versus week, with just three lines (each a different color, just one line for each diet since you’ve averaged over all the cows in each diet group).
Try changing some of your plot settings: for example, you might plot just lines, just points, or both; change the theme; or find out how to change the axis labels.

For numbers 3-5, start with the following plot:

# storing the graph to an object -- I can add more layers later!
milk_graph <- ggplot(data=milk, aes(x=Week, y=Protein, group=Cow, color=Diet)) +
    geom_line(alpha=0.5) +
    ggtitle("Protein levels by week for each cow") +
    theme_bw() +
    scale_color_manual(values=c("mediumorchid4", "darkgoldenrod1", "deepskyblue"))

milk_graph

What does the following code do?

milk_graph +
    facet_grid(. ~ Diet)

What does the following code do? How is the result different from the previous question?

milk_graph +
    facet_grid(Diet ~ .)

What does the following code do?

milk_graph +
    facet_wrap( ~ Cow, ncol=10)

Your turn (breakout rooms)

Work on the “More practice” sections above. Feel free to ask for further exercises or ask about additional skills.

A sample report (and inline code)

For this lab, there were two Rmd files; the other file is called RLab3-sample-report.Rmd. Let’s check out that file now. It’s meant to be a simple example of a report you could make with this data using R Markdown. The examples there are also ones we covered here, but you can see an example of formatting a final report in R Markdown to share to someone.

The code is run but not displayed in the final HTML file. Sometimes you might want to show select code or all your code depending on what the report is for or who the audience is. Sometimes you might give them both the HTML/PDF file and the Rmd file; in that case, often you might hide the code in the HTML/PDF and they can refer to the Rmd file if they want to see the details of your analysis.
If you give someone your Rmd, they can completely reproduce the file (and your analysis, if you were able to include all the necessary code in the Rmd file).
On lines 37 and 61, the sample report shows examples of an R Markdown feature we haven’t talked about yet: inline code. This can be very nice when you want the body of your text to summarize/report results from your analysis; if you update your code and those numbers change, normally you’d have to keep proofreading to make sure you updated the numbers everywhere they appear in the text. With inline code, R will automatically update those numbers with their new values.

Math Camp, Lab 3

Jess Kunke*

9/15/2021

Introduction

More practice

Today’s (and future) objectives

Reading and reorganizing the data

Exploring and summarizing data with `dplyr`

Converting datasets/tables between long and wide format

More practice

A plot

Perhaps a nicer plot

More practice

Your turn (breakout rooms)

A sample report (and inline code)

Math Camp, Lab 3

Jess Kunke*

9/15/2021

Introduction

More practice

Today’s (and future) objectives

Reading and reorganizing the data

Exploring and summarizing data with dplyr

Converting datasets/tables between long and wide format

More practice

A plot

Perhaps a nicer plot

More practice

Your turn (breakout rooms)

A sample report (and inline code)

Exploring and summarizing data with `dplyr`