Some notes following up on yesterday

Review of yesterday

Finding your way around the RStudio environment:

We learned how to store values under variable names: for example, x <- 3 creates a variable called x with the value 3. We can then use that variable in math operations:

x <- 3
x + 4
## [1] 7
# Notice: what is the value of x now: is it 3, or 7? Why?
x*10
## [1] 30
x-1
## [1] 2
x^2-1
## [1] 8
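To make the question in the comment above concrete, here is a small sketch (my own illustration, not something we ran yesterday). It shows that an expression like x + 4 only prints a result, while the stored value changes only when you assign again with <-.

```r
x <- 3
x + 4        # prints 7, but the result is not stored anywhere
x            # still 3
x <- x + 4   # assigning again overwrites the stored value
x            # now 7
```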

Let’s play with some data!

A lot of R functions don’t come with R itself but are available in separate packages (often loosely called libraries) that you can install as you need them. A number of these packages also come with freely available datasets. Start by loading the ggplot2 package with the library(ggplot2) command, and then let’s check out the mpg dataset. (Check out in the Rmd file how I got some of that text to have code-like font.)

install.packages("ggplot2") # only needed once; skip this if ggplot2 is already installed
library(ggplot2) # just for display, so you can see the command to run
# Note: in the Rmd file, how did I prevent RStudio from running this line when it knits the HTML file?
# I did that since I didn't want it to print all the messages and I already loaded the library in the setup chunk above.
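Since installing a package only has to happen once but loading it has to happen in every new session, one common pattern is to install only when the package is missing. This is a minimal sketch of that idea, not the only way to do it; it assumes you have an internet connection the first time it runs.

```r
# Install ggplot2 only if it is not already available on this computer.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Loading is still needed in every new R session.
library(ggplot2)
```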

Let’s explore this dataset and see what’s in it, how it was generated, etc. A good place to start is the command ?mpg. What does that do?

We can also use the following commands to check out the data structure some more. Feel free to add your own comments to the code chunk so you can make notes for your future self on what these commands do!

str(mpg)
summary(mpg)
head(mpg)
# compare behavior of tibble to data.frame object:
head(data.frame(mpg))
data(mpg) # loads the dataset into the current environment

We can see how big the dataset is too:

dim(mpg)
dim(mpg)[1] # what does this do?
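As a hint for the question in the comment: dim() returns a vector of two numbers, so square brackets can pull out either one. The sketch below compares those pieces with the shortcut functions nrow() and ncol(); the variable names dims, n_rows, and n_cols are just ones I made up for illustration.

```r
library(ggplot2)   # provides the mpg dataset

dims <- dim(mpg)   # a vector of length 2: rows first, then columns
n_rows <- dims[1]
n_cols <- dims[2]

n_rows == nrow(mpg)   # TRUE: the first entry is the number of rows
n_cols == ncol(mpg)   # TRUE: the second entry is the number of columns
```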

Now let’s explore the relationship between city mileage and highway mileage by plotting one versus the other with R’s built-in plot function:

plot(mpg$cty, mpg$hwy)
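The default plot uses the raw expressions as axis labels. Here is one possible way to tidy that up using the xlab, ylab, and main arguments of plot(); the label text itself is my own wording.

```r
library(ggplot2)   # provides the mpg dataset

# Same scatterplot as above, with readable axis labels and a title.
plot(mpg$cty, mpg$hwy,
     xlab = "City mileage (miles per gallon)",
     ylab = "Highway mileage (miles per gallon)",
     main = "Highway vs. city mileage")
```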

Let’s try extracting just individual numbers/values from this dataset.

mpg$year
mpg$year[3]
mpg$year[1:5]

How can you print just the first, fourth, and seventh entries of mpg$year? (Add a code chunk here to document the solution we discuss for your own notes.) Related to this discussion: how might you decide when to define something as its own variable versus just using it in the code?
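Here is one possible approach to the indexing question, shown on a made-up toy vector v so the result is easy to check by eye. The same pattern should carry over to mpg$year. It also illustrates the variable question: storing the positions in idx (a name I made up) is handy if you plan to reuse them.

```r
v <- c(10, 20, 30, 40, 50, 60, 70)   # toy data, not from mpg

# Combine the positions you want with c() and put them inside the brackets.
v[c(1, 4, 7)]   # 10 40 70

# Or store the positions in their own variable first.
idx <- c(1, 4, 7)
v[idx]          # 10 40 70
```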

How can you change the above code chunk so that your HTML file will show the output of those commands? What if you want the output but not the code itself?

An aside: R is what programmers call a “1-indexed” language because its indices start at 1. Some languages are “0-indexed” because they start counting from 0: the index 0 gives you the first element/row/column, the index 1 gives you the second, etc.
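A quick sketch of what 1-indexing looks like in practice, using a made-up toy vector. Note that asking R for index 0 does not give an error; it just returns an empty vector.

```r
v <- c("a", "b", "c")   # toy data

v[1]           # "a": the first element lives at index 1
v[0]           # character(0): index 0 selects nothing in R
v[length(v)]   # "c": the last element is at index length(v)
```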

So what if you realized there’s an error in the data and the third year should be 2010 instead of 2008? We can use the assignment operator <- as we did yesterday to fix that:

mpg$year[3] <- 2010

How do we know whether to put 2010 on the left and mpg$year[3] on the right or vice versa? Think of <- as an arrow that points from the value to the place where it gets stored: you want to store the value 2010 in the third entry of the year column, that is, assign 2010 to mpg$year[3].
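The same idea on a made-up toy vector, so you can see the before and after without touching mpg:

```r
years <- c(1999, 1999, 2008)   # toy data

years[3]           # 2008 before the fix
years[3] <- 2010   # the arrow points at the place being overwritten
years              # 1999 1999 2010

# Writing it the other way around (2010 <- years[3]) is an error,
# because 2010 is a value, not a place where something can be stored.
```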

What is the range of values of the year variable? How many different years are represented in this dataset? And how long is the column?

range(mpg$year)
## [1] 1999 2010
unique(mpg$year)
## [1] 1999 2010 2008
length(mpg$year)
## [1] 234
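If you want the number of distinct values rather than the values themselves, you can nest these functions. The sketch below uses a made-up toy vector; table() is an extra function, not covered above, that counts how often each value appears.

```r
years <- c(1999, 2010, 2008, 1999, 2008)   # toy data

n_years <- length(unique(years))   # count the distinct values
n_years                            # 3

table(years)   # how many entries there are for each distinct value
```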

Let’s add a new column to the data. Say we want the difference between the highway and city mileage.

diff_mileage <- hwy - cty # why does this fail? how do you fix it?
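One possible fix, once you have thought about the question in the comment: hwy and cty are columns inside mpg, not variables of their own, so R needs to be told where to find them. Putting mpg$ on the left-hand side as well stores the result as a new column rather than as a separate variable.

```r
library(ggplot2)   # provides the mpg dataset

# Refer to the columns through the dataset, and store the result as a new column.
mpg$diff_mileage <- mpg$hwy - mpg$cty

head(mpg$diff_mileage)   # first few differences
```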

Say we want to add a dummy variable for whether the car was manufactured in or after 2002, as opposed to before 2002. Let’s call this after2002.

mpg$after2002 <- mpg$year >= 2002
# what happens if you leave off the mpg$ at the start of the line above?
# also, why didn't this code print any output?

What is the variable type of after2002 (and how do you find that out)? How would you change it to integer?
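Here is a sketch of one way to check and convert a type, shown on a made-up toy vector rather than the real column. The names flag and flag_int are my own.

```r
flag <- c(1999, 2008, 2010) >= 2002   # toy comparison: FALSE TRUE TRUE

class(flag)   # "logical"

# as.integer() turns FALSE into 0 and TRUE into 1.
flag_int <- as.integer(flag)
flag_int          # 0 1 1
class(flag_int)   # "integer"
```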

Note: for the >= 2002 comparison to work as intended, year must be a numeric variable. Sometimes when you read in data, columns that are naturally numeric end up being read in as character (text) variables instead. Numeric and character are examples of what we call data types. So if you run into an error or a surprising result, try str(mpg) to verify the type of your variable.
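If a column does come in as text, as.numeric() is one way to convert it. This is a minimal sketch on a made-up toy vector, assuming every entry really is a number written as text.

```r
year_chr <- c("1999", "2008")   # toy data: numbers stored as text

is.numeric(year_chr)   # FALSE: these are character values

year_num <- as.numeric(year_chr)
is.numeric(year_num)   # TRUE
year_num >= 2002       # FALSE TRUE
```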

Let’s see how many cars were made in the year 2008:

sum(mpg$year=2008) # why doesn't this work? how do you fix it?
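A hint for the question in the comment, on a made-up toy vector: a single = is used for assignment and for naming function arguments, while == tests whether two things are equal. Since R counts TRUE as 1 and FALSE as 0, summing the comparison gives the number of matches.

```r
years <- c(1999, 2008, 2008, 1999)   # toy data

years == 2008        # FALSE TRUE TRUE FALSE
sum(years == 2008)   # 2: each TRUE counts as 1
```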

Your turn

Load the midwest dataset from the ggplot2 library.

  1. What variables are in this dataset? What types are they (integer, etc.)? Read a little about where this dataset came from and what the variables and their values mean.
  2. What are the dimensions of this dataset? What does each row represent?
  3. Pick two columns that make sense for a scatterplot and make a plot of one column versus the other. Format the plot as nicely as you can.
  4. Compute the mean of the area column.
  5. Add a column called popbw to the dataset that is the total number of white people and black people.
  6. Add a column called stateIL that equals 1 if the state is IL and 0 otherwise.
  7. Come up with your own column to add and figure out how to add it.
  8. What other questions might you want to ask about this dataset? How would you use R to do that, or what other skills might you need to learn in R in order to do that?