Data Carpentry R materials


Motivations for this lesson

Presentation of RStudio

Let's start by learning about our tool.

Point out the different windows in R.

Before we get started

Your working directory should now look like this:

How it should look like at the beginning of this lesson

Basics of R

R is a versatile, open source programming/scripting language that's useful both for statistics but also data science. Inspired by the programming language S.

Commenting

Use # signs to comment. Comment liberally in your R scripts. Anything to the right of a # is ignored by R.

Assignment operator

<- is the assignment operator. Assigns values on the right to objects on the left. Mostly similar to = but not always. Learn to use <- as it is good programming practice. Using = in place of <- can lead to issues down the line.

= should only be used to specify the values of arguments in functions for instance read.csv(file="data/some_data.csv").

Good practices

There are two main ways of interacting with R: using the console or by using script files (plain text files that contain your code).

The recommended approach when working on a data analysis project is dubbed “the source code is real”. The objects you are creating should be seen as disposable as they are the direct realization of your code. Every object in your analysis can be recreated from your code, and all steps are documented. Therefore, it is best to enter as little commands as possible in the R console. Instead, all code should be written in script files, and evaluated from there. The R console should be used to inspect objects, test a function or get help. With this approach, the .Rhistory file automatically created during your session should not be very useful.

Similarly, you should separate the original data (raw data) from intermediate datasets that you may create for the need of a particular analysis. For instance, you may want to create a data/ directory within your working directory that stores the raw data, and have a data_output/ directory for intermediate datasets and a figure_output/ directory for the plots you will generate.

Introduction to R

Objectives

The R syntax

Start by showing an example of a script

Example of a simple R script

Creating objects

You can get output from R simply by typing in math in the console

3 + 5
12/7

However, to do useful and interesting things, we need to assign values to objects. To create objects, we need to give it a name followed by the assignment operator <- and the value we want to give it:

weight_kg <- 55

Objects can be given any name such as x, current_temperature, or subject_id. You want your object names to be explicit and not too long. They cannot start with a number (2x is not valid but x2 is). There are some names that cannot be used because they represent the names of fundamental functions in R (e.g., if, else, for, see here for a complete list). In general, even if it's allowed, it's best to not use other function names (e.g., c, T, mean, data, df, weights), in doubt check the help to see if the name is already in use. It's also best to avoid . as in my.dataset. It is also recommended to use nouns for variable names, and verbs for function names.

When assigning a value to an object, R does not print anything. You can force to print the value by using parentheses or by typing the name:

(weight_kg <- 55)
weight_kg

Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight in pounds (weight in pounds is 2.2 times the weight in kg):

2.2 * weight_kg

We can also change a variable's value by assigning it a new one:

weight_kg <- 57.5
2.2 * weight_kg

This means that assigning a value to one variable does not change the values of other variables. For example, let's store the animal's weight in pounds in a variable.

weight_lb <- 2.2 * weight_kg

and then change weight_kg to 100.

weight_kg <- 100

What do you think is the current content of the object weight_lb? 126.5 or 200?

Exercise

What are the values after each statement in the following?

mass <- 47.5          # mass?
age  <- 122           # age?
mass <- mass * 2.0    # mass?
age  <- age - 20      # age?
massIndex <- mass/age # massIndex?

Vectors and data types

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It's a group of values, mainly either numbers or characters. You can assign this list of values to a variable, just like you would for one item. For example we can create a vector of animal weights:

weights <- c(50, 60, 65, 82)
weights

A vector can also contain characters:

animals <- c("mouse", "rat", "dog")
animals

There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:

length(weights)
length(animals)

class() indicates the class (the type of element) of an object:

class(weights)
class(animals)

The function str() provides an overview of the object and the elements it contains. It is a really useful function when working with large and complex objects:

str(weights)
str(animals)

You can add elements to your vector simply by using the c() function:

weights <- c(weights, 90) # adding at the end
weights <- c(30, weights) # adding at the beginning
weights

What happens here is that we take the original vector weights, and we are adding another item first to the end of the other ones, and then another item at the beginning. We can do this over and over again to build a vector or a dataset. As we program, this may be useful to autoupdate results that we are collecting or calculating.

We just saw 2 of the 6 data types that R uses: "character" and "numeric". The other 4 are:

Vectors are one of the many data structures that are part of R. Other important ones are lists (list), matrices (matrix), data frames (data.frame) and factors (factor). We will talk about data.frame soon but first we need to talk about factor.

Factors

Factors are special vectors that represent categorical data. Factors can be ordered or unordered and are important for statistical analysis and for plotting.

Factors are stored as integers that have labels associated the unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

Once created factors can only contain a pre-defined set values, known as levels. By default, R always sorts levels in alphabetical order, so if you have a factor with 2 levels, for instance

sex <- factor(c("male", "female", "female", "male"))

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m). You can check this by using the function levels(), and check the number of levels using nlevels():

levels(sex)
nlevels(sex)

Sometimes, the order of the factors does not matter, other times you might want factors to be ordered (or ranked), because the order is meaningful or it is required by a particular type of analysis (e.g., low, medium, high).

food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)
food <- factor(food, levels=c("low", "medium", "high"), ordered=TRUE)
levels(food)

In R's memory, these vectors of factors are represented by numbers 1, 2, 3. They are better than using simple integer labels because factors are what are called self describing: "male" and "female"“ is more descriptive than 1s and 2s. Which is male? 1 or 2? You wouldn't be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels (like the species in our example data set), or when there is no additional metadata (for simple datasets, real datasets always have metadata, right?)

If you need to convert a factor to a character vector, simply use as.character(x).

Converting a factor to a numeric vector is however a little trickier, and you have to go via a character vector. Compare:

f <- factor(c(1,5,10,2))
as.numeric(f) ## wrong! and there will be no warnings...
as.numeric(as.character(f))

Exercise

The function table() tabulates observations and can be used to create bar plots quickly. For instance:

exprmt <- factor(c("treat1", "treat2", "treat1", "treat3", "treat1", "control",
                   "control", "treat1", "treat2", "control", "treat3", "control"))
table(exprmt)
barplot(table(exprmt))

plot of chunk unnamed-chunk-20

Question How can you recreate this plot but by having "control” being listed last instead of first?

Importing the survey data

Objectives

We want to:

To do all that, we'll have to learn a little bit about programming.

Presentation of the Survey Data

We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a .csv file: each row holds information for a single animal, and the columns represent survey_id (represented as global unique identifier, GUID), month, day, year, plot (represented as a GUID), species (a 2 letter code, see the species.csv file for correspondance), sex (“M” for males and “F” for females), wgt (the weight in grams).

The first few rows of the survey dataset look like this:

"3a15270e-09c7-47a3-ab2e-0ffcbda1f162","8","19","1977","b564c1fc-bd87-4525-92ee-8bb36928fc10","DM","M","40"
"72ab0a58-4e6d-4aa7-b5c7-abb10d49c141","8","19","1977","a7498949-7b94-4f8e-b253-3e26bbf80710","DM","M","48"
"a174df84-6bf3-42bb-8dae-be3fd45566a7","8","19","1977","7dcaddf5-d272-4b69-a49c-d4f3cf000084","DM","F","29"
"3658f574-1f07-4601-b053-f8656cb731cf","8","19","1977","7dcaddf5-d272-4b69-a49c-d4f3cf000084","DM","F","46"
"0d17daf5-df94-475a-b57f-8aea2477cee3","8","19","1977","a7498949-7b94-4f8e-b253-3e26bbf80710","DM","M","36"

To load our survey data, we need to locate the surveys.csv file. We will use setwd() to tell R to set our working directory, in other words where to look for files and save outputs, and read.csv() to load into memory (as a data.frame) the content of the CSV file.

surveys <- read.csv('data/surveys.csv')

This statement doesn't produce any output because assignment doesn't display anything. If we want to check that our data has been loaded, we can print the variable's value:

surveys

Wow… that was a lot of output. At least it means the data loaded properly. Let's check the top (the first 6 lines) of this data.frame using the function head():

head(surveys)
##                              survey_id month day year
## 1 8f58c260-4b8d-4efb-893c-84f0e4ce9e9a     7  16 1977
## 2 1445e762-ace8-4f00-b01d-d6f75a7ba753     7  16 1977
## 3 8c5fe0a9-1f7d-46d2-8e9d-237b310d3fde     7  16 1977
## 4 38ef8288-7db5-45df-9435-2f7ffe8a90ef     7  16 1977
## 5 4d9f6537-7aac-4713-a390-80877ef21f6f     7  16 1977
## 6 fae16612-97dc-4c09-bb74-ce0f3e2d3919     7  16 1977
##                                plot_id species sex wgt
## 1 aeece3f0-4acf-4aaa-87cd-e04ac6c8ad2c    <NA>   M  NA
## 2 b564c1fc-bd87-4525-92ee-8bb36928fc10    <NA>   M  NA
## 3 aeece3f0-4acf-4aaa-87cd-e04ac6c8ad2c      DM   F  NA
## 4 a7498949-7b94-4f8e-b253-3e26bbf80710      DM   M  NA
## 5 b564c1fc-bd87-4525-92ee-8bb36928fc10      DM   M  NA
## 6 b7eb122e-a7c4-4982-97a0-fa67e45e6faf      PF   M  NA

At this point, make sure all participants have the data loaded

Let's now check the structure of this data.frame in more details with the function str():

str(surveys)

Show how to get this information from the “Environment” tab in RStudio.

Exercise

Based on the output of str(surveys), can you answer the following questions?

About data.frame

data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. They are a collection of vectors of identical lengths (each representing a column), and each vector can be of a different data type (e.g., characters, integers, factors). The str() function is useful to inspect the data types of the columns.

data.frame can be created by the functions read.csv() or read.table(), in other words, when importing spreadsheets from your hard drive (or the web).

By default, data.frame converts (= coerces) columns that contain characters (i.e., text) into the factor data type. Depending on what you want to do with the data, you may want to keep these columns as character. To do so, read.csv() and read.table() have an argument called stringsAsFactors which you can set to FALSE:

some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)

You can also create data.frame manually with the function data.frame(). This function can also take the argument stringsAsFactors. Compare the output of these examples:

Point to the difference between character and factor output with str()

example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8))
str(example_data)
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
str(example_data)

Exercises

  1. Can you predict the class for each of the columns in the following example?

    country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
                                     climate=c("cold", "hot", "temperate", "hot/temperate"),
                                     temperature=c(10, 30, 18, "15"),
                                     north_hemisphere=c(TRUE, TRUE, FALSE, FALSE),
                                     has_placentals=c(TRUE, TRUE, TRUE, "FALSE"))
    

    Check your gueses using str(country_climate). Are they what you expected? Why? why not?

    R coerces (when possible) to the data type that is the least common denominator and the easiest to coerce to.

  2. There are a few mistakes in this hand crafted data.frame, can you spot and fix them? Don't hesitate to experiment!

    author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
                                 author_last=c(Darwin, Mayr, Dobzhansky),
                                 year=c(1942, 1970))
    

Indexing

If we want to get a single value from the data frame, we must provide an index in square brackets, just as we do in math. For instance:

animals <- c("mouse", "rat", "dog", "cat")
animals[2]
## [1] "rat"
animals[c(3, 2)]
## [1] "dog" "rat"
animals[2:4]
## [1] "rat" "dog" "cat"
animals[c(1:3, 2:4)]
## [1] "mouse" "rat"   "dog"   "rat"   "dog"   "cat"

R indexes starting at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that's what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do.

: is a special function that creates numeric vectors of integer in increasing or decreasing order, test 1:10 and 10:1 for instance. The function seq() (for sequence) can create more complex patterns:

seq(1, 10, by=2)
seq(1, 10, length.out=4)
seq(1, by=5, length.out=100)
seq(1, 8, by=3) # sequence stops to stay below upper limit

Our survey data is has rows and columns, if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers.

surveys[1, 1]   # first element in the first column of the data frame
surveys[1, 6]   # first element in the 6th column
surveys[1:3, 7] # first three elements in the 7th column
surveys[3, ]    # the 3rd elements for all columns
surveys[, 8]    # the entire 8th column
head_surveys <- surveys[1:6, ] # surveys[1:6, ] is equivalent to head(surveys)

Exercices

  1. The function nrow() on a data.frame returns the number of rows. Use it, in conjuction with seq() to create a new data.frame called surveys_by_10 that includes every 10th row of the survey data frame starting at row 10 (10, 20, 30, …)

Statistics on subsets of data

When analyzing data, though, we often want to look at partial statistics, such as the maximum value per species or the average value per plot.

One way to do this is to select the data we want to create a new temporary array, using the subset() function

Let's look at just the animals of species 'DO'

speciesDO <- subset(dat, species == 'DO')
## Error: object 'dat' not found

We could see in our table from before that 'DO' had 3027 species. Let's check to see if that's what we have by checking the number of rows

nrow(speciesDO)
## Error: object 'speciesDO' not found

EXERCISE

Calculate the mean and standard deviation of just the DO species

  1. If data holds our array of survey data, what does data[3:3, 4:4] produce? What about data[3:3, 4:1]? Explain the results to the person sitting next to you

Manipulating Data

Now that our data is in memory, we can start doing things with it. First, let's ask what type of thing data refers to:

class(dat)
## Error: object 'dat' not found
str(dat)
## Error: object 'dat' not found
summary(dat)
## Error: object 'dat' not found

The class output tells us that data currently is a data.frame in R.

This is similar to a spreadsheet in excel, that many of us are familiar with using.

The str output tells us what columns there are and what type they are.

The summary output summarizes our columns and shows us the range of our values.

Useful functions

str output tells us the dimensions and the data types (int is integer) of each column.

We can see what its shape is like this:

dim(dat)
## Error: object 'dat' not found
nrow(dat)
## Error: object 'dat' not found
ncol(dat)
## Error: object 'dat' not found

This tells us that data has 35549 rows and 8 columns.

Let's look at the structure of the data again

str(dat)
## Error: object 'dat' not found

If we look at 'plot' it says it's an integer. $ plot : int 2 3 2 7 3 1 2 1 1 6 and it is an integer, but in our case we want it to be a factor. Basically it's a category. We don't want to be able to do math with it. We want to be able to ask things about that category. So we're going to change it from an integer to a factor in our data frame.

dat$plot <- as.factor(dat$plot)
## Error: object 'dat' not found

Calculating statistics

We've gotten our data in to R, so that we can do some analysis with it. First, let's get a sense of our data We might for instance want to know how many animals we trapped in each plot, or how many of each species were caught.

We can look at just one column at a time in different ways. We can reference that column by it's number

# Look at the weight column, the 8th one
dat[,8]
## Error: object 'dat' not found

or by its name

# Look at the weight column, by its name wgt
dat$wgt
## Error: object 'dat' not found

If you forget the column names, you can type

colnames(dat)
## Error: object 'dat' not found

or

str(dat)
## Error: object 'dat' not found

will show you the column names and the type of data in them

If we do this, we can see for instance that there are 48 species Factor w/ 48 levels means there are 48 different versions of that factor

So, let's get a list of all the species. The 'unique' command tells us all the unique names in that column.

unique(dat$species)
## Error: object 'dat' not found

Now let's see how many of each species we have

table(dat$species)
## Error: object 'dat' not found

We could even assign it to a variable and make it a data frame to make it easier to look at

species_table <- as.data.frame(table(dat$species))
## Error: object 'dat' not found

Maybe we also want to see how many animals were captured in each plot

table(dat$plot)
## Error: object 'dat' not found

Now we want to do some actual calculations with the data though. Let's calculate the average weight of all the animals. R has a lot of built in statistical functions, like mean, median, max, min

mean(dat$wgt)
## Error: object 'dat' not found

Hmm, we just get NA. That's because we don't have the weight for every animal and it's recorded as NA when we don't. We can't do math on NA. Conveniently R provides a function na.omit() that will omit NAs from your data.

How many animals would we omit. We can look at how many animals we have overall and subtract how many we have after the NAs are omitted.

Because data is in a vector, when we want to know how much of something we have we ask how long it is with the length() function.

length(dat$wgt)
## Error: object 'dat' not found
length(na.omit(dat$wgt))
## Error: object 'dat' not found

We can then subtract those numbers

length(dat$wgt) - length(na.omit(dat$wgt))
## Error: object 'dat' not found

We can see we'll be omitting 3266 animals. Bummer, but not terrible when we've sampled over 35,000 animals.

Let's calculate their average weight

mean(na.omit(dat$wgt))
## Error: object 'dat' not found

It gets a little annoying to type na.omit(dat$wgt) each time we want to do the calculation, so we can actually create a new data frame with the rows that have NA omitted with the complete.cases() command. Don't worry too much about this. You can google it to learn more about it if you need to use it.

dat2 <- dat[complete.cases(dat$wgt),]
## Error: object 'dat' not found
mean(dat2$wgt)
## Error: object 'dat2' not found

EXERCISES

R has a bunch of handy statistical functions built in. Calculate the median, standard deviation, minimum and maximum weight. For bonus points calculate the standard error.

FUNCTIONS - Operations Across Axes

What if we need the maximum weight for all animals, or the average for each plot? As the diagram below shows, we want to perform the operation across an axis:

To support this, in R we can use the apply or 'tapply' function: tapply() takes a vector, so we'll use that

help(tapply) #or ?apply

Apply allows us to repeat a function on all of the rows (1), columns (2), or both(1:2) of an array or matrix.

What if you wanted to now go on and calculate the average weight of each species. You could do this one by one, but you can actually do it all at once with the tapply() function.

The format is

tapply(data_you_want_to_calculate, factor_to_sort_on, function)

tapply(dat2$wgt, dat2$species, mean)
## Error: object 'dat2' not found

Now we can put all the means into a variable

species_means <- tapply(dat2$wgt, dat2$species, mean)
## Error: object 'dat2' not found

We still get NAs. That's because of the way that R keeps track of NAs when you're converting data frames. There are ways to get around that for the apply function, but we can use the aggregate function instead, which we'll cover next.

Challenge

  1. Find the maximum and minimum values for weight for each species
  2. Save these values to a varible.
  3. What is the length of your new variable?
species_max <- tapply(dat2$wgt, dat2$species, max)
## Error: object 'dat2' not found
species_min <- tapply(dat2$wgt, dat2$species, min)
## Error: object 'dat2' not found
length(species_max)
## Error: object 'species_max' not found
length(species_min)
## Error: object 'species_min' not found

Now that we have all this summary information, we can put it back together into a data frame that we can use for further analysis and plotting, provided they are the same length.

d.summary = data.frame(species_means, species_min, species_max)
## Error: object 'species_means' not found

We can also do this with the aggregate function, which deals with the NA rows that we eliminated more nicely. This is a very useful function, and it puts the output in to a data frame rather than a list. If you look at documentation for aggregate() there's a few different ways to write the function. This is one way.

The format is aggregate(what-to-plot~what-you-want-it-sorted-by, data=the-dataset, function)

mean_species <- aggregate(wgt~species, data=dat, mean)
## Error: object 'dat' not found

Maybe we want to look at the data average per species per plot

aggregate(wgt~species+plot, data=dat, mean)
## Error: object 'dat' not found

Or we just want to look at the average of particular species in each plot. Then we can subset the data within the function

aggregate(wgt~species+plot, data=subset(dat, species == "DO"), mean)
## Error: object 'dat' not found

EXERCISES

  1. Create a data frame with the standard deviation of weight for each species
sd_species <- aggregate(wgt~species, data=dat, sd)
## Error: object 'dat' not found

Plotting

The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of R's base plotting package and ggplot2 here.

Lets use the average species data that we saved and plot it.

R has built in plotting functions

barplot(mean_species$wgt, names.arg=mean_species$species)
## Error: object 'mean_species' not found

The axis labels are too big though, so you can't see them all. Let's change that

barplot(mean_species$wgt, names.arg=mean_species$species, cex.names=0.4)
## Error: object 'mean_species' not found

and change the color

barplot(mean_species$wgt, names.arg=mean_species$species, cex.names=0.4, col=c("blue"))
## Error: object 'mean_species' not found

EXERCISES

  1. Create a plot showing the standard deviation of the species data
barplot(sd_species$wgt, names.arg=sd_species$species)
## Error: object 'sd_species' not found
  1. Calculate the average weight by plot id
plot_mean <- aggregate(wgt~plot, data=dat, mean)
## Error: object 'dat' not found
  1. Plot the average weight by plot and make the bars red
barplot(plot_mean$wgt, names.arg=plot_mean$species, col=c("red"))
## Error: object 'plot_mean' not found

There's lots of different ways to plot things. You can use

help(barplot)

or search online

There's also a plotting package called ggplot that adds a lot of functionality. I'm not going to go through it, but you can see a gallery of what's possible for plotting with ggplot.

Basically, you can do almost anything, and you can spend infinite time refining it.

If you wanted to output this plot do a pdf rather than to the screen, you can specify where you want the plot to go with the 'pdf' command. If you wanted it to be a jpeg, you would set it as 'jpeg'

Be sure to add the 'dev.off()' command at the end. That command makes it so that the plots go back to getting printed within R. Otherwise every new plot you make will get printed to that pdf.

pdf("R_plot.pdf")
barplot(mean_species$wgt, names.arg=mean_species$species, cex.names=0.4, col=c("blue"))
## Error: object 'mean_species' not found
dev.off()

Package management

install.packages("package-name") will download a package from one of the CRAN mirrors assuming that a binary is available for your operating system. If you have not set a preferred CRAN mirror in your options(), then a menu will pop up asking you to choose a location.

Use old.packages() to list all your locally installed packages that are now out of date. update.packages() will update all packages in the known libraries interactively. This can take a while if you haven't done it recently. To update everything without any user intervention, use the ask = FALSE argument.

In RStudio, you can also do package management through Tools -> Install Packages.

Updating packages can sometimes make changes, so if you already have a lot of code in R, don't run this now. Otherwise let's just go ahead and update our pacakges so things are up to date.

update.packages(ask = FALSE)

Objectives

Loading Data


Words are useful, but what's more useful are the sentences and stories we use them to build. Similarly, while a lot of powerful tools are built into languages like R, even more lives in the libraries they are used to build. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Once it's done, we can ask the library to do things for us.

Key Points