Let's start by learning about our tool.
Point out the different windows in R.
File
menu, click on New project
, choose New directory
, then
Empty project
~/data-carpentry
)Files
tab on the right of the screen, click on New Folder
and
create a folder named data
within your newly created working directory.
(e.g., ~/data-carpentry/data
)data-carpentry-script.R
)Your working directory should now look like this:
R is a versatile, open source programming/scripting language that's useful both for statistics but also data science. Inspired by the programming language S.
Use #
signs to comment. Comment liberally in your R scripts. Anything to the right of a #
is ignored by R.
<-
is the assignment operator. Assigns values on the right to objects on the left. Mostly similar to =
but not always. Learn to use <-
as it is good programming practice. Using =
in place of <-
can lead to issues down the line.
=
should only be used to specify the values of arguments in functions for instance read.csv(file="data/some_data.csv")
.
There are two main ways of interacting with R: using the console or by using script files (plain text files that contain your code).
The recommended approach when working on a data analysis project is dubbed “the source code is real”. The objects you are creating should be seen as disposable as they are the direct realization of your code. Every object in your analysis can be recreated from your code, and all steps are documented. Therefore, it is best to enter as little commands as possible in the R console. Instead, all code should be written in script files, and evaluated from there. The R console should be used to inspect objects, test a function or get help. With this approach, the .Rhistory
file automatically created during your session should not be very useful.
Similarly, you should separate the original data (raw data) from intermediate datasets that you may create for the need of a particular analysis. For instance, you may want to create a data/
directory within your working directory that stores the raw data, and have a data_output/
directory for intermediate datasets and a figure_output/
directory for the plots you will generate.
Start by showing an example of a script
<-
=
for arguments#
and how they are used to document function and its content$
operatorYou can get output from R simply by typing in math in the console
3 + 5
12/7
However, to do useful and interesting things, we need to assign values to objects. To create objects, we need to give it a name followed by the assignment operator <-
and the value we want to give it:
weight_kg <- 55
Objects can be given any name such as x
, current_temperature
, or subject_id
. You want your object names to be explicit and not too long. They cannot start with a number (2x
is not valid but x2
is). There are some names that cannot be used because they represent the names of fundamental functions in R (e.g., if
, else
, for
, see here for a complete list). In general, even if it's allowed, it's best to not use other function names (e.g., c
, T
, mean
, data
, df
, weights
), in doubt check the help to see if the name is already in use. It's also best to avoid .
as in my.dataset
. It is also recommended to use nouns for variable names, and verbs for function names.
When assigning a value to an object, R does not print anything. You can force to print the value by using parentheses or by typing the name:
(weight_kg <- 55)
weight_kg
Now that R has weight_kg
in memory, we can do arithmetic with it. For instance, we may want to convert this weight in pounds (weight in pounds is 2.2 times the weight in kg):
2.2 * weight_kg
We can also change a variable's value by assigning it a new one:
weight_kg <- 57.5
2.2 * weight_kg
This means that assigning a value to one variable does not change the values of other variables. For example, let's store the animal's weight in pounds in a variable.
weight_lb <- 2.2 * weight_kg
and then change weight_kg
to 100.
weight_kg <- 100
What do you think is the current content of the object weight_lb
? 126.5 or 200?
What are the values after each statement in the following?
mass <- 47.5 # mass?
age <- 122 # age?
mass <- mass * 2.0 # mass?
age <- age - 20 # age?
massIndex <- mass/age # massIndex?
A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It's a group of values, mainly either numbers or characters. You can assign this list of values to a variable, just like you would for one item. For example we can create a vector of animal weights:
weights <- c(50, 60, 65, 82)
weights
A vector can also contain characters:
animals <- c("mouse", "rat", "dog")
animals
There are many functions that allow you to inspect the content of a vector. length()
tells you how many elements are in a particular vector:
length(weights)
length(animals)
class()
indicates the class (the type of element) of an object:
class(weights)
class(animals)
The function str()
provides an overview of the object and the elements it contains. It is a really useful function when working with large and complex objects:
str(weights)
str(animals)
You can add elements to your vector simply by using the c()
function:
weights <- c(weights, 90) # adding at the end
weights <- c(30, weights) # adding at the beginning
weights
What happens here is that we take the original vector weights
, and we are adding another item first to the end of the other ones, and then another item at the beginning. We can do this over and over again to build a vector or a dataset. As we program, this may be useful to autoupdate results that we are collecting or calculating.
We just saw 2 of the 6 data types that R uses: "character"
and "numeric"
. The other 4 are:
"complex"
to represent complex numbers with real and imaginary parts (e.g., 1+4i
) and that's all we're going to say about them"raw"
that we won't discuss further"logical"
for TRUE
and FALSE
(the boolean data type)"integer"
for integer numbers (e.g., 2L
, the L
indicates to R that it's an integer)Vectors are one of the many data structures that are part of R. Other important ones are lists (list
), matrices (matrix
), data frames (data.frame
) and factors (factor
). We will talk about data.frame
soon but first we need to talk about factor
.
Factors are special vectors that represent categorical data. Factors can be ordered or unordered and are important for statistical analysis and for plotting.
Factors are stored as integers that have labels associated the unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created factors can only contain a pre-defined set values, known as levels. By default, R always sorts levels in alphabetical order, so if you have a factor with 2 levels, for instance
sex <- factor(c("male", "female", "female", "male"))
R will assign 1
to the level "female"
and 2
to the level "male"
(because f
comes before m
). You can check this by using the function levels()
, and check the number of levels using nlevels()
:
levels(sex)
nlevels(sex)
Sometimes, the order of the factors does not matter, other times you might want factors to be ordered (or ranked), because the order is meaningful or it is required by a particular type of analysis (e.g., low, medium, high).
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)
food <- factor(food, levels=c("low", "medium", "high"), ordered=TRUE)
levels(food)
In R's memory, these vectors of factors are represented by numbers 1, 2, 3. They are better than using simple integer labels because factors are what are called self describing: "male"
and "female"
“ is more descriptive than 1
s and 2
s. Which is male? 1
or 2
? You wouldn't be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels (like the species in our example data set), or when there is no additional metadata (for simple datasets, real datasets always have metadata, right?)
If you need to convert a factor to a character vector, simply use as.character(x)
.
Converting a factor to a numeric vector is however a little trickier, and you have to go via a character vector. Compare:
f <- factor(c(1,5,10,2))
as.numeric(f) ## wrong! and there will be no warnings...
as.numeric(as.character(f))
The function table()
tabulates observations and can be used to create
bar plots quickly. For instance:
exprmt <- factor(c("treat1", "treat2", "treat1", "treat3", "treat1", "control",
"control", "treat1", "treat2", "control", "treat3", "control"))
table(exprmt)
barplot(table(exprmt))
Question How can you recreate this plot but by having "control” being listed last instead of first?
We want to:
surveys.csv
) as an exampledata.frame
NA
data.frame
To do all that, we'll have to learn a little bit about programming.
We are studying the species and weight of animals caught in plots in our study
area. The dataset is stored as a .csv
file: each row holds information for a
single animal, and the columns represent survey_id
(represented as global
unique identifier, GUID), month
, day
, year
, plot
(represented as a
GUID), species
(a 2 letter code, see the species.csv
file for
correspondance), sex
(“M” for males and “F” for females), wgt
(the weight in
grams).
The first few rows of the survey dataset look like this:
"3a15270e-09c7-47a3-ab2e-0ffcbda1f162","8","19","1977","b564c1fc-bd87-4525-92ee-8bb36928fc10","DM","M","40"
"72ab0a58-4e6d-4aa7-b5c7-abb10d49c141","8","19","1977","a7498949-7b94-4f8e-b253-3e26bbf80710","DM","M","48"
"a174df84-6bf3-42bb-8dae-be3fd45566a7","8","19","1977","7dcaddf5-d272-4b69-a49c-d4f3cf000084","DM","F","29"
"3658f574-1f07-4601-b053-f8656cb731cf","8","19","1977","7dcaddf5-d272-4b69-a49c-d4f3cf000084","DM","F","46"
"0d17daf5-df94-475a-b57f-8aea2477cee3","8","19","1977","a7498949-7b94-4f8e-b253-3e26bbf80710","DM","M","36"
To load our survey data, we need to locate the surveys.csv
file. We will use setwd()
to tell R to set our working directory, in other words where to look for files and save outputs, and read.csv()
to load into memory (as a data.frame
) the content of the CSV file.
surveys <- read.csv('data/surveys.csv')
This statement doesn't produce any output because assignment doesn't display anything. If we want to check that our data has been loaded, we can print the variable's value:
surveys
Wow… that was a lot of output. At least it means the data loaded properly. Let's check the top (the first 6 lines) of this data.frame
using the function head()
:
head(surveys)
## survey_id month day year
## 1 8f58c260-4b8d-4efb-893c-84f0e4ce9e9a 7 16 1977
## 2 1445e762-ace8-4f00-b01d-d6f75a7ba753 7 16 1977
## 3 8c5fe0a9-1f7d-46d2-8e9d-237b310d3fde 7 16 1977
## 4 38ef8288-7db5-45df-9435-2f7ffe8a90ef 7 16 1977
## 5 4d9f6537-7aac-4713-a390-80877ef21f6f 7 16 1977
## 6 fae16612-97dc-4c09-bb74-ce0f3e2d3919 7 16 1977
## plot_id species sex wgt
## 1 aeece3f0-4acf-4aaa-87cd-e04ac6c8ad2c <NA> M NA
## 2 b564c1fc-bd87-4525-92ee-8bb36928fc10 <NA> M NA
## 3 aeece3f0-4acf-4aaa-87cd-e04ac6c8ad2c DM F NA
## 4 a7498949-7b94-4f8e-b253-3e26bbf80710 DM M NA
## 5 b564c1fc-bd87-4525-92ee-8bb36928fc10 DM M NA
## 6 b7eb122e-a7c4-4982-97a0-fa67e45e6faf PF M NA
At this point, make sure all participants have the data loaded
Let's now check the structure of this data.frame
in more details with the function str()
:
str(surveys)
Show how to get this information from the “Environment” tab in RStudio.
Based on the output of str(surveys)
, can you answer the following questions?
surveys
?data.frame
data.frame
is the de facto data structure for most tabular data and what we
use for statistics and plotting. They are a collection of vectors of identical
lengths (each representing a column), and each vector can be of a different data
type (e.g., characters, integers, factors). The str()
function is useful to
inspect the data types of the columns.
data.frame
can be created by the functions read.csv()
or read.table()
, in
other words, when importing spreadsheets from your hard drive (or the web).
By default, data.frame
converts (= coerces) columns that contain characters
(i.e., text) into the factor
data type. Depending on what you want to do with
the data, you may want to keep these columns as character
. To do so,
read.csv()
and read.table()
have an argument called stringsAsFactors
which
you can set to FALSE
:
some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)
You can also create data.frame
manually with the function data.frame()
. This
function can also take the argument stringsAsFactors
. Compare the output of
these examples:
Point to the difference between character and factor output with str()
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
feel=c("furry", "furry", "squishy", "spiny"),
weight=c(45, 8, 1.1, 0.8))
str(example_data)
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
feel=c("furry", "furry", "squishy", "spiny"),
weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
str(example_data)
Can you predict the class for each of the columns in the following example?
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
climate=c("cold", "hot", "temperate", "hot/temperate"),
temperature=c(10, 30, 18, "15"),
north_hemisphere=c(TRUE, TRUE, FALSE, FALSE),
has_placentals=c(TRUE, TRUE, TRUE, "FALSE"))
Check your gueses using str(country_climate)
. Are they what you expected?
Why? why not?
R coerces (when possible) to the data type that is the least common denominator and the easiest to coerce to.
There are a few mistakes in this hand crafted data.frame
, can you spot and
fix them? Don't hesitate to experiment!
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
author_last=c(Darwin, Mayr, Dobzhansky),
year=c(1942, 1970))
If we want to get a single value from the data frame, we must provide an index in square brackets, just as we do in math. For instance:
animals <- c("mouse", "rat", "dog", "cat")
animals[2]
## [1] "rat"
animals[c(3, 2)]
## [1] "dog" "rat"
animals[2:4]
## [1] "rat" "dog" "cat"
animals[c(1:3, 2:4)]
## [1] "mouse" "rat" "dog" "rat" "dog" "cat"
R indexes starting at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that's what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that's simpler for computers to do.
:
is a special function that creates numeric vectors of integer in increasing
or decreasing order, test 1:10
and 10:1
for instance. The function seq()
(for sequence) can create more complex patterns:
seq(1, 10, by=2)
seq(1, 10, length.out=4)
seq(1, by=5, length.out=100)
seq(1, 8, by=3) # sequence stops to stay below upper limit
Our survey data is has rows and columns, if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers.
surveys[1, 1] # first element in the first column of the data frame
surveys[1, 6] # first element in the 6th column
surveys[1:3, 7] # first three elements in the 7th column
surveys[3, ] # the 3rd elements for all columns
surveys[, 8] # the entire 8th column
head_surveys <- surveys[1:6, ] # surveys[1:6, ] is equivalent to head(surveys)
nrow()
on a data.frame
returns the number of rows. Use it,
in conjuction with seq()
to create a new data.frame
called
surveys_by_10
that includes every 10th row of the survey data frame
starting at row 10 (10, 20, 30, …)When analyzing data, though, we often want to look at partial statistics, such as the maximum value per species or the average value per plot.
One way to do this is to select the data we want to create a new temporary array, using the subset() function
Let's look at just the animals of species 'DO'
speciesDO <- subset(dat, species == 'DO')
## Error: object 'dat' not found
We could see in our table from before that 'DO' had 3027 species. Let's check to see if that's what we have by checking the number of rows
nrow(speciesDO)
## Error: object 'speciesDO' not found
EXERCISE
Calculate the mean and standard deviation of just the DO species
data[3:3, 4:4]
produce?
What about data[3:3, 4:1]
? Explain the results to the person sitting next to youNow that our data is in memory, we can start doing things with it. First, let's ask what type of thing data refers to:
class(dat)
## Error: object 'dat' not found
str(dat)
## Error: object 'dat' not found
summary(dat)
## Error: object 'dat' not found
The class output tells us that data currently is a data.frame in R.
This is similar to a spreadsheet in excel, that many of us are familiar with using.
The str output tells us what columns there are and what type they are.
The summary output summarizes our columns and shows us the range of our values.
Useful functions
head()
- see first 6 rowstail()
- see last 6 rowsdim()
- see dimensionsnrow()
- number of rowsncol()
- number of columnsstr()
- structure of each columnnames()
- will list the names attribute for a data frame (or any object really), which gives the column names.str
output tells us the dimensions and the data types (int is integer) of each column.
We can see what its shape is like this:
dim(dat)
## Error: object 'dat' not found
nrow(dat)
## Error: object 'dat' not found
ncol(dat)
## Error: object 'dat' not found
This tells us that data has 35549 rows and 8 columns.
Let's look at the structure of the data again
str(dat)
## Error: object 'dat' not found
If we look at 'plot' it says it's an integer. $ plot : int 2 3 2 7 3 1 2 1 1 6 and it is an integer, but in our case we want it to be a factor. Basically it's a category. We don't want to be able to do math with it. We want to be able to ask things about that category. So we're going to change it from an integer to a factor in our data frame.
dat$plot <- as.factor(dat$plot)
## Error: object 'dat' not found
We've gotten our data in to R, so that we can do some analysis with it. First, let's get a sense of our data We might for instance want to know how many animals we trapped in each plot, or how many of each species were caught.
We can look at just one column at a time in different ways. We can reference that column by it's number
# Look at the weight column, the 8th one
dat[,8]
## Error: object 'dat' not found
or by its name
# Look at the weight column, by its name wgt
dat$wgt
## Error: object 'dat' not found
If you forget the column names, you can type
colnames(dat)
## Error: object 'dat' not found
or
str(dat)
## Error: object 'dat' not found
will show you the column names and the type of data in them
If we do this, we can see for instance that there are 48 species Factor w/ 48 levels means there are 48 different versions of that factor
So, let's get a list of all the species. The 'unique' command tells us all the unique names in that column.
unique(dat$species)
## Error: object 'dat' not found
Now let's see how many of each species we have
table(dat$species)
## Error: object 'dat' not found
We could even assign it to a variable and make it a data frame to make it easier to look at
species_table <- as.data.frame(table(dat$species))
## Error: object 'dat' not found
Maybe we also want to see how many animals were captured in each plot
table(dat$plot)
## Error: object 'dat' not found
Now we want to do some actual calculations with the data though. Let's calculate the average weight of all the animals. R has a lot of built in statistical functions, like mean, median, max, min
mean(dat$wgt)
## Error: object 'dat' not found
Hmm, we just get NA. That's because we don't have the weight for every animal and it's recorded as NA when we don't. We can't do math on NA. Conveniently R provides a function na.omit() that will omit NAs from your data.
How many animals would we omit. We can look at how many animals we have overall and subtract how many we have after the NAs are omitted.
Because data is in a vector, when we want to know how much of something we have we ask how long it is with the length() function.
length(dat$wgt)
## Error: object 'dat' not found
length(na.omit(dat$wgt))
## Error: object 'dat' not found
We can then subtract those numbers
length(dat$wgt) - length(na.omit(dat$wgt))
## Error: object 'dat' not found
We can see we'll be omitting 3266 animals. Bummer, but not terrible when we've sampled over 35,000 animals.
Let's calculate their average weight
mean(na.omit(dat$wgt))
## Error: object 'dat' not found
It gets a little annoying to type na.omit(dat$wgt) each time we want to do the calculation, so we can actually create a new data frame with the rows that have NA omitted with the complete.cases() command. Don't worry too much about this. You can google it to learn more about it if you need to use it.
dat2 <- dat[complete.cases(dat$wgt),]
## Error: object 'dat' not found
mean(dat2$wgt)
## Error: object 'dat2' not found
EXERCISES
R has a bunch of handy statistical functions built in. Calculate the median, standard deviation, minimum and maximum weight. For bonus points calculate the standard error.
What if we need the maximum weight for all animals, or the average for each plot? As the diagram below shows, we want to perform the operation across an axis:
To support this, in R we can use the apply
or 'tapply' function:
tapply() takes a vector, so we'll use that
help(tapply) #or ?apply
Apply allows us to repeat a function on all of the rows (1), columns (2), or both(1:2) of an array or matrix.
What if you wanted to now go on and calculate the average weight of each species. You could do this one by one, but you can actually do it all at once with the tapply() function.
The format is
tapply(data_you_want_to_calculate, factor_to_sort_on, function)
tapply(dat2$wgt, dat2$species, mean)
## Error: object 'dat2' not found
Now we can put all the means into a variable
species_means <- tapply(dat2$wgt, dat2$species, mean)
## Error: object 'dat2' not found
We still get NAs. That's because of the way that R keeps track of NAs when you're converting data frames. There are ways to get around that for the apply function, but we can use the aggregate function instead, which we'll cover next.
species_max <- tapply(dat2$wgt, dat2$species, max)
## Error: object 'dat2' not found
species_min <- tapply(dat2$wgt, dat2$species, min)
## Error: object 'dat2' not found
length(species_max)
## Error: object 'species_max' not found
length(species_min)
## Error: object 'species_min' not found
Now that we have all this summary information, we can put it back together into a data frame that we can use for further analysis and plotting, provided they are the same length.
d.summary = data.frame(species_means, species_min, species_max)
## Error: object 'species_means' not found
We can also do this with the aggregate function, which deals with the NA rows that we eliminated more nicely. This is a very useful function, and it puts the output in to a data frame rather than a list. If you look at documentation for aggregate() there's a few different ways to write the function. This is one way.
The format is aggregate(what-to-plot~what-you-want-it-sorted-by, data=the-dataset, function)
mean_species <- aggregate(wgt~species, data=dat, mean)
## Error: object 'dat' not found
Maybe we want to look at the data average per species per plot
aggregate(wgt~species+plot, data=dat, mean)
## Error: object 'dat' not found
Or we just want to look at the average of particular species in each plot. Then we can subset the data within the function
aggregate(wgt~species+plot, data=subset(dat, species == "DO"), mean)
## Error: object 'dat' not found
EXERCISES
sd_species <- aggregate(wgt~species, data=dat, sd)
## Error: object 'dat' not found
The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of R's base plotting package and ggplot2 here.
Lets use the average species data that we saved and plot it.
R has built in plotting functions
barplot(mean_species$wgt, names.arg=mean_species$species)
## Error: object 'mean_species' not found
The axis labels are too big though, so you can't see them all. Let's change that
barplot(mean_species$wgt, names.arg=mean_species$species, cex.names=0.4)
## Error: object 'mean_species' not found
and change the color
barplot(mean_species$wgt, names.arg=mean_species$species, cex.names=0.4, col=c("blue"))
## Error: object 'mean_species' not found
EXERCISES
barplot(sd_species$wgt, names.arg=sd_species$species)
## Error: object 'sd_species' not found
plot_mean <- aggregate(wgt~plot, data=dat, mean)
## Error: object 'dat' not found
barplot(plot_mean$wgt, names.arg=plot_mean$species, col=c("red"))
## Error: object 'plot_mean' not found
There's lots of different ways to plot things. You can use
help(barplot)
or search online
There's also a plotting package called ggplot that adds a lot of functionality. I'm not going to go through it, but you can see a gallery of what's possible for plotting with ggplot.
Basically, you can do almost anything, and you can spend infinite time refining it.
If you wanted to output this plot do a pdf rather than to the screen, you can specify where you want the plot to go with the 'pdf' command. If you wanted it to be a jpeg, you would set it as 'jpeg'
Be sure to add the 'dev.off()' command at the end. That command makes it so that the plots go back to getting printed within R. Otherwise every new plot you make will get printed to that pdf.
pdf("R_plot.pdf")
barplot(mean_species$wgt, names.arg=mean_species$species, cex.names=0.4, col=c("blue"))
## Error: object 'mean_species' not found
dev.off()
Package management
install.packages("package-name")
will download a package from one of the CRAN mirrors assuming that a binary is available for your operating system. If you have not set a preferred CRAN mirror in your options(), then a menu will pop up asking you to choose a location.
Use old.packages()
to list all your locally installed packages that are now out of date. update.packages()
will update all packages in the known libraries interactively. This can take a while if you haven't done it recently. To update everything without any user intervention, use the ask = FALSE
argument.
In RStudio, you can also do package management through Tools -> Install Packages.
Updating packages can sometimes make changes, so if you already have a lot of code in R, don't run this now. Otherwise let's just go ahead and update our pacakges so things are up to date.
update.packages(ask = FALSE)
Words are useful, but what's more useful are the sentences and stories we use them to build. Similarly, while a lot of powerful tools are built into languages like R, even more lives in the libraries they are used to build. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Once it's done, we can ask the library to do things for us.
variable = value
to assign a value to a variable in order to record it in memory.dim()
gives the dimensions of a data frame or matrix.object[x, y]
to select a single element from an array.mean()
, max()
, min()
and sd()
to calculate simple statistics.ggplot2
library for creating simple visualizations.