Learning Objectives
- being able to subset for a level
- being able to extract particular values from a
data.frame
- being able to add/remove columns and rows
- being able to export our data back to CSV
To refresh our memories from yesterday, let’s all begin with an exercise:
data.frame
In particular for larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Are species names in column 5 or 7? oh, right… they are in column 6). In some cases, in which column the variable will be can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and your intentions clearer.
You can do operations on a particular column, by selecting it using the $
sign. In this case, the entire column is a vector. For instance, to extract all the weights from our datasets, we can use: surveys$wgt
. You can use names(surveys)
or colnames(surveys)
to remind yourself of the column names.
In some cases, you may want to select more than one column. You can do this using square brackets: surveys[, c("wgt", "sex")]
.
When analyzing data, though, we often want to look at partial statistics, such as the maximum value of a variable per species or the average value per plot.
We’ve already seen how to select data with logical indexing, using a logical expression such as surveys[surveys$species=="DO",]
.
Another nice way to do this is the subset()
function. For instance, if we just want to look at the animals of the species “DO”:
surveys_DO <- subset(surveys, species == "DO")
## What does the following do? (Try to guess without executing it)
surveys_DO$month[2] <- 8
subset
twice to create a data.frame
that contains all individuals of the species “DM” that were collected in 2002.Question: How many individuals of the species “DM” were collected in 2002?
&
(to mean AND), |
(to mean OR), and !
(to mean NOT). Use the &
operator to find the same data subset as in the previous question, without needing to use the subset()
function more than once.Sometimes, you may have to add a new column to your dataset that represents a new variable. You can add columns to a data.frame
using the function cbind()
(column bind). Beware, the additional column must have the same number of elements as there are rows in the data.frame
.
In our survey dataset, the species are represented by a 2-letter code (e.g., “AB”), however, we would like to include the species name. The correspondance between the 2 letter code and the names are in the file species.csv
. In this file, one column includes the genus and another includes the species. First, we are going to import this file in memory:
species <- read.csv("data/species.csv")
We are then going to use the function match()
to create a vector that contains the genus names for all our observations. The function match()
takes at least 2 arguments: the values to be matched (in our case the 2 letter code from the surveys
data frame held in the column species
), and the table that contains the values to be matched against (in our case the 2 letter code in the species
data frame held in the column species_id
). The function returns the position of the matches in the table, and this can be used to retrieve the genus names:
surveys_spid_index <- match(surveys$species, species$species_id)
surveys_genera <- species$genus[surveys_spid_index]
Now that we have our vector of genus names, we can add it as a new column to our surveys
object:
surveys <- cbind(surveys, genus=surveys_genera)
Use the same approach to also include the species names in the surveys
data frame.
## and check out the result
head(surveys)
## record_id month day year plot species sex wgt genus species_name
## 1 1 7 16 1977 2 NL M NA Neotoma albigula
## 2 2 7 16 1977 3 NL M NA Neotoma albigula
## 3 3 7 16 1977 2 DM F NA Dipodomys merriami
## 4 4 7 16 1977 7 DM M NA Dipodomys merriami
## 5 5 7 16 1977 3 DM M NA Dipodomys merriami
## 6 6 7 16 1977 1 PF M NA Perognathus flavus
Let’s create a data.frame
that contains the information only for the species “DO” and “DM”. We know how to create the data set for each species with the function subset()
:
surveys_DO <- subset(surveys, species == "DO")
surveys_DM <- subset(surveys, species == "DM")
Similarly to cbind()
for columns, there is a function rbind()
(row bind) that puts together two data.frame
. With rbind()
the number of columns and their names must be identical between the two objects:
surveys_DO_DM <- rbind(surveys_DO, surveys_DM)
Using a similar approach, construct a new data.frame
that only includes data for the years 2000 and 2001.
Just like you can select columns by their positions in the data.frame
or by their names, you can remove them similarly.
To remove it by column number:
surveys_noDate <- surveys[, -c(2:4)]
colnames(surveys)
## [1] "record_id" "month" "day" "year"
## [5] "plot" "species" "sex" "wgt"
## [9] "genus" "species_name"
colnames(surveys_noDate)
## [1] "record_id" "plot" "species" "sex"
## [5] "wgt" "genus" "species_name"
The easiest way to remove by name is to use the subset()
function. This time we need to specify explicitly the argument select
as the default is to subset on rows (as above). The minus sign indicates the names of the columns to remove (note that the column names should not be quoted):
surveys_noDate2 <- subset(surveys, select=-c(month, day, year))
colnames(surveys_noDate2)
## [1] "record_id" "plot" "species" "sex"
## [5] "wgt" "genus" "species_name"
Typically rows are not associated with names, so to remove them from the data.frame
, you can do:
surveys_missingRows <- surveys[-c(10, 50:70), ] # removing rows 10, and 50 to 70
Now that we’ve made changes to our data frame, let’s see how to write it back out to a CSV file. We read the file in with read.csv()
, and we can write the data out with write.csv()
(or more generally, write.table()
).
Let’s try to write out our data frame with some rows removed:
write.csv(surveys_missingRows, 'data/surveys_missingrows.csv')
This will create the file data/surveys_missingrows.csv
and write the data frame to it.
Note that, unlike read.csv()
, write.csv()
needs to take both the data frame and the filename as arguments.
Open your new file data/surveys_missingrows.csv
in a text editor. Does it look as you expected? Compare it to surveys.csv
.
You may notice that there is an extra column. This is because data frames in R have not only column names, but also row names. By default, these row names will be written out to the CSV file. You can change this option with the row.names
argument to write.csv()
There are several different options when writing a CSV file. You can.use the help functions to see the details of other arguments.
Exercise: Try to write out your data frame without row names, so that you get a CSV file of the same style as the original.
Previous: Data Frames Next: Analyzing data