Learning Objectives
- understand the concept of a
data.frame
- use sequences
- know how to access any element of a
data.frame
- use sorting and indexing to sort the rows of a
data.frame
data.frame
is the de facto data structure for most tabular data and what we use for statistics and plotting.
A data.frame
is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors). The str()
function is useful to inspect the data types of the columns.
A data.frame
can be created by the functions read.csv()
or read.table()
, in other words, when importing spreadsheets from your hard drive (or the web).
By default, data.frame
converts (= coerces) columns that contain characters (i.e., text) into the factor
data type. Depending on what you want to do with the data, you may want to keep these columns as character
. To do so, read.csv()
and read.table()
have an argument called stringsAsFactors
which can be set to FALSE
:
some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)
You can also create data.frame
manually with the function data.frame()
. This function can also take the argument stringsAsFactors
. Compare the output of these examples:
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
feel=c("furry", "furry", "squishy", "spiny"),
weight=c(45, 8, 1.1, 0.8))
str(example_data)
## 'data.frame': 4 obs. of 3 variables:
## $ animal: Factor w/ 4 levels "cat","dog","sea cucumber",..: 2 1 3 4
## $ feel : Factor w/ 3 levels "furry","spiny",..: 1 1 3 2
## $ weight: num 45 8 1.1 0.8
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
feel=c("furry", "furry", "squishy", "spiny"),
weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
str(example_data)
## 'data.frame': 4 obs. of 3 variables:
## $ animal: chr "dog" "cat" "sea cucumber" "sea urchin"
## $ feel : chr "furry" "furry" "squishy" "spiny"
## $ weight: num 45 8 1.1 0.8
data.frame
, can you spot and fix them? Don’t hesitate to experiment!## There are a few mistakes in this hand crafted `data.frame`,
## can you spot and fix them? Don't hesitate to experiment!
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
author_last=c(Darwin, Mayr, Dobzhansky),
year=c(1942, 1970))
## Can you predict the class for each of the columns in the following example?
## Check your guesses using `str(country_climate)`. Are they what you expected?
## Why? why not?
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
climate=c("cold", "hot", "temperate", "hot/temperate"),
temperature=c(10, 30, 18, "15"),
north_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo=c(FALSE, FALSE, FALSE, 1))
Check your gueses using str(country_climate)
. Are they what you expected? Why? Why not?
R coerces (when possible) to the data type that is the least common denominator and the easiest to coerce to.
Can you fix the R code above statement so that the fields of country_climate
have the types you would like?
We already saw how the functions head()
and str()
can be useful to check the content and the structure of a data.frame
. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.
dim()
- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow()
- returns the number of rowsncol()
- returns the number of columnshead()
- shows the first 6 rowstail()
- shows the last 6 rowsnames()
- returns the column names (synonym of colnames()
for data.frame
objects)rownames()
- returns the row namesstr()
- structure of the object and information about the class, length and content of each columnsummary()
- summary statistics for each columnNote: most of these functions are “generic”, they can be used on other types of objects besides data.frame
.
If we want to extract one or several values from a vector, we must provide one or several indices in square brackets, just as we do in math. For instance:
animals <- c("mouse", "rat", "dog", "cat")
animals[2]
## [1] "rat"
animals[c(3, 2)]
## [1] "dog" "rat"
animals[2:4]
## [1] "rat" "dog" "cat"
more_animals <- animals[c(1:3, 2:4)]
more_animals
## [1] "mouse" "rat" "dog" "rat" "dog" "cat"
R indexes start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
:
is a special function that creates numeric vectors of integer in increasing or decreasing order. Test 1:10
and 10:1
for instance. The function seq()
(for sequence) can be used to create more complex patterns:
seq(1, 10, by=2)
## [1] 1 3 5 7 9
seq(5, 10, length.out=3)
## [1] 5.0 7.5 10.0
seq(50, by=5, length.out=10)
## [1] 50 55 60 65 70 75 80 85 90 95
seq(1, 8, by=3) # sequence stops to stay below upper limit
## [1] 1 4 7
Our survey data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers.
surveys[1, 1] # first element in the first column of the data frame
surveys[1, 6] # first element in the 6th column
surveys[1:3, 7] # first three elements in the 7th column
surveys[3, ] # the 3rd element for all columns
surveys[, 8] # the entire 8th column
head_surveys <- surveys[1:6, ] # surveys[1:6, ] is equivalent to head(surveys)
surveys$species # the species column
nrow()
on a data.frame
returns the number of rows. Use it, in conjuction with seq()
to create a new data.frame
called surveys_by_10
that includes every 10th row of the survey data frame starting at row 10 (10, 20, 30, …)Logical values can be used to select a subset of the data.
colors <- c("red", "yellow", NA, "blue")
colors[c(TRUE, FALSE, FALSE, TRUE)] # elements corresponding TRUE positions
## [1] "red" "blue"
This can be useful when you have a function that produces TRUE/FALSE and you can then use this to subset your data. For example, removing NA (null) values.
is.na(colors) # logical vector TRUE if position is NA
## [1] FALSE FALSE TRUE FALSE
!is.na(colors) # not (!) missing, FALSE if position is NA
## [1] TRUE TRUE FALSE TRUE
colors[!is.na(colors)] # colours that are not missing (NA)
## [1] "red" "yellow" "blue"
example_data <- data.frame(fruit=c("apple", "mango", "banana", "coconut"),
quantity=c(45, 8, 12, 2))
example_data$quantity > 10 # quantity is greater than 10 (TRUE/FALSE)
## [1] TRUE FALSE TRUE FALSE
example_data[example_data$quantity > 10,] # rows where quantity is greater than 10
## fruit quantity
## 1 apple 45
## 3 banana 12
fruit_info <- data.frame(fruit=c("apple", "mango", "banana", "coconut"),
quantity=c(45, 8, 12, 2))
order(fruit_info$quantity) # The order of the quantity column
## [1] 4 2 3 1
correct_row_order <- order(fruit_info$quantity)
fruit_info[correct_row_order,] # sort fruit_info by quantity column
## fruit quantity
## 4 coconut 2
## 2 mango 8
## 3 banana 12
## 1 apple 45
From the surveys data frame, print a copy of the weights, without any NA values.
Produce a copy of the surveys data frame with only rows that contain weights.
Sort the entire fruit_info data frame by reverse (descending) alphabetical order of the fruit names. Hint: args(order).
Previous: Starting with data Next: Manipulating data