Remember the basic rules of tidy data structure
- One column per type of information
- One row per observation
- One value in each cell
- Unfortunately lots of existing data doesn’t follow these rules
- Need to convert them to this tidy structure for analysis
- Use a package called
tidyr
install.packages("tidyr")
library(tidyr)
library(dplyr)
Pivot data from wide to long
- One common issue is data spread over multiple columns that should be in one
Copy link to Western Ghats tree data from datasets page
raw_data = read_tsv("Macroplot_data_Rev.txt")
View data
- Data on tree girth from the Western Ghats
- Western Ghats is a mountainous region on the western edge of the Indian peninsula considered one of the worlds 8 biodiversity hotspots.
- When a tree had multiple stems the diameter of each stem was entered in a separate column
- What would a better structure be?
Lead discussion to correct structure
- To convert the raw data into clean data we’ll use a data cleaning pipeline like the data manipulation pipelines we’ve developed previously
- Let’s start by adding a
treeidcolumn to our data frame using themutatefunction fromdplyr - We want one
treeidfor each row because there is one tree for each row
clean_data <- raw_data |>
mutate(treeid = 1:n())
- To get the data in this form we can use
pivot_longer- Removes redundant columns
- Arguments:
data.frame- Columns to pivot
names_to: the name of the new column to put the column names invalues_to: the name of the new column to put the column values in
clean_data <- raw_data |>
mutate(treeid = 1:nrow(raw_data)) |>
pivot_longer(
TreeGirth1:TreeGirth5,
names_to = "stem",
values_to = "girth"
)
- The colon specifies all columns starting at
TreeGirth1and ending atTreeGirth5 - Alternatively we could specify all columns starting with
TreeGirth
starts_with("TreeGirth")
View data
- Still has zeros for where there were no stems, so filter these out
clean_data <- raw_data |>
mutate(treeid = 1:nrow(raw_data)) |>
pivot_longer(starts_with("TreeGirth"), names_to = "stem", values_to = "girth") |>
filter(girth != 0)
Extract
- Want
stemcolumn to contain numbers 1-5 notTreeGirth1 extract()- Extracts one or more values from a column
- Uses regular expressions
- Arguments:
data.frame- Column name
- Names of the new columns
- Regular expression
clean_data <- raw_data |>
mutate(treeid = 1:nrow(raw_data)) |>
pivot_longer(starts_with("TreeGirth"), names_to = "stem", values_to = "girth") |>
filter(girth != 0) |>
extract(stem, 'stem', 'TreeGirth(.)')
TreeGirth.means the word “TreeGirth” followed by a single value-
The
()indicate what part of this to extract, so just the number at the end - This gives us the result we want, with just the stem number in the
stemcolumn - But you may notice that this number is on the left side of the column, not the right
- That’s because the number is still stored as a character, because it was extracted from a string
- To convert it to it’s actual type we can add the optional argument
convert = TRUEtoextract
clean_data <- raw_data |>
mutate(treeid = 1:nrow(raw_data)) |>
pivot_longer(starts_with("TreeGirth"), names_to = "stem", values_to = "girth") |>
filter(girth != 0) |>
extract(stem, 'stem', 'TreeGirth(.)', convert = TRUE)
- This attempts to convert the values from characters to their actual type
- This is a good addition when extracting numbers because then you can work with the column as numbers
Separate
- Genus and species information are combined in a single column
separate()- Separates multiple values in single column
- Arguments:
data.frame- Column name
- New column names
- Separator value, character, or position
clean_data <- raw_data |>
pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") |>
filter(girth != 0) |>
extract(stem, 'stem', 'TreeGirth(.)') |>
separate(SpCode, c('genus', 'species'), 4)
Unite and Pivot Wider
- Sometimes we need to go in the other direction
- Count the number of stems of each species on each plot
stem_counts <- clean_data |>
group_by(PlotID, genus, species) |>
summarize(count = n(), .groups = 'drop')
- Software for running analysis requires cross-tab (or wide) data
- PlotID in rows, species in columns, counts in cells
- First need a single species ID
unite- Combine values from multiple columns into one
- Arguments:
data.frame- New column name
- Columns to combine
stem_counts_wide <- stem_counts |>
unite('species_id', genus, species)
- Then make the data wide
pivot_wider- Spread values across multiple columns
- Arguments:
data.frame- Name of column to use for wide columns
- Name of column containing the values for the cells
stem_counts_wide <- stem_counts |>
unite(species_id, genus, species) |>
pivot_wider(names_from = species_id, values_from = count)
- This leaves null values when there is no value in the starting table
- But we can replace this with something else using
values_fill
stem_counts_wide <- stem_counts |>
unite(species_id, genus, species) |>
pivot_wider(names_from = species_id,
values_from = count,
values_fill = 0)
Completing data with gaps
- Some write out a value once and then leave the following rows blank
gappy_data <- read_csv("http://www.datacarpentry.org/semester-biology/data/gappy-data.csv")
gappy_data |>
print(n = 28)
- This works well for humans, but not for computers
- Can fill in these gaps using
fill
clean_data <- gappy_data |>
fill(Species)
clean_data |>
print(n = 28)
-
Fills down by default, but other directions are possible
- Often data only includes observed values, but we need to list other values
- Missing zeros or
NA’s - In our data we have 3 species and up to 10 individuals/species
- But only 28 values because some species have data for <10 individuals
- Use
completeto provide one row for each unique combination of the values in the columns provided
clean_data <- gappy_data |>
fill(Species) |>
complete(Species, Individual)
- This fills in
NA’s for any combination of the species ID and the individual number (numbers 1-10) not already in the data - Can also use this to add zeros to our long
stem_countsdata frame
stem_counts |>
complete(PlotID, nesting(genus, species), fill = list(count = 0))