tidyr

Remember the basic rules of tidy data structure

One column per type of information
One row per observation
One value in each cell

Unfortunately lots of existing data doesn’t follow these rules
Need to convert them to this tidy structure for analysis
Use a package called tidyr

install.packages("tidyr")
library(tidyr)
library(dplyr)

Pivot data from wide to long

One common issue is data spread over multiple columns that should be in one

Copy link to Western Ghats tree data from datasets page

raw_data = read_tsv("Macroplot_data_Rev.txt")

View data

Data on tree girth from the Western Ghats
Western Ghats is a mountainous region on the western edge of the Indian peninsula considered one of the worlds 8 biodiversity hotspots.
When a tree had multiple stems the diameter of each stem was entered in a separate column
What would a better structure be?

Lead discussion to correct structure

To convert the raw data into clean data we’ll use a data cleaning pipeline like the data manipulation pipelines we’ve developed previously
Let’s start by adding a treeid column to our data frame using the mutate function from dplyr
We want one treeid for each row because there is one tree for each row

clean_data <- raw_data |>
  mutate(treeid = 1:n())

To get the data in this form we can use pivot_longer
- Removes redundant columns
- Arguments:
  - data.frame
  - Columns to pivot
  - names_to: the name of the new column to put the column names in
  - values_to: the name of the new column to put the column values in

clean_data <- raw_data |>
  mutate(treeid = 1:nrow(raw_data)) |>
  pivot_longer(
    TreeGirth1:TreeGirth5,
    names_to = "stem",
    values_to = "girth"
  )

The colon specifies all columns starting at TreeGirth1 and ending at TreeGirth5
Alternatively we could specify all columns starting with TreeGirth

starts_with("TreeGirth")

View data

Still has zeros for where there were no stems, so filter these out

clean_data <- raw_data |>
  mutate(treeid = 1:nrow(raw_data)) |>
  pivot_longer(starts_with("TreeGirth"), names_to = "stem", values_to = "girth") |>
  filter(girth != 0)

Extract

Want stem column to contain numbers 1-5 not TreeGirth1
extract()
- Extracts one or more values from a column
- Uses regular expressions
- Arguments:
  - data.frame
  - Column name
  - Names of the new columns
  - Regular expression

clean_data <- raw_data |>
  mutate(treeid = 1:nrow(raw_data)) |>
  pivot_longer(starts_with("TreeGirth"), names_to = "stem", values_to = "girth") |>
  filter(girth != 0) |>
  extract(stem, 'stem', 'TreeGirth(.)')

TreeGirth. means the word “TreeGirth” followed by a single value
The () indicate what part of this to extract, so just the number at the end
This gives us the result we want, with just the stem number in the stem column
But you may notice that this number is on the left side of the column, not the right
That’s because the number is still stored as a character, because it was extracted from a string
To convert it to it’s actual type we can add the optional argument convert = TRUE to extract

clean_data <- raw_data |>
  mutate(treeid = 1:nrow(raw_data)) |>
  pivot_longer(starts_with("TreeGirth"), names_to = "stem", values_to = "girth") |>
  filter(girth != 0) |>
  extract(stem, 'stem', 'TreeGirth(.)', convert = TRUE)

This attempts to convert the values from characters to their actual type
This is a good addition when extracting numbers because then you can work with the column as numbers

Separate

Genus and species information are combined in a single column
separate()
- Separates multiple values in single column
- Arguments:
  - data.frame
  - Column name
  - New column names
  - Separator value, character, or position

clean_data <- raw_data |>
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") |>
  filter(girth != 0) |>
  extract(stem, 'stem', 'TreeGirth(.)') |>
  separate(SpCode, c('genus', 'species'), 4)

Unite and Pivot Wider

Sometimes we need to go in the other direction
Count the number of stems of each species on each plot

stem_counts <- clean_data |> 
  group_by(PlotID, genus, species) |> 
  summarize(count = n(), .groups = 'drop')

Software for running analysis requires cross-tab (or wide) data
PlotID in rows, species in columns, counts in cells
First need a single species ID
unite
- Combine values from multiple columns into one
- Arguments:
  - data.frame
  - New column name
  - Columns to combine

stem_counts_wide <- stem_counts |> 
  unite('species_id', genus, species)

Then make the data wide
pivot_wider
- Spread values across multiple columns
- Arguments:
  - data.frame
  - Name of column to use for wide columns
  - Name of column containing the values for the cells

stem_counts_wide <- stem_counts |> 
  unite(species_id, genus, species) |>
  pivot_wider(names_from = species_id, values_from = count)

This leaves null values when there is no value in the starting table
But we can replace this with something else using values_fill

stem_counts_wide <- stem_counts |> 
  unite(species_id, genus, species) |>
  pivot_wider(names_from = species_id,
              values_from = count,
              values_fill = 0)

Completing data with gaps

Some write out a value once and then leave the following rows blank

gappy_data <- read_csv("http://www.datacarpentry.org/semester-biology/data/gappy-data.csv")
gappy_data |>
  print(n = 28)

This works well for humans, but not for computers
Can fill in these gaps using fill

clean_data <- gappy_data |>
  fill(Species)
clean_data |>
  print(n = 28)

Fills down by default, but other directions are possible
Often data only includes observed values, but we need to list other values
Missing zeros or NA’s
In our data we have 3 species and up to 10 individuals/species
But only 28 values because some species have data for <10 individuals
Use complete to provide one row for each unique combination of the values in the columns provided

clean_data <- gappy_data |>
  fill(Species) |>
  complete(Species, Individual)

This fills in NA’s for any combination of the species ID and the individual number (numbers 1-10) not already in the data
Can also use this to add zeros to our long stem_counts data frame

stem_counts |>
  complete(PlotID, nesting(genus, species), fill = list(count = 0))

Notes

Remember the basic rules of tidy data structure

Pivot data from wide to long

Extract

Separate

Unite and Pivot Wider

Completing data with gaps