Remember the basic rules of tidy data structure

  1. One column per type of information
  2. One row per observation
  3. One value in each cell
  • Unfortunately lots of existing data doesn’t follow these rules
  • Need to convert them to this tidy structure for analysis
  • Use a package called tidyr
install.packages("tidyr")
library(tidyr)
library(dplyr)

Pivot data from wide to long

  • One common issue is data spread over multiple columns that should be in one

Copy link to Western Ghats tree data from datasets page

raw_data = read_tsv("Macroplot_data_Rev.txt")

View data

  • Data on tree girth from the Western Ghats
  • Western Ghats is a mountainous region on the western edge of the Indian peninsula considered one of the worlds 8 biodiversity hotspots.
  • When a tree had multiple stems the diameter of each stem was entered in a separate column
  • What would a better structure be?

Lead discussion to correct structure

  • To convert the raw data into clean data we’ll use a data cleaning pipeline like the data manipulation pipelines we’ve developed previously
  • Let’s start by adding a treeid column to our data frame using the mutate function from dplyr
  • We want one treeid for each row because there is one tree for each row
clean_data <- raw_data |>
  mutate(treeid = 1:n())
  • To get the data in this form we can use pivot_longer
    • Removes redundant columns
    • Arguments:
      • data.frame
      • Columns to pivot
      • names_to: the name of the new column to put the column names in
      • values_to: the name of the new column to put the column values in
clean_data <- raw_data |>
  mutate(treeid = 1:nrow(raw_data)) |>
  pivot_longer(
    TreeGirth1:TreeGirth5,
    names_to = "stem",
    values_to = "girth"
  )
  • The colon specifies all columns starting at TreeGirth1 and ending at TreeGirth5
  • Alternatively we could specify all columns starting with TreeGirth
starts_with("TreeGirth")

View data

  • Still has zeros for where there were no stems, so filter these out
clean_data <- raw_data |>
  mutate(treeid = 1:nrow(raw_data)) |>
  pivot_longer(starts_with("TreeGirth"), names_to = "stem", values_to = "girth") |>
  filter(girth != 0)

Extract

  • Want stem column to contain numbers 1-5 not TreeGirth1
  • extract()
    • Extracts one or more values from a column
    • Uses regular expressions
    • Arguments:
      • data.frame
      • Column name
      • Names of the new columns
      • Regular expression
clean_data <- raw_data |>
  mutate(treeid = 1:nrow(raw_data)) |>
  pivot_longer(starts_with("TreeGirth"), names_to = "stem", values_to = "girth") |>
  filter(girth != 0) |>
  extract(stem, 'stem', 'TreeGirth(.)')
  • TreeGirth. means the word “TreeGirth” followed by a single value
  • The () indicate what part of this to extract, so just the number at the end

  • This gives us the result we want, with just the stem number in the stem column
  • But you may notice that this number is on the left side of the column, not the right
  • That’s because the number is still stored as a character, because it was extracted from a string
  • To convert it to it’s actual type we can add the optional argument convert = TRUE to extract
clean_data <- raw_data |>
  mutate(treeid = 1:nrow(raw_data)) |>
  pivot_longer(starts_with("TreeGirth"), names_to = "stem", values_to = "girth") |>
  filter(girth != 0) |>
  extract(stem, 'stem', 'TreeGirth(.)', convert = TRUE)
  • This attempts to convert the values from characters to their actual type
  • This is a good addition when extracting numbers because then you can work with the column as numbers

Separate

  • Genus and species information are combined in a single column
  • separate()
    • Separates multiple values in single column
    • Arguments:
      • data.frame
      • Column name
      • New column names
      • Separator value, character, or position
clean_data <- raw_data |>
  pivot_longer(TreeGirth1:TreeGirth5, names_to = "stem", values_to = "girth") |>
  filter(girth != 0) |>
  extract(stem, 'stem', 'TreeGirth(.)') |>
  separate(SpCode, c('genus', 'species'), 4)

Unite and Pivot Wider

  • Sometimes we need to go in the other direction
  • Count the number of stems of each species on each plot
stem_counts <- clean_data |> 
  group_by(PlotID, genus, species) |> 
  summarize(count = n(), .groups = 'drop')
  • Software for running analysis requires cross-tab (or wide) data
  • PlotID in rows, species in columns, counts in cells
  • First need a single species ID
  • unite
    • Combine values from multiple columns into one
    • Arguments:
      • data.frame
      • New column name
      • Columns to combine
stem_counts_wide <- stem_counts |> 
  unite('species_id', genus, species)
  • Then make the data wide
  • pivot_wider
    • Spread values across multiple columns
    • Arguments:
      • data.frame
      • Name of column to use for wide columns
      • Name of column containing the values for the cells
stem_counts_wide <- stem_counts |> 
  unite(species_id, genus, species) |>
  pivot_wider(names_from = species_id, values_from = count)
  • This leaves null values when there is no value in the starting table
  • But we can replace this with something else using values_fill
stem_counts_wide <- stem_counts |> 
  unite(species_id, genus, species) |>
  pivot_wider(names_from = species_id,
              values_from = count,
              values_fill = 0)

Completing data with gaps

  • Some write out a value once and then leave the following rows blank
gappy_data <- read_csv("http://www.datacarpentry.org/semester-biology/data/gappy-data.csv")
gappy_data |>
  print(n = 28)
  • This works well for humans, but not for computers
  • Can fill in these gaps using fill
clean_data <- gappy_data |>
  fill(Species)
clean_data |>
  print(n = 28)
  • Fills down by default, but other directions are possible

  • Often data only includes observed values, but we need to list other values
  • Missing zeros or NA’s
  • In our data we have 3 species and up to 10 individuals/species
  • But only 28 values because some species have data for <10 individuals
  • Use complete to provide one row for each unique combination of the values in the columns provided
clean_data <- gappy_data |>
  fill(Species) |>
  complete(Species, Individual)
  • This fills in NA’s for any combination of the species ID and the individual number (numbers 1-10) not already in the data
  • Can also use this to add zeros to our long stem_counts data frame
stem_counts |>
  complete(PlotID, nesting(genus, species), fill = list(count = 0))