Learning Objectives

After completing this assignment, students should be able to:

  • use and create vectorized functions
  • use the apply family of functions for iteration
  • integrate custom functions with dplyr for iteration

Reading

Lecture Notes

Setup

install.packages(c('dplyr', 'ggplot2', 'readr', 'tidyr'))
download.file("https://datacarpentry.org/semester-biology/data/dinosaur_lengths.csv",
  "dinosaur_lengths.csv")
download.file("https://datacarpentry.org/semester-biology/data/ramesh2010-macroplots.csv",
  "ramesh2010-macroplots.csv")
download.file("https://ndownloader.figshare.com/files/5629536",
  "TREE_SURVEYS.txt")
download.file("https://ndownloader.figshare.com/files/2292172",
  "surveys.csv")
download.file("https://ndownloader.figshare.com/files/3299474",
  "plots.csv")
download.file("https://ndownloader.figshare.com/files/3299483",
  "species.csv")

Place this code at the start of the assignment to load all the required packages.

library(dplyr)
library(ggplot2)
library(readr)

Exercises

  1. Size Estimates Vectorized (10 pts)

    The length of an organism is typically strongly correlated with its body mass. This is useful because it allows us to estimate the mass of an organism even if we only know its length. This relationship generally takes the form:

    mass = a * length^b

    Parameters a and b vary among groups.
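
Arithmetic in R is vectorized, so a function built only from operators such as * and ^ accepts a whole vector at once. The sketch below illustrates this with a hypothetical power_law() function; the function name and its default values are placeholders chosen for illustration and are not part of the assignment.

```r
# Hypothetical helper: a power-law estimate with default parameters.
# The defaults (a = 2, b = 3) are placeholders, not values from the assignment.
power_law <- function(x, a = 2, b = 3) {
  a * x^b
}

# One call handles the whole vector; no loop is needed.
power_law(c(1, 2, 3))
# 2 16 54

# Vector arguments are matched element by element.
power_law(c(1, 2, 3), a = c(1, 2, 3), b = c(2, 2, 2))
# 1 8 27
```

Because the defaults are only used when an argument is omitted, the same function covers both the fixed-parameter and the per-row-parameter cases.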

    1. Write a function named mass_from_length_theropoda() that takes length as an argument and returns an estimate of mass for dinosaurs in the group Theropoda. Use the equation mass <- 0.73 * length^3.63. Copy the data below into R and pass the entire vector to your function to calculate the estimated mass for each dinosaur.

      theropoda_lengths <- c(17.8013631070471, 20.3764452071665, 14.0743486294308, 25.65782386974, 26.0952008049675, 20.3111541103134, 17.5663244372533, 11.2563431277577, 20.081903202614, 18.6071626441984, 18.0991894513166, 23.0659685685892, 20.5798853467837, 25.6179254233558, 24.3714331573996, 26.2847248252537, 25.4753783544473, 20.4642089867304, 16.0738256364701, 20.3494171706583, 19.854399305869, 17.7889814608919, 14.8016421998303, 19.6840911485379, 19.4685885050906, 24.4807784966691, 13.3359960054899, 21.5065994598917, 18.4640304608411, 19.5861532398676, 27.084751999756, 18.9609366301798, 22.4829168046521, 11.7325716149514, 18.3758846100456, 15.537504851634, 13.4848751773738, 7.68561192214935, 25.5963348603783, 16.588285389794)

    2. Create a new version of the function named mass_from_length() to use the equation mass <- a * length^b and take length, a and b as arguments. In the function arguments, set the default values for a to 0.73 and b to 3.63. If you run this function with just the length data from Part 1, you should get the same result as Part 1. Copy the data below into R and call your function using the vector of lengths from Part 1 (above) and these vectors of a and b values to estimate the mass for the dinosaurs using different values of a and b.

      a_values <- c(0.759, 0.751, 0.74, 0.746, 0.759, 0.751, 0.749, 0.751, 0.738, 0.768, 0.736, 0.749, 0.746, 0.744, 0.749, 0.751, 0.744, 0.754, 0.774, 0.751, 0.763, 0.749, 0.741, 0.754, 0.746, 0.755, 0.764, 0.758, 0.76, 0.748, 0.745, 0.756, 0.739, 0.733, 0.757, 0.747, 0.741, 0.752, 0.752, 0.748)

      b_values <- c(3.627, 3.633, 3.626, 3.633, 3.627, 3.629, 3.632, 3.628, 3.633, 3.627, 3.621, 3.63, 3.631, 3.632, 3.628, 3.626, 3.639, 3.626, 3.635, 3.629, 3.642, 3.632, 3.633, 3.629, 3.62, 3.619, 3.638, 3.627, 3.621, 3.628, 3.628, 3.635, 3.624, 3.621, 3.621, 3.632, 3.627, 3.624, 3.634, 3.621)

    3. Create a data frame for this data using dino_data <- data.frame(theropoda_lengths, a_values, b_values). Use dplyr to add a new masses column to this data frame (using mutate() and your function) and print the result to the console.

    Expected outputs for Size Estimates Vectorized
  2. Size Estimates With Maximum (10 pts)

    Write a function named mass_from_length_max that takes length as an argument. If length is less than 20, estimate the mass of the dinosaur using the equation mass <- 0.73 * length ^ 3.63. If length is greater than or equal to 20, return NA instead.

    Copy the data below into R and use sapply() and this new function to estimate the mass for every length in dinosaur_lengths.

    dinosaur_lengths <- c(17.8013631070471, 20.3764452071665, 14.0743486294308, 25.65782386974, 26.0952008049675, 20.3111541103134, 17.5663244372533, 11.2563431277577, 20.081903202614, 18.6071626441984, 18.0991894513166, 23.0659685685892, 20.5798853467837, 25.6179254233558, 24.3714331573996, 26.2847248252537, 25.4753783544473, 20.4642089867304, 16.0738256364701, 20.3494171706583, 19.854399305869, 17.7889814608919, 14.8016421998303, 19.6840911485379, 19.4685885050906, 24.4807784966691, 13.3359960054899, 21.5065994598917, 18.4640304608411, 19.5861532398676, 27.084751999756, 18.9609366301798, 22.4829168046521, 11.7325716149514, 18.3758846100456, 15.537504851634, 13.4848751773738, 7.68561192214935, 25.5963348603783, 16.588285389794)
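
A function that branches with if works on one value at a time, because if needs a single TRUE or FALSE condition (recent versions of R raise an error when the condition has more than one element). sapply() calls the function once per element and collects the results. A minimal sketch, using a hypothetical double_if_small() function and a made-up threshold:

```r
# Hypothetical function: handles ONE value at a time because `if`
# needs a single TRUE/FALSE condition.
double_if_small <- function(x) {
  if (x < 10) {
    result <- x * 2
  } else {
    result <- NA
  }
  return(result)
}

values <- c(3, 12, 7)

# sapply() applies the function to each element and returns a vector.
sapply(values, double_if_small)
# 6 NA 14
```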
    
    Expected outputs for Size Estimates With Maximum
  3. Size Estimates By Name Apply (20 pts)

    If the data on dinosaur lengths with species names is not in your working directory then download it. Import it using read_csv().
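
The "download only if missing" check can be sketched with file.exists(). The helper name download_if_missing() and the example URL below are hypothetical; the demonstration uses a temporary file that already exists, so nothing is fetched.

```r
# Hypothetical helper: download `url` to `destfile` only when the file is
# not already present. Returns TRUE if a download happened, FALSE otherwise.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)
  }
  download.file(url, destfile)
  return(TRUE)
}

# Offline demonstration: the file already exists, so the download is skipped.
existing <- tempfile(fileext = ".csv")
writeLines("placeholder", existing)
downloaded <- download_if_missing("https://example.org/placeholder.csv", existing)
downloaded
# FALSE
```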

    The following function estimates a dinosaur’s mass based on its length and name of its taxonomic group:

    get_mass_from_length_by_name <- function(length, name){
      if (name == "Stegosauria"){
        mass = 10.95 * length ^ 2.64
      }
      else if (name == "Theropoda"){
        mass = 0.73 * length ^ 3.63
      }
      else if (name == "Sauropoda"){
        mass = 214.44 * length ^ 1.46
      }
      else {
        mass = NA
      }
      return(mass)
    }
    
    1. Copy this function into your code and then use this function and mapply() to calculate the estimated mass for each dinosaur. You’ll need to pass the data to mapply() as single vectors or columns, not the whole data frame.

    2. Using dplyr, add a new masses column to the data frame (using rowwise(), mutate() and your function) and print the result to the console.

    3. Using ggplot, make a histogram of dinosaur masses with one subplot for each species (using facet_wrap()).
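
mapply() is the multi-argument counterpart of sapply(): the function comes first, and the vectors that follow are walked in parallel, one element from each per call. A minimal sketch with a hypothetical two-argument function, area_by_shape(), unrelated to the dinosaur data:

```r
# Hypothetical function of two scalar arguments.
area_by_shape <- function(size, shape) {
  if (shape == "square") {
    area <- size^2
  } else if (shape == "circle") {
    area <- pi * size^2
  } else {
    area <- NA
  }
  return(area)
}

sizes  <- c(2, 1, 3)
shapes <- c("square", "circle", "triangle")

# The function comes first; sizes[i] and shapes[i] are passed together.
areas <- mapply(area_by_shape, sizes, shapes)
```

Here areas holds 4, pi, and NA, in that order. With a data frame, the same pattern applies by passing individual columns such as df$column_a and df$column_b.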

    Expected outputs for Size Estimates By Name Apply
  4. Crown Volume Calculation (25 pts)

    The UHURU experiment in Kenya has conducted a survey of Acacia and other tree species in ungulate exclosure treatments. If TREE_SURVEYS.txt is not in your working directory then download a copy. Each of the individuals surveyed was measured for tree height (HEIGHT) and canopy size in two directions (AXIS_1 and AXIS_2). Read these data in using the following code:

    tree_data <- read_tsv("TREE_SURVEYS.txt",
                          na = c("dead", "missing", "MISSING",
                                 "NA", "?", "3.3."))
    

    You want to estimate the crown volumes for the different species and have developed equations for species in the Acacia genus:

    volume <- 0.16 * HEIGHT ^ 0.8 * pi * AXIS_1 * AXIS_2
    

    and the Balanites genus:

    volume <- 1.2 * HEIGHT ^ 0.26 * pi * AXIS_1 * AXIS_2
    

    For all other genera you’ll use a general equation developed for trees:

    volume <- 0.5 * HEIGHT ^ 0.6 * pi * AXIS_1 * AXIS_2
    
    1. Write a function called tree_volume_calc that calculates the canopy volume for the Acacia species in the dataset. To do so, use an if statement in combination with the str_detect() function from the stringr R package. The code str_detect(SPECIES, "Acacia") will return TRUE if the string stored in this variable contains the word “Acacia” and FALSE if it does not. This function will have to take the following arguments as input: SPECIES, HEIGHT, AXIS_1, AXIS_2. Then run the following line:

      tree_volume_calc("Acacia_brevispica", 2.2, 3.5, 1.12)

    2. Expand this function to additionally calculate canopy volumes for other types of trees in this dataset by adding if/else statements and including the volume equations for the Balanites genus and other genera. Then run the following lines:

      tree_volume_calc("Balanites", 2.2, 3.5, 1.12)
      tree_volume_calc("Croton", 2.2, 3.5, 1.12)

    3. Now get the canopy volumes for all the trees in the tree_data dataframe and add them as a new column to the data frame. You can do this using tree_volume_calc() and either mapply() or using dplyr with rowwise and mutate.
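
The general pattern of branching on a text match and then storing per-row results as a new column can be sketched on toy data. Everything here is hypothetical (the function scale_by_label(), the data frame toy, and its columns), and grepl(..., fixed = TRUE) is used as a base-R stand-in for stringr's str_detect() so that the sketch runs without extra packages.

```r
# Hypothetical scalar function: branches on whether `label` contains "oak".
# grepl(..., fixed = TRUE) is a base-R stand-in for stringr::str_detect().
scale_by_label <- function(label, value) {
  if (grepl("oak", label, fixed = TRUE)) {
    scaled <- value * 2
  } else {
    scaled <- value
  }
  return(scaled)
}

# Toy data frame; the column names are made up for this sketch.
toy <- data.frame(
  label = c("red_oak", "pine", "white_oak"),
  value = c(1, 5, 10),
  stringsAsFactors = FALSE
)

# Pass columns, not the whole data frame, and store the result as a new column.
# USE.NAMES = FALSE keeps the result from being named after the labels.
toy$scaled <- mapply(scale_by_label, toy$label, toy$value, USE.NAMES = FALSE)
toy$scaled
# 2 5 20
```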

    Expected outputs for Crown Volume Calculation
  5. Portal Data Iteration Without Loops (25 pts)

    This exercise covers iteration without loops in R using Portal data. You’ll practice vectorization, apply functions, and integration with dplyr using real ecological data from the Portal Project.

    If surveys.csv, species.csv, and plots.csv are not in your working directory then download them.

    Load the three data files using read_csv.

    1. Create a vectorized function called estimate_metabolic_rate that takes weight as input and returns metabolic rate using the equation: metabolic_rate = 0.073 * weight ^ 0.75. Run it on the following vector:

    weights <- c(15, 25, 35, 45, 20, 70, 72)

    2. Use mutate() and estimate_metabolic_rate to create a version of the data in surveys with a column called metabolic_rate for all animals that have weight measurements. Remove the rows without metabolic rates. Select the year, species_id, and metabolic_rate columns.

    3. Create a function called classify_by_weight that takes a single weight value and returns:

    • “small” if weight < 20g
    • “medium” if weight is 20-50g
    • “large” if weight > 50g
    • “unknown” if weight is missing (NA)

    Use sapply to apply classify_by_weight to the weights vector from (1).
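
When a classifier has to handle missing values, the order of the checks matters: a comparison such as NA < 20 evaluates to NA, and if (NA) is an error. Testing is.na() first avoids this. A minimal sketch with a hypothetical classify_temp() function and made-up thresholds:

```r
# Hypothetical classifier for a single temperature value.
# The NA check comes first: `if (NA < 10)` is an error, so the comparisons
# must only run once the value is known to be present.
classify_temp <- function(temp) {
  if (is.na(temp)) {
    label <- "unknown"
  } else if (temp < 10) {
    label <- "cold"
  } else if (temp <= 25) {
    label <- "mild"
  } else {
    label <- "hot"
  }
  return(label)
}

temps <- c(4, 18, NA, 31)
sapply(temps, classify_temp)
# "cold" "mild" "unknown" "hot"
```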

    4. Use mutate, classify_by_weight, and the surveys table to produce a data frame that has data on the year, plot_id, species_id, and weight_class (where weight_class is the output of classify_by_weight). Join this data with the plots table to add information on plot_type. Filter the data to only include data where plot_type is “Control”.

    5. Group the results of (4) based on plot_id and weight_class (using group_by) and count the number of individuals in each group (using summarize).

    6. Create a function called energy_budget() that takes genus, species, and weight as inputs (you’ll need to join the surveys and species tables to get this data together). It should return daily energy needs for each individual in surveys based on the values of genus and species using the following equations:

    • If genus is “Dipodomys”: energy = 0.065 * weight ^ 0.75 * 24
    • If genus is “Chaetodipus” and species is “penicillatus”: energy = 0.080 * weight ^ 0.75 * 24
    • If genus is “Chaetodipus” and species is “baileyi”: energy = 0.26 * weight ^ 0.75 * 24
    • All other species: energy = 0.073 * weight ^ 0.75 * 24

    Run the function with mapply() and the following inputs:

    • genus: c("Dipodomys", "Chaetodipus", "Neotoma")
    • species: c("merriami", "penicillatus", "albigula")
    • weight: c(45, 22, 156)

    7. Use mutate and rowwise to calculate the energy budget for each individual in surveys. Drop rows with NA for the new energy_budget column. Group and summarize the data to get a total energy budget for each combination of year, month, and day by summing all of the values of energy_budget in each group.

    Expected outputs for Portal Data Iteration Without Loops
  6. Check That Your Code Runs (10 pts)

    Sometimes you think your code runs, but it only actually works because of something else you did previously. To make sure it actually runs you should save your work and then run it in a clean environment.

    Follow these steps in RStudio to make sure your code really runs:

    1. Restart R (see above) by clicking Session in the menu bar and selecting Restart R:

    Screenshot showing clicking session from the menu bar and selecting Restart R

    2. If the Environment tab isn’t empty click on the broom icon to clear it:

    Screenshot showing the Environment tab with the cursor hovering over the broom icon

    The Environment tab should now say “Environment Is Empty”:

    Screenshot showing the Environment tab with only the words Environment Is Empty

    3. Rerun your entire homework assignment using “Source with Echo” to make sure it runs from start to finish and produces the expected results.

    Screenshot showing the RStudio Source with Echo item hovered in the Source dropdown

    4. Make sure that you saved your code with the name assignment somewhere in the file name. You should see the file in the Files tab and the name of the file should be black (not red with an * in the tab at the top of the text editor):

    Screenshot showing the Files tab with the cursor hovering over the assignment file

    Screenshot showing the file name in the editor tab and it is black and there is no *

    5. Make sure that your code will run on other computers:

    • No setwd() (use RStudio Projects instead)
    • Use / not \ for paths
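
One way to follow both portability rules is to build paths relative to the project folder with file.path(), which joins components with / on every platform. The folder and file names below are placeholders for this sketch.

```r
# file.path() joins path components with "/" on every platform.
# "data" and "surveys.csv" are placeholder names for this sketch.
data_file <- file.path("data", "surveys.csv")
data_file
# "data/surveys.csv"
```
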
    Expected outputs for Check That Your Code Runs
  7. Tree Growth (Challenge - optional)

    The UHURU experiment in Kenya has conducted a survey of Acacia and other tree species in ungulate exclosure treatments. Each of the individuals surveyed was measured for tree height (HEIGHT), circumference (CIRC) and canopy size in two directions (AXIS_1 and AXIS_2). If the file TREE_SURVEYS.txt isn’t already in your working directory, download it (see the download.file() calls in Setup).

    Read the data in using the following code:

    tree_data <- read_tsv("TREE_SURVEYS.txt",
                          na = c("dead", "missing", "MISSING",
                                 "NA", "?", "3.3."))
    
    1. Write a function named get_growth() that takes two inputs, a vector of sizes and a vector of years, and calculates the average annual growth rate. Pseudo-code for calculating this rate is (size_in_last_year - size_in_first_year) / (last_year - first_year). Test this function by running get_growth(c(40.2, 42.6, 46.0), c(2020, 2021, 2022)).

    2. Use dplyr and this function to get the growth for each individual tree along with information about the TREATMENT that tree occurs on. Trees are identified by a unique value in the ORIGINAL_TAG column. Don’t include information for cases where a TREATMENT is not known (e.g., where it is NA).

    3. Using ggplot and the output from (2), make a histogram of growth rates for each TREATMENT, with each TREATMENT in its own facet. Use geom_vline() to add a vertical line at 0 to help indicate which trees are getting bigger vs. smaller. Include good axis labels.

    4. Create a single function called compare_growth() that combines your work in (2) and (3). It should take the arguments: df (the data frame being used), measure (the column that contains the size measurement to measure growth on; we used CIRC), tag_column (the name of the column with the unique tag; we used ORIGINAL_TAG), sample_column (the name of the column indicating different samples; we used YEAR), and facet_column (the name of the column to use to determine which groups to make histograms for; we used TREATMENT). Use the function to recreate your original plot using compare_growth(tree_data, CIRC, ORIGINAL_TAG, YEAR, TREATMENT). Then use the function to create a similar plot showing growth faceted by SPECIES, using SURVEY as the sample_column and AXIS_1 as the measure, by running compare_growth(tree_data, AXIS_1, ORIGINAL_TAG, SURVEY, SPECIES).
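
The pseudo-code in (1) is a change in value divided by a change in time. As a hedged illustration of that arithmetic only, the sketch below uses a hypothetical average_rate() helper on made-up numbers. It assumes the earliest and latest times can be located with which.min() and which.max(), so the rows do not need to be sorted first.

```r
# Hypothetical helper, not the assignment's get_growth():
# average change in `values` per unit of `times`, from earliest to latest.
average_rate <- function(values, times) {
  first <- which.min(times)  # position of the earliest time
  last  <- which.max(times)  # position of the latest time
  (values[last] - values[first]) / (times[last] - times[first])
}

# Rows deliberately out of order to show that sorting is not required.
average_rate(c(15, 10, 12), c(2002, 2000, 2001))
# 2.5
```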

    Expected outputs for Tree Growth

Assignment submission & checklist