- Download
portal_mammals.sqlite.- Make sure the copy you are going to use in class does not have the
SpeciesCountstable or view.
- We can work with data in databases directly from
R - No need to export files from the database
- Lets the database do the heavy lifting
- Faster
- No RAM limits
- We can do this either directly using SQL and the
DBIpackage - Or work with the data just like we’ve done before using
dplyr
Installation
DBIfor general database functionalityRSQLitefor translating between R and SQLitedbplyrfor integrating databases withdplyr
install.packages(c("DBI", "dbplyr", "RSQLite"))
Connect
library(DBI)
library(RSQLite)
library(dplyr)
portaldb <- dbConnect(SQLite(), "portal.sqlite")
- It’s most common for this code to be written as
portaldb <- dbConnect(RSQLite::SQLite(), "portal.sqlite")
portaldb
To avoid needing to load the RSQLite package using library()
Check out database structure
- Once connected to a database we can list the tables
dbListTables(portaldb)
- We can also look at the details of individual tables
dbListFields(portaldb, "plots")
Connecting to tables
- We can also connect to individual tables
surveys <- tbl(portaldb, "surveys")
surveys
- The data is still in the database, not in
R, so we can’t tell how many rows it has or view it in the same way we would view a data frame. - If we want to load all of the data from the table into an
Rdata frame we use thecollect()function
surveys_df <- collect(surveys)
Write a query
- We can interact with the data in the database in two ways
- First, we can write queries in SQL
- Write a query to extract counts for each
species_id
count_query <- "SELECT species_id, COUNT(*)
FROM surveys
GROUP BY species_id"
dbGetQuery(portaldb, count_query)
-
This uses
DBIto run the query and return it to the R as a data frame -
Alternatively we can use the
tbl()function from dplyr to create a table based on the query
tbl(portaldb, sql(count_query))
- The table is still stored in the database
- Number of rows is unknown as shown by
?? - When we have the results we want we can use
collect()to load them intoRas a data frame
count_data <- tbl(portaldb, sql(count_query)) %>%
collect()
Using dplyr pipelines with databases
- We can also use
dplyrcommands directly on databases - To obtain the same results as our previous query using dplyr we use
group_by()andsummarize()
species_counts <- surveys %>%
group_by(species_id) %>%
summarize(count = n())
- All of the calculation still happens in the databases
- So outside of RAM calculations are possible
- We can then bring the resulting data into
Rusingcollect()for further analysis
species_counts <- surveys %>%
group_by(species_id) %>%
summarize(count = n()) %>%
collect()
Write new information to database
- Can also move data we created in R into the database using
copy_to() - We can see that the current version of the database only has the three original tables
dbListTables(portaldb)
- If we wanted to store our new
species_countstable in the database
copy_to(portaldb, species_counts, temporary=FALSE,
name="SpeciesCounts")
dbListTables(portaldb)
- Do Copy to Database.