Learning Objectives


Introduction

Interacting with databases through scripted languages can offer advantages over querying databases via a GUI interface. A GUI interface for your database is easier to use and allows the quick viewing of adhoc queries. Querying a database with a programatic interface (in this case R, but it could be any language) are slightly more complicated. However the trade-off is that data manipulations are preserved in the code. Aggregations, summaries and other database operations are preserved. Therefore those pre-analysis data manipulation steps are not lost and can be reproduced later by yourself or others.

Connecting R to sqlite databases

R can connect to databases through a number of packages. In our case we we will use RSQLite to connect to existing SQLite3 databases. However you should be able to connect on almost any database in R via JDBChttp://cran.r-project.org/web/packages/RJDBC/index.html or ODBC, or specific database packages (such as we are doing, or MySQL ).

To begin these exercises we’ll connect to the mammal database you’ve just created.

library(RSQLite)
#> Loading required package: DBI
#> Loading required package: methods
## Set dbname and driver out of convenience
myDB <- "data/portal_mammals.sqlite"
conn <- dbConnect(drv = SQLite(), dbname= myDB)

Now you’re connected to your database, you can get information about the tables in the database. While SQLite manager will provide this for you, you’ll need to get it yourself with some basic SQL commands

Basic queries

You can grab information about all tables like this:

dbGetQuery(conn, "SELECT type, tbl_name  FROM sqlite_master")

RSQLite has some functions that easily list basic information about your database and tables. Here you can see the types and names of fields and get a count of records.

dbListTables(conn)
dbListFields(conn, "surveys")
dbGetQuery(conn, "SELECT count(*) FROM surveys")

Great! That’s all there is to it. You are now connected to your database and know something about the tables in it. From here you can manipulate it the exact same way you can from SQLite Manager except in a scripted manner. Let’s try some basic queries from the previous lesson. Querying simply takes a connection to a database and query as inputs and returns a dataframe with the results.

q <- 'SELECT DISTINCT year, species_id FROM surveys'
result <-  dbGetQuery(conn, q)
head(result)

Challenge

Write a query that gets counts of genus by plot type. If this seems to advanced, then get counts of genus by plot_id (eliminates one of the >joins)

Answer

q <- "SELECT d.plot_type , c.genus, count(*)
FROM
(SELECT a.genus, b.plot
FROM species a
JOIN surveys b
ON a.species_id = b.species_id) c
JOIN plots d
ON c.plot = d.plot_id
GROUP BY d.plot_type,c.genus"

result <- dbGetQuery(conn,q)
head(result)

In the above answer you’ll note that I use a common technique in SQL queries called aliasing. Aliasing is simply a way to reference a table or sub query, usually within a join. In the above answer aliases allow an easy way to give a shorthand name to a table, e.g. plots d. Sub queries also require an alias if you want to use them in a join.

All we’ve done so far is execute the same sorts of queries that can easily be made with a GUI. Now let’s try leveraging the power of scripted queries. Imagine you want to know how many rodents were found every other year. To get this we’ll get the range of dates from the database, sequence them by two and make new queries.

yearRange <- dbGetQuery(conn,"SELECT min(year),max(year) FROM surveys")
years <- seq(yearRange[,1],yearRange[,2],by=2)

Next we’ll build our query string using the paste() function.

q <- paste("
SELECT a.year,b.taxa,count(*) as count
FROM surveys a
JOIN species b
ON a.species_id = b.species_id
AND b.taxa = 'Rodent'
AND a.year in (",
paste(years,collapse=",")
,")
GROUP BY a.year, b.taxa",
sep = "" )
rCount <- dbGetQuery(conn,q)
head(rCount)

With the nested paste commands we were able to construct a query programatically, without having to type out all the years. This could also be done with a for loop, especially if the query to be constructed is more complicated.

Building your workflow

Up until now we been working with an existing data. However we can use R as a way to build up databases from existing flat files. We’ll use the flat files that make up the mammals database to recreate it. First let’s read in the files.

species <- read.csv("data/species.csv")
surveys <- read.csv("data/surveys.csv")
plots <- read.csv("data/plots.csv")

We have some data now. Next we’ll need to create a database and add our tables.

myDB <- "portalR.db"
myConn <- dbConnect(drv = SQLite(), dbname= myDB)
dbListTables(myConn)

By opening up a connection, you’ve created a new database. If you list the tables you’ll see that the database is empty. Next we’ll add the dataframes we just read in to the database.

dbWriteTable(myConn,"species",species)
dbListTables(myConn)
dbGetQuery(myConn,"SELECT * from species limit 10")

If you check the location of your database you’ll see that data is automatically being written to disk. Not only does R and RSQLite provide easy ways to query existing databases, it also allows you to easily create your own databases from flat files.

Challenge

Add the remaining tables to the existing database. Query your database from SQLite manager just to verify that you’re adding data.

Extra Challenge

Run some of your queries from earlier in the lesson to verify that you have faithfully recreated the mammals database.