First Database in R

The number of simulations I am now running in R consumes all the memory on my workstation causing it to freeze. While I would like to execute the simulations in smaller batches, I need to keep a record of the intermediate data generated which is currently being stored in arrays. One possible solution would be to generate many CSV files to store the intermediate data and read from those but reassembling the data for audit purposes would not be practical. After some excellent suggestions on Stack Overflow, I narrowed my options down to SQLite and MySQL databases with a preference for SQLite due to simplicity. Not being a command line purist, I thought the next step was to find a GUI for setting up the database. While SQLite Database Browser seemed to be the only FOSS GUI available, it requires the Qt library to run and I prefer not to install additional components on my Mac when possible. MySQL does have a FOSS GUI called MySQL Workbench and Thomas Henlich created a plugin for exporting to SQLite but it turns out that SQLite is easy enough to use without a GUI even if you have no prior database experience. The easiest way to get a handle on it is to use a known data set and learn how to send and retrieve it from SQLite using R code. I highly recommend trying to run the database from disk rather than memory since it provides a closer approximation to the problem you are likely to be solving.

Here’s a short example of reading a data set that is currently stored in a csv format, creating a SQLite database, writing the data into a table, and retrieving the information. Make sure that the RSQLite package is installed and active.

# Read data
dataset=read.csv("/yourdirectory/yourfile.csv",skip=1,header=TRUE)
# Create DB
testdb=dbConnect(SQLite(), dbname = "/yourdirectory/testdb.dat")
# Write the data set into a table called table1
dbWriteTable(testdb,"table1",dataset)
# List the tables available in the database
dbListTables(testdb)
# Remove the original data to free up memory
rm(dataset)
# Retrieve the entire table
dbReadTable(testdb,"table1")
# Retrieve the single column called series from the table
buf<-dbSendQuery(testdb,"select series from table1")
testing<-fetch(buf,n=-1); dbClearResult(buf)
dbDisconnect(testdb); rm(buf,testdb)

SQL queries generate a result that is stored in a buffer. Fetch gets data from the buffer and directs it to the variable within R. Before doing another SQL query on the same data, you need to clear the buffer using dbClearResult. This process gets to be a pain when making many queries but the SQLiteDF package has an excellent method for treating SQLite database tables as data frames. I also recommend reading about the TSSQLite package if you are using time series data and work with zoo objects.

Author: Probable Pattern

Former Marine and Curious Critter