Ordinary Least Squares

Let’s estimate a stock’s beta and alpha using the CAPM. First prepare the data by calculating stock returns and market returns. Subtract the risk-free rate from both and form a data.frame called “urdata” where ExRet is the excess return on the stock and MktRP is the risk premium on the market. Next, specify the formula for the estimated model and call it “m2”. Use the lm() function to estimate the model and assign the results to a variable called “ols”. Finally, use summary() to see the regression results.

m2=ExRet~MktRP
ols=lm(m2,data=urdata)
summary(ols)

If you are using a lot of fixed effects, try coef(summary(ols))[1:5,] to get the first five rows of the coefficient table. (coef(ols) by itself returns a plain named vector, so it has no rows to subset.)
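As a quick sketch with simulated data (the Industry column here is made up purely for illustration, not part of the CAPM example above):

```r
# Simulated CAPM data with a hypothetical Industry fixed effect
set.seed(1)
urdata=data.frame(MktRP=rnorm(500),
                  Industry=sample(letters[1:20],500,replace=TRUE))
urdata$ExRet=0.02+0.8*urdata$MktRP+rnorm(500)
ols=lm(ExRet~MktRP+factor(Industry),data=urdata)
coef(summary(ols))[1:5,]  # first 5 rows of the coefficient table
```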

Pastor and Stambaugh Liquidity

While there are many measures of liquidity, the metric in Pastor and Stambaugh (2003) is the standard in empirical asset pricing. The first thing we need is daily data. Academics can use the CRSP data set, but the idea can be replicated with free data. You will need Date, a unique stock identifier (I use CUSIP but ticker works just as well), Volume, Return (make sure to include dividends), and the market return (here I use the CRSP value-weighted index, but any broad value-weighted index should work). Assuming you’ve already formatted the date properly and everything is numeric rather than factor or string, we start with a panel I’ve named ‘crsp’. I also recommend setting up a parallel backend to make use of all your CPU cores. Here I use the parallel package, but if you have multiple machines, be sure to check out doRedis for a relatively simple way to build the poor man’s computer cluster.

require(data.table); require(zoo); # zoo provides as.yearmon()
require(doParallel); require(parallel); require(foreach);
registerDoParallel(detectCores()-1) # Keep 1 CPU core for the OS
crsp=fread(paste(path,'Data_Sets/WRDS_CRSP_LiquidityCalc_1993-2013.csv',sep=''),header=TRUE)
setnames(crsp,c('PERMNO','Date','Ticker','CUSIP','Price','Volume','Return','Shares','MktReturn'))
crsp=crsp[Price>5]; crsp[,MktCap:=1000*Shares*Price];
crsp=crsp[MktCap>1000000]
crsp[,Date:=as.Date(as.character(Date),format='%Y%m%d')];
crsp[,Month:=as.yearmon(Date)]; crsp[,Return:=as.numeric(Return)]
#### Create Metrics ##############################################
crsp[,Ret.e:=Return-MktReturn];
crsp[,SVolume:=Volume*sign(Ret.e)] # sign() handles Ret.e==0 without producing NaN
crsp[,L.Ret.e:=c(NA,head(Ret.e,-1)),by='CUSIP']
crsp[,L.SVolume:=c(NA,head(SVolume,-1)),by='CUSIP']
crsp=crsp[,list(Date,Month,CUSIP,Ret.e,L.Ret.e,L.SVolume)]
#### Eliminate months with insufficient observations #############
crsp[,Obs:=length(na.omit(L.Ret.e)),by=c("CUSIP","Month")]
crsp=crsp[Obs>10]
#### Run model on daily data for each month ######################
model=Ret.e~L.Ret.e+L.SVolume
gammas=foreach(cusip=unique(crsp[,CUSIP]),.combine='rbind') %dopar% {
 stock=crsp[CUSIP==cusip];
 stock[,Gamma:=coef(lm(model,data=.SD))[3],by=c('Month')]
 stock=unique(stock[,list(Month,CUSIP,Gamma)]); return(stock)
}
save(gammas,file=paste(path,'Panels/Liquidity_Panel.RData',sep=''))

VPIN in R

This code is an implementation of Volume Synchronized Probability of Informed Trading by Easley, Lopez de Prado, and O’Hara (2012), published in the Review of Financial Studies. Easley et al. argue that the CDF of VPIN indicates order flow toxicity, and this is their explanation of the 2010 flash crash. You can see the slides from my R/Finance 2014 presentation on the topic.

This version of the code is not particularly fast, and there are plenty of opportunities for a better programmer than I am to tune it up for speed.

#### VPIN calculation #########################################################
#install.packages('fasttime',repos='http://www.rforge.net/')
require(data.table); require(fasttime); require(plyr)
require(xts); require(zoo) # endpoints, period.apply, rollapply, na.fill
# Assuming TAQ data is arranged in 1 year stock csv files
stock=fread('/TAQ_data.csv'); stock=stock[,1:3,with=FALSE]
setnames(stock,colnames(stock),c('DateTime','Price','Volume'));
stock[,DateTime:=paste(paste(substr(DateTime,1,4),substr(DateTime,5,6),
    substr(DateTime,7,8),sep='-'),substr(DateTime,10,17))]
setkey(stock,DateTime);
stock[,DateTime:=fastPOSIXct(DateTime,tz='GMT')]
stock=as.xts(stock)
# Now we have an xts data frame called 'stock' with a DateTime index and... 
# two columns: Price and Volume
# Vbucket=Number of volume buckets in an average volume day (Vbucket=50)
VPIN=function(stock,Vbucket) {
  stock$dP1=diff(stock[,'Price'],lag=1,diff=1,na.pad=TRUE)
  ends=endpoints(stock,'minutes')
  timeDF=period.apply(stock[,'dP1'],INDEX=ends,FUN=sum)
  timeDF$Volume=period.apply(stock[,'Volume'],INDEX=ends,FUN=sum)
  Vbar=mean(period.apply(timeDF[,'Volume'],INDEX=endpoints(timeDF,'days'),
    FUN=sum))/Vbucket
  timeDF$Vfrac=timeDF[,'Volume']/Vbar
  timeDF$CumVfrac=cumsum(timeDF[,'Vfrac'])
  timeDF$Next=(timeDF[,'CumVfrac']-floor(timeDF[,'CumVfrac']))/timeDF[,'Vfrac']
  timeDF[timeDF[,'Next']<1,'Next']=0
  timeDF$Previous=lag(timeDF[,'dP1'])*lag(timeDF[,'Next'])
  timeDF$dP2=(1-timeDF[,'Next'])*timeDF[,'dP1'] + timeDF[,'Previous']
  timeDF$Vtick=floor(timeDF[,'CumVfrac'])
  timeDF[,'Vtick']=timeDF[,'Vtick']-diff(timeDF[,'Vtick']); timeDF[1,'Vtick']=0
  timeDF=as.data.frame(timeDF); timeDF[,'DateTime']=row.names(timeDF)
  timeDF=ddply(as.data.frame(timeDF),.(Vtick),last)
  timeDF=as.xts(timeDF[,c('Volume','dP2','Vtick')],
    order.by=fastPOSIXct(timeDF$DateTime,tz='GMT'))
  timeDF[1,'dP2']=0
  timeDF$sigma=rollapply(timeDF[,'dP2'],Vbucket,sd,fill=NA)
  timeDF$sigma=na.fill(timeDF$sigma,"extend")
  timeDF$Vbuy=Vbar*pnorm(timeDF[,'dP2']/timeDF[,'sigma'])
  timeDF$Vsell=Vbar-timeDF[,'Vbuy']
  timeDF$OI=abs(timeDF[,'Vsell']-timeDF[,'Vbuy'])
  timeDF$VPIN=rollapply(timeDF[,'OI'],Vbucket,sum)/(Vbar*Vbucket)
  timeDF=timeDF[,c('VPIN')]; return(timeDF)
}
out=VPIN(stock,50)
###############################################################################

Here is what the original file looks like:

1993-01-04 09:35:25,10.375,5300,40,0,,N
1993-01-04 09:36:49,10.375,25000,40,0,,N
1993-01-04 09:53:06,10.375,100,40,0,,N
1993-01-04 10:04:13,10.375,200,40,0,,N
1993-01-04 10:04:20,10.375,100,40,0,,N
1993-01-04 10:24:42,10.375,1000,40,0,,N
1993-01-04 10:25:19,10.375,600,40,0,,N
1993-01-04 11:31:04,10.5,10000,40,0,,N
1993-01-04 12:13:09,10.5,200,0,0,,M
1993-01-04 12:13:38,10.5,200,0,0,,M
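As a minimal, self-contained sketch, the first three comma-separated fields of rows like these map straight onto the DateTime, Price, and Volume columns the function expects. Here I parse with base as.POSIXct since the sample timestamps are already ISO-formatted; the substr() reshaping in the function above is only needed if your export writes the date differently:

```r
require(data.table)
# Two raw rows in the format shown above
raw=c('1993-01-04 09:35:25,10.375,5300,40,0,,N',
      '1993-01-04 09:36:49,10.375,25000,40,0,,N')
stock=fread(paste(raw,collapse='\n'),header=FALSE)[,1:3,with=FALSE]
setnames(stock,c('DateTime','Price','Volume'))
stock[,DateTime:=as.POSIXct(DateTime,tz='GMT')]
stock
```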

Emacs and ESS on OS X for Beginners

When I first started learning R, I used R Commander as my GUI since it worked with all my operating systems and was relatively easy to set up and use. After a few months, I came to appreciate the simplicity of the RGui that comes standard with an R installation. Now, I am ready for something more advanced and have decided to try ESS (Emacs Speaks Statistics). Since I do not have any prior experience with Emacs and have not invested any time creating configuration settings and learning keyboard shortcuts, my main goal was to get a very Mac-friendly Emacs version that just worked. OS X does come with Emacs installed, but I didn’t want to go through the hassle of setting up ESS with the existing Emacs. I decided to use Aquamacs because it looks and behaves the most like a modern OS X application and has ESS enabled when installed. Emacs is available through the terminal, but setting up and using ESS without menus does not provide an easier working environment than the RGui or R Commander.

Emacs uses what are called meta commands since it originated before toolbars. To execute a command by name, press the meta key followed by x and then the command itself (written M-x). On Aquamacs, the right option key is set up to be the meta key, so you can simply press <option> x before the command. You can also use the <esc> key as the meta key. To start R, press <option> x R <return> or <esc> x R <return>. You can also open an existing R script and click the R icon to start R. To customize ESS, select Customize Aquamacs -> Customize Group… from the Options menu and type ess at the prompt. You can also click the settings icon, but only after you have opened an R script. I like to have ESS start from the same directory every time, so I add my directory to the “ESS Directory” option and turn off the “ESS Ask for ESS Directory” option. Make sure to “Save for future sessions” before closing the buffer. I had already deleted R32, so I know that Emacs will default to R64, but I’m not sure how to work with both versions.

Update: I have now switched to RStudio since it is free, cross platform compatible, and simple to setup. I highly recommend it to any R user.

Two-Way Sorting

One of the popular ways to motivate a two-dimensional model is to double sort your data based on quantiles. This makes empirical asset pricing models exceedingly easy to understand. If I had already defined the breakpoints of the sort, I could just use a package like plyr to apply a function like mean to the data subsets. For my application, I want to create five portfolios based on one characteristic and then form five portfolios within each of the first five based on another characteristic. To generalize from a 5×5 sort, I include the parameter q to allow for quartiles or deciles. The x and y parameters are the column names of the characteristics you want to sort by. As you can see from the last line, I am sorting by CDS premium and liquidity premium. I use LaTeX for my publication needs, so I output the data frame to xtable. I also use the awesome foreach package to speed things up a tiny bit, but if you don’t use a parallel backend with foreach, just replace “%dopar%” with “%do%”. Happy sorting!

require(foreach); require(xtable)
doublesort=function(data,x,y,q) {
 breaks1=quantile(data[,x],probs=seq(0,1,1/q))
 # findInterval with rightmost.closed=TRUE keeps the max value in the top bucket
 bucket1=findInterval(data[,x],breaks1,rightmost.closed=TRUE)
 tiles=foreach(i=1:q,.combine=cbind) %dopar% {
  Q=data[bucket1==i,y]
  breaks2=quantile(Q,probs=seq(0,1,1/q))
  bucket2=findInterval(Q,breaks2,rightmost.closed=TRUE)
  foreach(j=1:q,.combine=rbind) %dopar%
   mean(Q[bucket2==j])
 }
 colnames(tiles)=paste(x,1:q)
 row.names(tiles)=paste(y,1:q)
 tiles
}
xtable(doublesort(data,"CDS","lp",5),digits=4)
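To sanity-check the logic, here is a base-R version of the same double sort run on simulated data (the column names CDS and lp mirror the call above, but the numbers are random):

```r
set.seed(42)
d=data.frame(CDS=runif(1000),lp=runif(1000))
q=5
# Outer sort: assign each row to a CDS quintile
b1=findInterval(d$CDS,quantile(d$CDS,seq(0,1,1/q)),rightmost.closed=TRUE)
tiles=sapply(1:q,function(i){
  Q=d$lp[b1==i]    # inner sort on lp within each CDS bucket
  b2=findInterval(Q,quantile(Q,seq(0,1,1/q)),rightmost.closed=TRUE)
  tapply(Q,b2,mean)
})
dim(tiles)  # a q-by-q matrix of conditional means
```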

Dashboards

Precious Metals Dashboard
This R script produces charts of Gold, Platinum, Palladium, and Silver.

# Charts precious metals prices in USD from Oanda
require(quantmod) # note: package names are case-sensitive
#fetch data
getSymbols(c("XAU/USD","XPT/USD","XPD/USD","XAG/USD"),src="oanda")
#setup the plot area, each TA is another graph
layout(matrix(1:4, nrow=2))
#charts to be included
chartSeries(XAUUSD,name="Gold (.oz) in USD", layout=NULL)
chartSeries(XPTUSD,name="Platinum (.oz) in USD", layout=NULL)
chartSeries(XPDUSD,name="Palladium (.oz) in USD", layout=NULL)
chartSeries(XAGUSD,name="Silver (.oz) in USD", layout=NULL)

Interest Rate Dashboard
Another R script that charts the last 36 months of the following interest rates: 1 year U.S. Treasury, 10 year TIPS, 30 year fixed mortgage, Prime, Moody’s Aaa Corporate, and Moody’s Baa Corporate.

library(quantmod)
getSymbols("DGS1",src="FRED") #1 Year US Treasury Daily Constant Maturity rate
getSymbols("DPRIME",src="FRED") #Daily Prime Loan Rate
getSymbols("DAAA",src="FRED") #Daily Moody's Aaa Corporate Bond Yield
getSymbols("DBAA",src="FRED") #Daily Moody's Baa Corporate Bond Yield
getSymbols("MORTG",src="FRED") #30 Year Conventional Mortgage Rate
getSymbols("FII10",src="FRED") #10 Year TIPS Constant Maturity
# Chart Data
layout(matrix(1:6, nrow=3))
chartSeries(last(DGS1,'36 months'),name="US Treasury 1 Year Constant Maturity",layout=NULL)
chartSeries(last(FII10,'36 months'),name="10 Year TIPS Constant Maturity",layout=NULL)
chartSeries(last(MORTG,'36 months'),name="30 Year Fixed Mortgage Rate",layout=NULL)
chartSeries(last(DPRIME,'36 months'),name="Prime Loan Rate",layout=NULL)
chartSeries(last(DAAA,'36 months'),name="Moody's Aaa Corporate Yield",layout=NULL)
chartSeries(last(DBAA,'36 months'),name="Moody's Baa Corporate Yield",layout=NULL)

Rolling Correlations
Chart rolling correlations over a 65-trading-day (roughly 3 month) window.

library(quantmod)
getSymbols(c("SPY","^DJI"),src="yahoo")
data=data.frame(SPY[,6],DJI[,6])
data=as.xts(data,order.by=as.Date(row.names(data),"%Y-%m-%d"))
c1=rollapply(data,65,cor,by.column=F)
Correlation=c1[,2]; chartSeries(Correlation)

First Database in R

The number of simulations I am now running in R consumes all the memory on my workstation causing it to freeze. While I would like to execute the simulations in smaller batches, I need to keep a record of the intermediate data generated which is currently being stored in arrays. One possible solution would be to generate many CSV files to store the intermediate data and read from those but reassembling the data for audit purposes would not be practical. After some excellent suggestions on Stack Overflow, I narrowed my options down to SQLite and MySQL databases with a preference for SQLite due to simplicity. Not being a command line purist, I thought the next step was to find a GUI for setting up the database. While SQLite Database Browser seemed to be the only FOSS GUI available, it requires the Qt library to run and I prefer not to install additional components on my Mac when possible. MySQL does have a FOSS GUI called MySQL Workbench and Thomas Henlich created a plugin for exporting to SQLite but it turns out that SQLite is easy enough to use without a GUI even if you have no prior database experience. The easiest way to get a handle on it is to use a known data set and learn how to send and retrieve it from SQLite using R code. I highly recommend trying to run the database from disk rather than memory since it provides a closer approximation to the problem you are likely to be solving.

Here’s a short example of reading a data set that is currently stored in a csv format, creating a SQLite database, writing the data into a table, and retrieving the information. Make sure that the RSQLite package is installed and active.

# Load the SQLite interface and read the data
library(RSQLite)
dataset=read.csv("/yourdirectory/yourfile.csv",skip=1,header=TRUE)
# Create DB
testdb=dbConnect(SQLite(), dbname = "/yourdirectory/testdb.dat")
# Write the data set into a table called table1
dbWriteTable(testdb,"table1",dataset)
# List the tables available in the database
dbListTables(testdb)
# Remove the original data to free up memory
rm(dataset)
# Retrieve the entire table
dbReadTable(testdb,"table1")
# Retrieve the single column called series from the table
buf<-dbSendQuery(testdb,"select series from table1")
testing<-fetch(buf,n=-1); dbClearResult(buf)
dbDisconnect(testdb); rm(buf,testdb)

SQL queries generate a result that is stored in a buffer. Fetch gets data from the buffer and directs it to the variable within R. Before doing another SQL query on the same data, you need to clear the buffer using dbClearResult. This process gets to be a pain when making many queries but the SQLiteDF package has an excellent method for treating SQLite database tables as data frames. I also recommend reading about the TSSQLite package if you are using time series data and work with zoo objects.
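When you only need a result once, dbGetQuery() wraps the send/fetch/clear cycle in a single call. A minimal sketch using a throwaway in-memory database (the table and column names here are made up for illustration):

```r
library(RSQLite)
con=dbConnect(SQLite(),dbname=":memory:")  # in-memory DB for illustration
dbWriteTable(con,"table1",data.frame(series=1:10))
# dbGetQuery = dbSendQuery + fetch + dbClearResult in one step
res=dbGetQuery(con,"select series from table1 where series > 5")
dbDisconnect(con)
res$series
```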

Introduction to Bootstrapping with R

Bootstrapping is a process where your original data is re-sampled to produce a distribution of a single statistic such as the mean. The Wikipedia article on the bootstrap has more background. Since time-series data often have some amount of dependency from one period to the next, we re-sample the original data in blocks. Assuming your data set has 100 observations and the block size is 6, we start at a random observation and sample the next 6 observations. We start at another random observation and sample another consecutive 6 observations, repeating this process until we have 100 observations, which is called a replicate. (When n/block size is not an integer, we get a shorter block at the end.) The statistic (in this case the mean) is calculated for each replicate. Multiple replicates are used to construct a distribution of the statistic so we can draw inferences from the data. Since the function tsboot only returns the result of bootstrapping the statistic, you will need to find the actual data in each replicate in order for others to reproduce your results. The original data is not stored in the output, but the index of the original data can be obtained using the boot.array command. Before you use your newly generated data, you will want to check whether the settings you used, such as the number of replicates and the block size, are adequate to capture the statistic of interest. A simple plot of the tsboot output will do the trick. For a more in-depth analysis of the proper block size, check out the b.star function in the np package and research by Politis and White (2004). When you tire of calculating basic statistics, tsboot will accept a custom function in place of a command such as mean to calculate the statistic. The December 2002 edition of R News has a good overview of the boot package.

require(boot)
# Simple example of bootstrapping mean from 100 random numbers
x <- rnorm(100)
y <- tsboot(x, mean, R=10, l=20, sim="fixed")
# histogram of R number of statistic (index = t of interest from tsboot output)
plot(y, index=1)
# matrix of the data index (column) occurring in bootstrap replicate (row)
y.array <- boot.array(y)
# Write to CSV
write.csv(y.array,"/yourdirectory/yourfile1.csv")
write.csv(x,"/yourdirectory/yourfile2.csv")
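As mentioned above, tsboot also accepts a custom function in place of a built-in statistic. A sketch bootstrapping a 10% trimmed mean (the function name and settings here are arbitrary):

```r
require(boot)
set.seed(7)
x <- rnorm(100)
# Custom statistic: 10% trimmed mean of each replicate
trimmed <- function(z) mean(z, trim=0.1)
y <- tsboot(x, trimmed, R=50, l=10, sim="fixed")
y$t0                             # statistic computed on the original series
quantile(y$t, c(0.025, 0.975))   # crude bootstrap percentile interval
```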

Timing csv File Reading in R and Python

I am currently working with trade data organized in large csv files. I took this as an opportunity to learn the Pandas package in Python, mostly for the HDF5 integration. As a sanity check, I decided to time three methods: read.csv in R base, fread in the data.table package in R, and read.csv in the Pandas package in Python. Let me first say that I am not a programmer. This test is not a definitive benchmark of the three methods. It is a benchmark of the “out-of-the-box” functionality available to the non-programmer trying to use code to get shit done. I used time.time() in Python and proc.time() in R to read 10,000,000 rows of mixed data types. Without further ado, here are the results: read.csv in R took 131.45 seconds, read.csv in Pandas took 27.741 seconds, and the winner was fread in the data.table package clocking in at 14.018 seconds. Unfortunately, fread needs more TLC in determining data types and missing values. Pandas just works right the first time. I also prefer the time manipulation in Pandas compared to the xts package in R, but not enough to deal with two languages in one project.
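For reference, the R side of the comparison boils down to a harness like this (shown on a small generated file so it runs anywhere; the original test used 10,000,000 rows, and absolute timings will differ by machine):

```r
require(data.table)
# Build a small csv of mixed types to stand in for the real trade data
tmp=tempfile(fileext=".csv")
write.csv(data.frame(a=1:100000,b=rnorm(100000),
                     c=sample(letters,100000,replace=TRUE)),tmp,row.names=FALSE)
t1=proc.time(); d1=read.csv(tmp); t1=proc.time()-t1
t2=proc.time(); d2=fread(tmp);    t2=proc.time()-t2
c(read.csv=t1[["elapsed"]],fread=t2[["elapsed"]])
```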

Graphing Fed Data

The quantmod package is great for grabbing various financial data series from online sources like Google, Yahoo, Oanda, and the Federal Reserve. It also has some nifty technical charts that look very nice but are rather difficult to modify. One such problem is charting multiple data series. In quantmod, you could do this:

library(quantmod)
variables=c("INDPRO","TCU"); names=c("Industrial Production","Capacity Utilization SA")
getSymbols(variables,src="FRED")
recent=120 #Set number of most recent observations
chartSeries(last(get(variables[1]),recent),name=names[1],minor.ticks=FALSE,theme="white",yrange=c(70,100))
addTA(last(get(variables[2]),recent),on=1)

It works but doesn’t really give you a result that you would want to publish. While there are probably ways to clean this up and make it presentable, it starts to become more work than it is worth.

The ggplot2 package does not really take advantage of xts and zoo class objects but it is fairly simple to transform them into a dataframe to use with ggplot. The same exercise as above can be done with greater control and what I would consider to be better quality graphics in ggplot.

library(quantmod); library(ggplot2); library(reshape2) # melt() comes from reshape2
variables=c("INDPRO","TCU"); names=c("Industrial Production","Capacity Utilization SA")
getSymbols(variables,src="FRED")
recent=120 #Set number of most recent observations
temp=as.data.frame(merge.xts(last(get(variables[1]),recent),last(get(variables[2]),recent)))
temp$Date=as.Date(row.names(temp)); temp=melt(temp,id="Date",measure=c("TCU","INDPRO"))
qplot(Date,value,data=temp,geom="line",colour=variable,ylab="Index")

Is it longer? Yes. Is it worth the extra effort? I certainly think so.