Timing csv File Reading in R and Python

I am currently working with trade data organized in large csv files. I took this as an opportunity to learn the Pandas package in Python mostly for the HDF5 integration. As a sanity check, I decided to time three methods: read.csv in R base, fread in the data.table package in R, and read.csv in the Pandas package in Python. Let me first say that I am not a programmer. This test is not a definitive benchmark of the three methods. It is a benchmark of the “out-of-the-box” functionality available to the non-programmer trying to use code to get shit done. I used time.time() in Python and proc.time() in R to read 10,000,000 rows of mixed data types. Without further ado, here are the results: read.csv in R took 131.45 seconds, read.csv in Pandas took 27.741 seconds, and the winner was fread in the data.table package clocking in at 14.018 seconds. Unfortunately, fread needs more TLC in determining data types and missing values. Pandas just works right the first time. I also prefer the time manipulation in Pandas compared to the XTS package in R but not enough to deal with two languages in one project.

Author: Probable Pattern

Former Marine and Curious Critter

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s