I have very large tables (30 million rows) that I would like to load as data frames in R.
datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))
df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:
readr (on CRAN from April 2015). This works much like
fread above. The README in the link explains the difference between the two functions (
readr currently claims to be "1.5-2x slower" than
Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.)
read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page.
MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its
dplyr allows you to work directly with data stored in several types of database.
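For the flat-file readers in the list above, a minimal sketch looks like the following. It assumes the data.table and readr packages are installed; the file contents are made up for illustration, and read_tsv is just one of readr's readers, chosen here because the question's file is tab-separated.

```r
# Hypothetical sample file standing in for the real 30-million-row table.
path <- tempfile(fileext = ".tsv")
sample_data <- data.frame(
  url        = c("http://example.com/a", "http://example.com/b", "http://example.com/c"),
  popularity = c(10, 50, 75),
  mintime    = c(1, 2, 3),
  maxtime    = c(4, 5, 6),
  stringsAsFactors = FALSE
)
write.table(sample_data, path, sep = "\t", quote = FALSE, row.names = FALSE)

# data.table::fread detects column types itself and returns a data.table.
dt <- data.table::fread(path, sep = "\t")

# readr::read_tsv returns a tibble; "cddd" means one character column
# followed by three double columns.
tb <- readr::read_tsv(path, col_types = "cddd")
```

Both results behave like data frames; call as.data.frame() on either if a plain data frame is required.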
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
nrows=the number of records in your data (
Set comment.char="" to turn off interpretation of comments.
Explicitly define the classes of each column using
multi.line=FALSE may also improve performance in scan.
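Put together, these settings might look like the sketch below. The sample file is hypothetical, and some argument names are my assumptions rather than taken from the text above: colClasses (for declaring column classes in read.table), nmax (the scan counterpart of nrows), and quote="" (which assumes no field is quoted).

```r
# Hypothetical tab-separated file with a known number of records.
path <- tempfile(fileext = ".tsv")
n <- 1000
sample_data <- data.frame(
  url        = sprintf("http://example.com/page%d", seq_len(n)),
  popularity = seq_len(n),
  mintime    = seq_len(n) * 2,
  maxtime    = seq_len(n) * 3,
  stringsAsFactors = FALSE
)
write.table(sample_data, path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

df <- read.table(
  path,
  sep = "\t",
  header = FALSE,
  col.names  = c("url", "popularity", "mintime", "maxtime"),
  colClasses = c("character", "numeric", "numeric", "numeric"),  # skip type guessing
  nrows = n,           # known record count
  comment.char = "",   # do not look for '#' comments
  quote = ""           # assumption: no quoted fields in the file
)

# The scan version: nmax limits the records read, and multi.line = FALSE
# requires each record to sit on a single line.
datalist <- scan(path, sep = "\t",
                 what = list(url = "", popularity = 0, mintime = 0, maxtime = 0),
                 nmax = n, multi.line = FALSE, quiet = TRUE)
```

If the exact row count is unknown, a mild overestimate for nrows is still useful, since it only sets an upper bound on what is read.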
If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
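One way to filter before reading is to let a command-line tool drop unwanted rows and read its output through a pipe. This is only a sketch: it assumes a Unix-like system with awk on the PATH, and the file contents and the popularity threshold of 50 are invented for illustration.

```r
# Hypothetical tab-separated file without a header row.
path <- tempfile(fileext = ".tsv")
sample_data <- data.frame(
  url        = sprintf("http://example.com/page%d", 1:5),
  popularity = c(10, 50, 75, 20, 90),
  mintime    = 1:5,
  maxtime    = 6:10,
  stringsAsFactors = FALSE
)
write.table(sample_data, path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# awk keeps only the rows whose second field (popularity) is at least 50,
# so R never has to parse the rest.
cmd <- paste("awk -F'\\t' '$2 >= 50'", shQuote(path))

filtered <- read.table(
  pipe(cmd),
  sep = "\t",
  col.names  = c("url", "popularity", "mintime", "maxtime"),
  colClasses = c("character", "numeric", "numeric", "numeric"),
  comment.char = "",
  quote = ""
)
```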
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with
saveRDS, then next time you can retrieve it faster with
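The read-once, reload-later idea can be sketched as follows. readRDS is the base R counterpart to saveRDS and is my assumption for the retrieval step; the data frame and the file path are placeholders.

```r
# Stand-in for a data frame that was slow to parse from text.
df <- data.frame(
  url        = sprintf("http://example.com/page%d", 1:5),
  popularity = c(10, 50, 75, 20, 90),
  stringsAsFactors = FALSE
)

# Save it once in R's binary serialization format.
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)

# In later sessions, load the binary file instead of re-parsing the text file.
df_again <- readRDS(rds_path)
```

Column classes and attributes come back as they were saved, so none of the read.table tuning is needed on reload.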