Eric Chang - 2 months ago
R Question

How to use fread() as readLines() without auto column detection?

I have a 5 GB .dat file (more than 10 million lines). Each line has a format like

aaaa bb cccc0123 xxx kkkkkkkkkkkkkk
or
aaaaabbbcccc01234xxxkkkkkkkkkkkkkk
for example. Because readLines() performs poorly on large files, I chose fread() to read it, but an error occurred:

library("data.table")
x <- fread("test.DAT")
Error in fread("test.DAT") :
Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.DAT") :
Unable to find 5 lines with expected number of columns (+ middle)


How can I use fread() like readLines(), without automatic column detection? Or is there another way to solve this problem?

Answer

Here's a trick: use a sep value that you know is not in the file. That forces fread() to read each whole line as a single column. We can then extract that column as an atomic vector (shown as [[1L]] below). Here's an example on a CSV file where I use ? as the sep. This way it behaves much like readLines(), only a lot faster.

f <- fread("Batting.csv", sep= "?", header = FALSE)[[1L]]
head(f)
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"       
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"  
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,," 
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
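As a minimal sketch of the same trick on data shaped like the question's, the following writes a small temporary file and reads it back line by line. The temporary file and its two lines are made up for illustration. colClasses = "character" is added here as a precaution so that lines consisting only of digits are not converted to numbers; it is not part of the one-liner above.

```r
library(data.table)

# Hypothetical sample file with the two line layouts from the question
tmp <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk"), tmp)

# "?" does not occur in the file, so each line is read as a single field
dt <- fread(tmp, sep = "?", header = FALSE, colClasses = "character")

# If the separator had occurred in the data, the lines would have been
# split (or fread would have failed), so one column is a cheap sanity check
stopifnot(ncol(dt) == 1L)

lines <- dt[[1L]]                          # plain character vector
same <- identical(lines, readLines(tmp))   # compare with base R on the same file
```

One caveat, as far as I know: fread() strips leading and trailing whitespace from fields by default (strip.white = TRUE), so lines that begin or end with spaces may differ from what readLines() returns unless that option is changed.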

Other uncommon characters you can try as sep include \ ^ @ # =. It's just a matter of finding a sep value that is not present in the file. We can see that this produces the same output as readLines():

head(readLines("Batting.csv"))
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"                                  
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"                             
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"                            
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"                           
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
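One way to look for a sep value that is not present in the file is to test a few candidates against a sample of lines. The sketch below uses only base R; pick_sep, its candidate list, and the sample size are hypothetical names and defaults chosen for illustration, and the temporary file is made up.

```r
# Hypothetical helper: return the first candidate separator that does not
# occur in the first `n` lines of `path` (a sample, not the whole file)
pick_sep <- function(path,
                     candidates = c("?", "^", "@", "#", "=", "\\"),
                     n = 10000L) {
  sample_lines <- readLines(path, n = n, warn = FALSE)
  for (s in candidates) {
    # fixed = TRUE treats the candidate as a literal string, not a regex
    if (!any(grepl(s, sample_lines, fixed = TRUE))) return(s)
  }
  stop("every candidate separator occurs in the sample")
}

# Made-up file in which "?" does occur, so it must be skipped
tmp <- tempfile(fileext = ".txt")
writeLines(c("id?note", "1?what", "2?ok"), tmp)

sep <- pick_sep(tmp)   # "?" is rejected; "^" is the next free candidate
```

The result can then be passed on as fread(tmp, sep = sep, header = FALSE)[[1L]]. Because only a sample is scanned, it is still worth checking that the result has exactly one column. Recent data.table releases also appear to document sep = "" (or NULL) as meaning "no separator", which reads each line as a single column; check ?fread for your installed version before relying on it.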