skan skan - 4 months ago 28
R Question

R data.table: Counting rows with fread without reading the whole file

I want to use data.table to process a very big file.
It doesn't fit on memory.
I've thought on reading the file on chunks using a loop with (increasing properly the skip parameter).

fread("myfile.csv", skip=loopindex, nrows=chunksize)

processing each of this chunks and appending the resulting output with fwrite.

In order to do it properly I need to know the total number of rows, without reading the whole file.

What's the proper/faster way to do it?

I can ony think in reading only the first column but maybe there is an special command or trick.
or maybe there is an automatic way to detect the end of the file...



1) count.fields Not sure if count.fields reads the whole file into R at once. Try it to see if it works.

length(count.fields("myfile.csv", sep = ","))

If the file has a header subtract one from the above.

2) sqldf Another possibility is:

read.csv.sql("myfile.csv", sep = ",", sql = "select count(*) from file")

You may need other arguments as well depending on header, etc. Note that this does not read the file into R.

3) wc Use the system command wc which should be available on all platforms that R runs on.

shell("wc -l myfile.csv")

Again, if there is a header subtract one.

If you are on Windows be sure that Rtools is installed and:

shell("C:\\Rtools\\bin\\wc -l myfile.csv")

Alternately on Windows without Rtools try this:

shell('find /v /c "" myfile.csv')

See How to count no of lines in text file and store the value into a variable using batch script?