windsound windsound - 2 months ago 36
SQL Question

How to deal with a 50GB large csv file in r language?

I am relatively new in the "large data process" in r here, hope to look for some advise about how to deal with 50 GB csv file. The current problem is following:

Table is looked like:

ID,Address,City,States,... (50 more fields of characteristics of a house)
1,1,1st street,Chicago,IL,...
# the first 1 is caused by write.csv, they created an index raw in the file


I would like to find all rows that is belonging San Francisco, CA. It supposed to be an easy problem, but the csv is too large.

I know I have two ways of doing it in R and another way to use database to handle it:

(1) Using R's ffdf packages:

since last time the file is saved, it was using write.csv and it contains all different types.

all <- read.csv.ffdf(
file="<path of large file>",
sep = ",",
header=TRUE,
VERBOSE=TRUE,
first.rows=10000,
next.rows=50000,
)


the console gives me this:

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered,
: vmode 'character' not implemented


Searching through online, I found several answers which did not fit my case, and I can't really make sense of how to transfer "character" into "factor" type as they mentioned.

Then I tried using read.table.ffdf, this is even more disaster. I can't find a solid guide for that one.

(2) Using R's readline:

I know this is another good way, but can't find an effecient way to do this.

(3) Using SQL:

I am not sure how to transfer the file into SQL version, and how to handle this, if there is a good guide I would like to try. But in general, I would like to stick with R.

Thanks for reply and help!

Answer

You can use R with SQLite behind the curtains with the sqldf package. You'd use the read.csv.sql function in the sqldf package and then you can query the data however you want to obtain the smaller data frame.

The example from the docs:

library(sqldf)

iris2 <- read.csv.sql("iris.csv", 
    sql = "select * from file where Species = 'setosa' ")

I've used this library on VERY large CSV files with good results.

Comments