Ramkrishna Ramkrishna - 1 day ago 4
R Question

Find DIFFERENCE OF TWO COLOUMNS in a CSV and store in THIRD coloumn in CHUNKS

I have a large CSV file(greater than 6GB). Preview of the file is given below:

ID,NUM,MMSRATE,SMSRATE,DATARATE
1,0100000109,623,233,331
2,0200000109,515,413,314
3,0600000109,611,266,662
4,0700000109,729,490,927
5,0800000109,843,637,736
6,0600000109,578,367,875


I want to find the difference of MMSRATE AND SMSRATE and store it in a new coloumn PDRATE in the same csv file. Preview is as given below:

ID,NUM,MMSRATE,SMSRATE,DATARATE,PDRATE
1,0100000109,623,233,333,390
2,0200000109,515,413,314,102


I have about 1 Million rows. So i want to read the rows in chunks of (say 20000), perform the difference operation, then write it into a output CSV file, then read the next 20000 rows , perform operations on that and write it into the output CSV file and so on...

I wrote a code for reading the rows in chunks(of 2 for the case of simplicity), but i am not able to perform the difference of the 2 coloumns within my script. The code is given below:

chunk_size <- 2
con <- file("input.csv", open = "r")
data_frame <- read.csv(con,nrows = chunk_size,quote="",header = TRUE,)
header <- names(data_frame)
print(header)
print(data_frame)
if(nrow(data_frame) == chunk_size) {
repeat {
data_frame <- read.csv(con,nrows = chunk_size, header = FALSE, quote="")
names(data_frame)<-c(header)
print(header)
print(data_frame)
if(nrow(data_frame) < chunk_size) {
break
}
}
}

close(con)


I'm fairly new to Rscript. Any HELP would be much appreciated. I'm running R Studio IDE in Windows.

NOTE: The leading zero in NUM coloumn should be retained in the output CSV FILE. Please note that it is a requirement that i work on CHUNKS of rows from the CSV and not on the CSV file as a whole.

Answer

If I get it right, first create your output file then do the changes and write the result in output, appending at each iteration.

chunk_size <- 2
con  <- file("input.csv", open = "r")
data_frame <- read.csv(con,nrows = chunk_size,quote="",header = TRUE,)
header <- names(data_frame)

outfile="out.csv"
data_frame$PDRATE <- data_frame$MMSRATE - data_frame$SMSRATE
write.csv(data_frame,outfile,row.names=FALSE)

if(nrow(data_frame) == chunk_size) {
 repeat {
   data_frame <- read.csv(con,nrows = chunk_size, header = FALSE, quote="")
   names(data_frame)<-c(header)

   data_frame$PDRATE <- data_frame$MMSRATE - data_frame$SMSRATE
   # note parameters, append=TRUE and col.names=FALSE
   write.table(data_frame,outfile,sep=",",append=TRUE,qmethod="double",col.names=FALSE,row.names=FALSE) 

   if(nrow(data_frame) < chunk_size) {
     break
   }
 }
}

close(con)

Understanding the code above is left as exercise :)

side note after comments: you can't append with write.csv, the documentation state:

Attempts to change append, col.names, sep, dec or qmethod are ignored, with a warning.

you have to use write.table and specify the separator, and qmethod to match write.csv defaults.

Comments