Johnny Johansson - 23 days ago
R Question

Summing value in a row of dataframe - execution time

So I have two data frames, both with the same structure:

V1 V2 V3 V4 C
0 1 1 0 -1
0 0 1 0 -1
2 0 0 0 1
2 0 0 0 1
1 0 0 0 1
2 0 0 0 1


The V1-V4 columns are integer type; the C column is a factor with 2 levels.
The data frames have different sizes: the first one has ~50,000 rows, the other ~600,000 rows. I wrote a simple function that divides each element of a row by the sum of the elements in that row:

library(plyr)  # provides create_progress_bar()

SimpleFunction <- function(dataset) {
  progress.bar <- create_progress_bar("text")
  progress.bar$init(nrow(dataset))
  for (i in 1:nrow(dataset)) {
    row.sum <- sum(dataset[i, 1:4])
    dataset[i, 1] <- dataset[i, 1] / row.sum
    dataset[i, 2] <- dataset[i, 2] / row.sum
    dataset[i, 3] <- dataset[i, 3] / row.sum
    dataset[i, 4] <- dataset[i, 4] / row.sum
    progress.bar$step()
  }
  return(dataset)
}


I measured the execution time of this function with "system.time". For the 50,000-row data frame it was ~45 seconds, but on the 600,000-row data frame it was taking extremely long (around 2 minutes for 1%, measured with the simple progress bar from the "plyr" package). My question is: why? The only thing that changed is the number of rows; the structure of the data frame is identical. Shouldn't the growth be linear, i.e. 50,000 rows - 45 sec, 600,000 rows - 540 sec?
I could simply split the large data frame, run the function on each fragment, and then merge the pieces back together, but I really do not understand why this is happening.
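For reference, this is roughly how I checked the scaling; a minimal sketch, assuming the 600,000-row data frame is called big.df (the name and the subset sizes are just for illustration):

# Time SimpleFunction on increasingly large slices of big.df
# to see how the runtime grows with the number of rows
for (n in c(10000, 50000, 100000)) {
  t <- system.time(SimpleFunction(big.df[1:n, ]))
  cat(n, "rows:", t["elapsed"], "sec\n")
}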

Answer

You don't need a loop for this; R specialises in vectorized computations, and looping over the rows one at a time only increases processing time. Instead, compute the sum of every row in one call and divide each column by it:

row.sum <- rowSums(dataset[, 1:4])  # one sum per row, computed in a single vectorized pass
dataset[, 1] <- dataset[, 1] / row.sum
dataset[, 2] <- dataset[, 2] / row.sum
dataset[, 3] <- dataset[, 3] / row.sum
dataset[, 4] <- dataset[, 4] / row.sum
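
Equivalently, because the division recycles row.sum down each column, all four columns can be updated in a single step. A minimal sketch on a toy data frame (the values below are made up purely for illustration):

# Toy data frame mimicking the structure described in the question
dataset <- data.frame(V1 = c(0, 2, 1), V2 = c(1, 0, 0),
                      V3 = c(1, 0, 0), V4 = c(0, 0, 0),
                      C  = factor(c(-1, 1, 1)))

# Divide every V column by its row sum in one vectorized operation
dataset[, 1:4] <- dataset[, 1:4] / rowSums(dataset[, 1:4])
dataset

Either form replaces the per-row loop entirely, so there is no repeated row-by-row indexing and assignment on the data frame.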