Johnny Johansson - 1 year ago
R Question

Summing values in a row of a dataframe - execution time

So I have two data frames, both with the same structure:

```
V1  V2  V3  V4  C
 0   1   1   0  -1
 0   0   1   0  -1
 2   0   0   0   1
 2   0   0   0   1
 1   0   0   0   1
 2   0   0   0   1
```

The V1-V4 columns are of integer type; the C column is a factor with two levels.
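
For reference, a data frame with this structure can be reconstructed from the sample above (a small sketch; the real frames are of course much larger):

```r
# Rebuild the sample shown above: four integer columns V1-V4
# and a two-level factor column C (values copied from the table).
df <- data.frame(
  V1 = as.integer(c(0, 0, 2, 2, 1, 2)),
  V2 = as.integer(c(1, 0, 0, 0, 0, 0)),
  V3 = as.integer(c(1, 1, 0, 0, 0, 0)),
  V4 = as.integer(c(0, 0, 0, 0, 0, 0)),
  C  = factor(c(-1, -1, 1, 1, 1, 1))
)
str(df)  # V1-V4 show as int, C as Factor w/ 2 levels
```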
The data frames have different sizes: the first one has ~50,000 rows, the other ~600,000 rows. I wrote a simple function that divides each element of a row by the sum of the elements in that row:

```
library(plyr)  # for create_progress_bar()

SimpleFunction <- function(dataset) {
  progress.bar <- create_progress_bar("text")
  progress.bar$init(nrow(dataset))
  for (i in 1:nrow(dataset)) {
    row.sum <- sum(dataset[i, 1:4])
    dataset[i, 1] <- dataset[i, 1] / row.sum
    dataset[i, 2] <- dataset[i, 2] / row.sum
    dataset[i, 3] <- dataset[i, 3] / row.sum
    dataset[i, 4] <- dataset[i, 4] / row.sum
    progress.bar$step()
  }
  return(dataset)
}
```
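
The timing below can be reproduced along these lines (the synthetic data and the reduced row count are assumptions for a quick run; the loop body is the same as in `SimpleFunction` above, minus the progress bar, so this snippet needs no extra packages):

```r
# Self-contained timing sketch of the row-by-row loop.
simple_loop <- function(dataset) {
  for (i in 1:nrow(dataset)) {
    row.sum <- sum(dataset[i, 1:4])
    dataset[i, 1] <- dataset[i, 1] / row.sum
    dataset[i, 2] <- dataset[i, 2] / row.sum
    dataset[i, 3] <- dataset[i, 3] / row.sum
    dataset[i, 4] <- dataset[i, 4] / row.sum
  }
  dataset
}

set.seed(42)
n <- 2000  # reduced from the 50,000 / 600,000 rows in the question
dataset <- data.frame(V1 = sample(1:2, n, replace = TRUE),  # >= 1 avoids all-zero rows
                      V2 = sample(0:2, n, replace = TRUE),
                      V3 = sample(0:2, n, replace = TRUE),
                      V4 = sample(0:2, n, replace = TRUE),
                      C  = factor(sample(c(-1, 1), n, replace = TRUE)))
print(system.time(result <- simple_loop(dataset)))
```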

Now I tested the execution time of this function with `system.time`: for the 50,000-row data frame it was ~45 sec, but for the 600,000-row data frame it was taking extremely long (around 2 minutes per 1%, measured with the simple text progress bar from the `plyr` package). My question is: why? The only thing that has changed is the number of rows; the structure of the data frame is identical. Shouldn't the growth be linear, i.e. 50,000 rows - 45 sec, 600,000 rows - 540 sec?
I can simply split the large data frame, run the function on each fragment, and then merge the fragments back together, but I really do not understand why this is happening.
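
The split-and-merge workaround described above can be sketched like this (`normalize_rows`, `process_in_chunks`, and the chunk size are hypothetical names and values chosen for illustration; the worker here uses the vectorized normalization rather than the slow loop):

```r
# Per-chunk worker: divide V1-V4 in each row by that row's sum.
normalize_rows <- function(d) {
  d[, 1:4] <- d[, 1:4] / rowSums(d[, 1:4])
  d
}

# Split a large data frame into row chunks, process each, merge back.
process_in_chunks <- function(dataset, chunk.size = 50000) {
  chunk.id <- ceiling(seq_len(nrow(dataset)) / chunk.size)
  chunks <- split(dataset, chunk.id)               # fragments of <= chunk.size rows
  do.call(rbind, lapply(chunks, normalize_rows))   # merge them back together
}
```

With a vectorized worker like this the whole frame can be processed in one call anyway, so the chunking mainly matters if the row-by-row loop has to be kept.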

```
row.sum <- rowSums(dataset[, 1:4])
dataset[, 1:4] <- dataset[, 1:4] / row.sum
```