Johnny Johansson - 6 months ago 32

R Question

I have two data frames, both with the same structure:

```
V1 V2 V3 V4  C
 0  1  1  0 -1
 0  0  1  0 -1
 2  0  0  0  1
 2  0  0  0  1
 1  0  0  0  1
 2  0  0  0  1
```

The V1-V4 columns are integers; the C column is a factor with 2 levels.

The data frames have different sizes: the first has ~50 000 rows, the other ~600 000 rows. I wrote a simple function that divides each element of a row by the sum of the elements in that row:

```
library(plyr)  # provides create_progress_bar()

SimpleFunction <- function(dataset) {
  progress.bar <- create_progress_bar("text")
  progress.bar$init(nrow(dataset))
  for (i in 1:nrow(dataset)) {
    # Sum of the four numeric columns in row i
    row.sum <- sum(dataset[i,1:4])
    # Divide each of the four values by the row sum
    dataset[i,1] <- dataset[i,1] / row.sum
    dataset[i,2] <- dataset[i,2] / row.sum
    dataset[i,3] <- dataset[i,3] / row.sum
    dataset[i,4] <- dataset[i,4] / row.sum
    progress.bar$step()
  }
  return(dataset)
}
```

I timed this function with `system.time`. For the 50 000-row data frame it took ~45 sec, but for the 600 000-row data frame it was extremely slow: around 2 minutes per 1% of the rows, measured with the simple text progress bar from the `plyr` package. My question is: why? The only thing that changed is the number of rows; the structure of the data frame is identical. Shouldn't the growth be linear, i.e. 50 000 rows in 45 sec and 600 000 rows in 540 sec?

I can simply split the large data frame, run the function on each fragment and then merge the results back together, but I really do not understand why this is happening.
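One way to check whether the growth really is linear is to time the same row-by-row loop on data frames of increasing size. The following is a minimal sketch, not part of the original post: `LoopNormalize` and `MakeData` are hypothetical helper names, the data is randomly generated, and the sizes are kept small so it runs quickly. If doubling the number of rows roughly doubles the elapsed time, growth is linear; a clearly larger increase suggests the cost of each iteration depends on the size of the data frame.

```r
# Hypothetical helper: the row-by-row loop from the question, without the progress bar.
LoopNormalize <- function(dataset) {
  for (i in seq_len(nrow(dataset))) {
    row.sum <- sum(dataset[i, 1:4])
    dataset[i, 1] <- dataset[i, 1] / row.sum
    dataset[i, 2] <- dataset[i, 2] / row.sum
    dataset[i, 3] <- dataset[i, 3] / row.sum
    dataset[i, 4] <- dataset[i, 4] / row.sum
  }
  dataset
}

# Hypothetical helper: random data with the same column layout as in the question.
MakeData <- function(n) {
  data.frame(
    V1 = sample(0:2, n, replace = TRUE),
    V2 = sample(0:2, n, replace = TRUE),
    V3 = sample(0:2, n, replace = TRUE),
    V4 = sample(1:2, n, replace = TRUE),  # at least 1, so no row sums to zero
    C  = factor(sample(c(-1, 1), n, replace = TRUE))
  )
}

set.seed(1)
sizes <- c(1000, 2000, 4000)

# Elapsed wall-clock seconds for each size
elapsed <- sapply(sizes, function(n) {
  system.time(LoopNormalize(MakeData(n)))[["elapsed"]]
})

print(data.frame(rows = sizes, seconds = elapsed))
```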

Answer

You don't need a loop for this: R specialises in vectorized computations, and looping through the rows only increases processing time. Instead, you can compute the sum of every row at once with `rowSums` and divide each column by the result:

```
row.sum <- rowSums(dataset[,1:4])
dataset[,1] <- dataset[,1] / row.sum
dataset[,2] <- dataset[,2] / row.sum
dataset[,3] <- dataset[,3] / row.sum
dataset[,4] <- dataset[,4] / row.sum
```
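The same idea can be wrapped in a function and tried on a few rows. This is only a sketch under stated assumptions: `NormalizeRows` and its `cols` argument are hypothetical names, and the sample rows are adapted from the table in the question. It relies on the fact that dividing a data frame by a vector with one element per row applies that vector down each column.

```r
# Hypothetical helper name; 'cols' selects the numeric columns to normalize.
NormalizeRows <- function(dataset, cols = 1:4) {
  row.sum <- rowSums(dataset[, cols])           # one sum per row
  dataset[, cols] <- dataset[, cols] / row.sum  # divide every selected column at once
  dataset
}

# Small sample with the same layout as in the question
example <- data.frame(
  V1 = c(0L, 0L, 2L, 1L),
  V2 = c(1L, 0L, 0L, 0L),
  V3 = c(1L, 1L, 0L, 0L),
  V4 = c(0L, 0L, 0L, 0L),
  C  = factor(c(-1, -1, 1, 1))
)

normalized <- NormalizeRows(example)
print(normalized)
```

Note that a row whose four values sum to zero would produce `NaN`, since it divides 0 by 0. As for why the loop scales worse than linearly, a likely explanation is that each `dataset[i, j] <- value` assignment on a data frame can copy data whose size grows with the number of rows, so the cost per iteration is not constant; the vectorized version avoids those repeated assignments.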