sven b - 4 months ago
R Question

Performance Issue while operating among two columns of a dataframe

Given a dataframe with two columns:


  • length (length of elements)

  • findLengthOf (a comma-separated string of row indices whose lengths are needed)



So one has to look up the length for every index listed in the second column and put the results in a third column.
See the example below, where we look up the length for index 1637 and obtain 1835:

> df$length[1637]
[1] 1835


head(df)
  length findLengthOf
1   6434 1637,386....
2   4272 4322,414....
3   7338 2052,639....
4   4932 190,1567....
5   2397 8963,844....
6   4405 103,4346....

head(df)
  length findLengthOf           result
1   6434 1637,386.... 1835, 2404, 4689
2   4272 4322,414.... 1184, 2721, 7215
3   7338 2052,639.... 5253, 2998, 6153
4   4932 190,1567.... 2931, 6496, 7784
5   2397 8963,844.... 3796, 3488, 6555
6   4405 103,4346.... 1662, 5481, 1244

set.seed(123)
df <- data.frame(length = sample(1e4),
findLengthOf = I(replicate(1e4, paste(sample(1:10000,1),sample(1:10000,1),sample(1:10000,1),sep=","), simplify = FALSE)))

df$result=lapply(lapply(df$findLengthOf,strsplit,split=","), function(x){df[x[[1]],"length"]})


The code works, but it takes too long. How can I improve the speed?
Also, why does

head(lapply(df$findLengthOf,strsplit,split=","))


always return this weird list of lists with:

[[1]]
[[1]][[1]]
[1] "7744" "1346" "4626"


Is there a way to avoid these double brackets?
Any response is greatly appreciated!
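
On the double brackets: strsplit() is itself vectorised and returns a list with one element per input string. Calling it once per element through lapply() therefore wraps each result in an extra list level. A small sketch of the difference, using a made-up two-element list `x` in place of df$findLengthOf:

```r
# Toy input mirroring the shape of df$findLengthOf (values are made up)
x <- list("7744,1346,4626", "12,5,9")

# One strsplit() call per element: each call returns a list of length 1,
# so the overall result is a list of lists
nested <- lapply(x, strsplit, split = ",")
str(nested[[1]])

# One strsplit() call on the whole character vector: a flat list with
# one character vector per input string
flat <- strsplit(unlist(x), split = ",", fixed = TRUE)
flat[[1]]
```

With the flat form, each element can be used directly, without the extra `[[1]]`.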

Suggestion from David (set fixed=T):

> ptm <- proc.time()
> df$result=lapply(lapply(df$findLengthOf,strsplit,split=",",fixed=T), function(x){df[x[[1]],"length"]})
> proc.time() - ptm
user system elapsed
17.220 0.000 17.147
> ptm <- proc.time()
> df$result=lapply(lapply(df$findLengthOf,strsplit,split=","), function(x){df[x[[1]],"length"]})
> proc.time() - ptm
user system elapsed
17.260 0.000 17.142

Answer

Here's a fully vectorized solution, though it is possibly memory expensive. I haven't tested its performance.

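library(data.table)
# Split all strings at once into integer columns, flatten them, and index
# df$length in a single vectorised lookup (setDT() converts df by reference)
res <- matrix(df$length[unlist(setDT(df)[, 
              tstrsplit(findLengthOf, ",", fixed = TRUE, type.convert = TRUE)])],
              nrow = nrow(df))
# Store each row of the matrix as one list element of the new column
df$result <- as.list(as.data.frame(t(res)))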
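
A base R variant of the same idea might look like the sketch below. I haven't benchmarked it either. It assumes that a likely cost in the original code is `df[x[[1]], "length"]`: a character row index on a data frame is matched against the row names, whereas here the indices are converted to integers and used on the plain `df$length` vector. The smaller `n` and the way the index strings are generated are only assumptions, made to keep the example quick.

```r
# Rebuild data shaped like the question's example (smaller n is an assumption)
set.seed(123)
n <- 1000
df <- data.frame(
  length = sample(n),
  findLengthOf = I(replicate(
    n,
    paste(sample(n, 3, replace = TRUE), collapse = ","),
    simplify = FALSE
  ))
)

# Split every string in a single vectorised call; the result is a flat list
parts <- strsplit(unlist(df$findLengthOf), ",", fixed = TRUE)

# Look up lengths by integer position in the plain vector, not via df[...]
df$result <- lapply(parts, function(p) df$length[as.integer(p)])

head(df)
```

Because each string is split only once and the lookup never touches row names, this also avoids the nested `[[1]][[1]]` structure from the question.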