sven b - 1 year ago 40

R Question

Given a data frame with two columns:

- length: the length of each element
- findLengthOf: a comma-separated string of row indices whose lengths are needed

For each row, one has to look up the lengths at all indices listed in the second column and put the result in a third column.

Please see the example below, where we look up the length at index 1637 and obtain 1835:

`> df$length[1637]`

[1] 1835

head(df)

length findLengthOf

1 6434 1637,386....

2 4272 4322,414....

3 7338 2052,639....

4 4932 190,1567....

5 2397 8963,844....

6 4405 103,4346....

Desired result:

head(df)

length findLengthOf result

1 6434 1637,386.... 1835, 2404, 4689

2 4272 4322,414.... 1184, 2721, 7215

3 7338 2052,639.... 5253, 2998, 6153

4 4932 190,1567.... 2931, 6496, 7784

5 2397 8963,844.... 3796, 3488, 6555

6 4405 103,4346.... 1662, 5481, 1244

set.seed(123)

df <- data.frame(length = sample(1e4),

findLengthOf = I(replicate(1e4, paste(sample(1:10000,1),sample(1:10000,1),sample(1:10000,1),sep=","), simplify = FALSE)))

df$result=lapply(lapply(df$findLengthOf,strsplit,split=","), function(x){df[x[[1]],"length"]})

The code works, but it takes too long. How can I improve the speed?

Also, why does

`head(lapply(df$findLengthOf,strsplit,split=","))`

always return this weird list of lists with:

`[[1]]`

[[1]][[1]]

[1] "7744" "1346" "4626"

Is there a way to avoid these double brackets?

Any response is greatly appreciated!

Following David's suggestion, I set fixed=T, but the elapsed time is essentially unchanged:

`> ptm <- proc.time()`

> df$result=lapply(lapply(df$findLengthOf,strsplit,split=",",fixed=T), function(x){df[x[[1]],"length"]})

> proc.time() - ptm

user system elapsed

17.220 0.000 17.147

> ptm <- proc.time()

> df$result=lapply(lapply(df$findLengthOf,strsplit,split=","), function(x){df[x[[1]],"length"]})

> proc.time() - ptm

user system elapsed

17.260 0.000 17.142

Answer

Here's a fully vectorized solution, though it may be memory-expensive. I haven't tested its performance.

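```
library(data.table)
# setDT() converts df to a data.table by reference; tstrsplit() splits each
# string into integer columns, which are flattened and looked up in one call
res <- matrix(df$length[unlist(setDT(df)[,
         tstrsplit(findLengthOf, ",", fixed = TRUE, type.convert = TRUE)])],
       nrow = nrow(df))
# One list element (a vector of lengths) per original row
df$result <- as.list(as.data.frame(t(res)))
```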
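
For comparison, here is a base R sketch of the same idea. It has not been benchmarked and is not part of the answer above. It assumes every string holds the same number of comma-separated indices (three in the question's example), and the toy data and the names `parts`, `n_per_row`, `idx` and `lens` are made up for illustration. It also touches on the double-bracket question: `findLengthOf` is a list column, so `lapply` calls `strsplit` once per element, and each call returns its own one-element list. Calling `strsplit` once on the unlisted column returns a flat list instead.

```r
# Small stand-in for the question's data (values are made up for illustration)
df <- data.frame(length = c(10L, 20L, 30L, 40L),
                 findLengthOf = I(list("2,3,4", "1,1,2", "4,3,1", "3,2,2")))

# One strsplit() call on a character vector returns a flat list:
# one character vector per row, with no extra nesting level
parts <- strsplit(unlist(df$findLengthOf), ",", fixed = TRUE)

# Assumption: every row lists the same number of indices
n_per_row <- length(parts[[1]])

# Convert all indices at once and do a single vectorized lookup
idx <- as.integer(unlist(parts))
lens <- df$length[idx]

# Regroup the flat result into one vector of lengths per row
df$result <- unname(split(lens, rep(seq_len(nrow(df)), each = n_per_row)))
```

If rows can list different numbers of indices, the grouping vector could instead be built with `rep(seq_along(parts), lengths(parts))`.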