vahab - 4 months ago 13

R Question

Let's say I have a data frame with two columns for now:

`df <- data.frame(scores_set1 = c(32,45,65,96,45,23,23,14), scores_set2 = c(32,40,60,98,21,23,21,63))`

I want to randomly select some rows

`selected_indeces<- sample(c(1:8), 4, replace = FALSE)`

Now I want to add up the values at the `selected_indeces` cumulatively. I create a matrix to hold the results:

`cumulative_loss<-matrix(rep(NA,8*2),nrow=8,ncol=2)`

and then use one loop over the columns and another over the selected indices:

`for (s in 1:ncol(df)) #for each column`

{

for (i in 1:length(selected_indeces)) #for each randomly selected index

{

if (i==1)

{

cumulative_loss[i,s]<- df[selected_indeces[i],s]

}

if (i > 1)

{

cumulative_loss[i,s]<- df[selected_indeces[i],s] +

df[selected_indeces[i-1],s]

}

}

}

The script runs, although it may be a naive way of doing this. The problem is that when `i == 4` it only adds the values of the fourth and third selections, whereas I want it to add the first, second, third and fourth random selections and return the total.

Answer

Here's a way to do this with `data.table` (taking into account your comment on @bgoldst's answer):

```
library(data.table); setDT(df)
#sample 4 elements of each column (i.e., every element of .SD), then cumsum them
df[ , lapply(.SD, function(x) cumsum(sample(x, 4)))]
```
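Note that the snippet above draws a fresh sample inside each `lapply` call, so every column gets its own rows. If you instead want the same rows for every column, as in the question, here is a minimal base R sketch. The name `df_base` and the seed value are my own choices for illustration (a separate name is used so the `data.table` version of `df` is left alone):

```r
# Plain data frame copy of the question's data (hypothetical name `df_base`)
df_base <- data.frame(
  scores_set1 = c(32, 45, 65, 96, 45, 23, 23, 14),
  scores_set2 = c(32, 40, 60, 98, 21, 23, 21, 63)
)

set.seed(42)  # arbitrary seed, only so the draw is repeatable
selected_indeces <- sample(nrow(df_base), 4)

# Take the selected rows once, then compute a running total down each column
cumulative_loss <- sapply(df_base[selected_indeces, ], cumsum)
cumulative_loss
```

The last row of `cumulative_loss` should then equal the column totals of the selected rows.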

If you want to use different indices for each column, I would pre-choose them first:

```
set.seed(1023)
idx <- lapply(integer(ncol(df)), function(...) sample(nrow(df), 4))
idx
# [[1]] #indices for column 1
# [1] 2 8 6 3
#
# [[2]] #indices for column 2
# [1] 4 8 5 1
```

Then modify the above slightly:

```
df[ , lapply( seq_along(.SD), function(jj) cumsum(.SD[[jj]][ idx[[jj]] ]) )]
```

This is the craziest compendium of brackets/parentheses I've ever written in a functional line of code, so I guess it makes sense to break things down a bit:

- `seq_along(.SD)` picks out the *index number* of each column, `jj`.
- `.SD[[jj]]` selects the `jj`th column, `idx[[jj]]` selects the indices for that column, and `.SD[[jj]][idx[[jj]]]` picks the `idx[[jj]]` rows of the `jj`th column; this picks the same values as `.SD[idx[[jj]], jj, with = FALSE]`, which returns them as a one-column `data.table` rather than a vector.
- Lastly, we `cumsum` the `length(idx[[jj]])` rows we chose for column `jj`.
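To make the breakdown concrete, here is the same sequence unrolled one step at a time. This is only a sketch in base R: a plain data frame (hypothetical name `df_base`) stands in for `.SD`, and the indices are hard-coded from the printed `idx` above (as `idx_fixed`, also my own name) so that it does not depend on the random number generator.

```r
# Plain data frame standing in for .SD (hypothetical name `df_base`)
df_base <- data.frame(
  scores_set1 = c(32, 45, 65, 96, 45, 23, 23, 14),
  scores_set2 = c(32, 40, 60, 98, 21, 23, 21, 63)
)

# Indices copied from the printed `idx`, hard-coded for illustration
idx_fixed <- list(c(2L, 8L, 6L, 3L), c(4L, 8L, 5L, 1L))

jj <- 1L
column_jj <- df_base[[jj]]       # plays the role of .SD[[jj]]
rows_jj   <- idx_fixed[[jj]]     # plays the role of idx[[jj]]
picked    <- column_jj[rows_jj]  # .SD[[jj]][ idx[[jj]] ]
running   <- cumsum(picked)      # running total for column jj

# The same steps applied to every column at once
all_columns <- lapply(seq_along(df_base), function(jj) {
  cumsum(df_base[[jj]][idx_fixed[[jj]]])
})
```

Here `picked` holds the four sampled values of the first column and `running` their cumulative sum; `all_columns` is a list with one such vector per column.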

Result:

```
# V1 V2
# 1: 45 98
# 2: 59 161
# 3: 82 182
# 4: 147 214
```
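As for why the loop in the question only adds two values: at each step it adds the previous *selection* instead of the previous *running total*. If you'd rather keep the loop, a minimal sketch of one way to repair it is below. The name `df_loop` and the fixed indices are assumptions for illustration; in practice `selected_indeces` would come from `sample`.

```r
# Plain data frame copy of the question's data (hypothetical name `df_loop`)
df_loop <- data.frame(
  scores_set1 = c(32, 45, 65, 96, 45, 23, 23, 14),
  scores_set2 = c(32, 40, 60, 98, 21, 23, 21, 63)
)

# Fixed indices for illustration; normally sample(nrow(df_loop), 4)
selected_indeces <- c(2L, 8L, 6L, 3L)

# One row per selected index, one column per column of the data
cumulative_loss <- matrix(NA_real_,
                          nrow = length(selected_indeces),
                          ncol = ncol(df_loop))

for (s in seq_len(ncol(df_loop))) {          # for each column
  for (i in seq_along(selected_indeces)) {   # for each selected index
    if (i == 1) {
      cumulative_loss[i, s] <- df_loop[selected_indeces[i], s]
    } else {
      # add the current value to the previous running total
      cumulative_loss[i, s] <- cumulative_loss[i - 1, s] +
        df_loop[selected_indeces[i], s]
    }
  }
}
cumulative_loss
```

This should give the same numbers as `cumsum` on the selected rows, so the vectorised versions are usually the simpler choice.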