giacomoV - 3 months ago
R Question

R - sample and resample a person-period file

I am working with a very large person-period file, and I thought that
sampling and resampling would be a good way to deal with a dataset of this size.

My person-period file looks like this:

id code time
1 1 a 1
2 1 a 2
3 1 a 3
4 2 b 1
5 2 c 2
6 2 b 3
7 3 c 1
8 3 c 2
9 3 c 3
10 4 c 1
11 4 a 2
12 4 c 3
13 5 a 1
14 5 c 2
15 5 a 3


I actually have two distinct issues.

The first issue is that I am having trouble simply sampling a person-period file.

For example, I would like to sample 2 id-sequences such as:

id code time
1 a 1
1 a 2
1 a 3
2 b 1
2 c 2
2 b 3


The following line of code works for sampling a person-period file:

# draw from the unique ids so that two different ids are always selected
dt[dt$id %in% sample(unique(dt$id), 2), ]


However, I would like to use a dplyr solution because I am interested in resampling, and in particular I would like to use replicate.

I am interested in doing something like:
replicate(100, sample_n(dt, 2), simplify = FALSE)


I am struggling with the dplyr solution because I am not sure what the grouping variable should be.

library(dplyr)
dt %>% group_by(id) %>% sample_n(1)


gives me an incorrect result because it does not keep the full sequence of each id.

Any idea how I could both sample and resample a person-period file?

data

dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L,
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b",
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA,
-15L), class = "data.frame")

Answer

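I think the idiomatic way would probably look like this:

library(dplyr)
set.seed(1)
# sample 2 distinct ids, then join back to keep each id's full sequence
samp = dt %>% select(id) %>% distinct %>% sample_n(2)
left_join(samp, dt, by = "id")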

  id code time
1  2    b    1
2  2    c    2
3  2    b    3
4  5    a    1
5  5    c    2
6  5    a    3

This extends straightforwardly to more grouping variables and fancier sampling rules.
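
As one possible reading of "more grouping variables", the sketch below samples two distinct (id, code) combinations rather than two ids and joins on both columns. It rebuilds the question's dt in compact form. Treating code as a second key is only an illustration, not something the question asks for.

```r
library(dplyr)

# Compact rebuild of the question's example data.
dt <- data.frame(
  id   = factor(rep(1:5, each = 3)),
  code = factor(c("a", "a", "a", "b", "c", "b", "c", "c", "c",
                  "c", "a", "c", "a", "c", "a")),
  time = factor(rep(1:3, times = 5))
)

set.seed(1)
# Sample 2 distinct (id, code) pairs instead of 2 ids.
samp <- dt %>% distinct(id, code) %>% sample_n(2)

# Join on both keys to recover every row belonging to the sampled pairs.
res <- left_join(samp, dt, by = c("id", "code"))
res
```

In dplyr 1.0.0 and later, slice_sample(n = 2) is the newer spelling of sample_n(2).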


If you need to do this many times...

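nrep = 100   # number of replicates
ng   = 2     # number of ids drawn per replicate
# stack nrep copies of the distinct ids, label each copy with r,
# then sample ng ids within every copy
samps = dt %>% select(id) %>% distinct %>%
  slice(rep(1:n(), nrep)) %>% mutate(r = rep(1:nrep, each = n()/nrep)) %>%
  group_by(r) %>% sample_n(ng)
# join back to recover the full sequence of every sampled id
repdat = left_join(samps, dt, by = "id")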

# then do stuff with it (do_stuff is a placeholder for your own analysis):
repdat %>% group_by(r) %>% do_stuff
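
Since the question mentions replicate, here is a minimal sketch of that route, assuming a small helper that keeps the full sequence of each sampled id. The helper name sample_ids and the summary column share_a are hypothetical, and the data is a compact rebuild of the question's dt.

```r
library(dplyr)

# Compact rebuild of the question's example data.
dt <- data.frame(
  id   = factor(rep(1:5, each = 3)),
  code = factor(c("a", "a", "a", "b", "c", "b", "c", "c", "c",
                  "c", "a", "c", "a", "c", "a")),
  time = factor(rep(1:3, times = 5))
)

# Hypothetical helper: draw ng distinct ids and keep all of their rows.
sample_ids <- function(data, ng) {
  keep <- sample(unique(data$id), ng)
  data %>% filter(id %in% keep)
}

set.seed(1)
# One data frame per replicate, returned as a list.
reps <- replicate(100, sample_ids(dt, 2), simplify = FALSE)

# Stack the replicates; r holds the replicate index as a character column.
repdat <- bind_rows(reps, .id = "r")

# Example of per-replicate work: share of rows with code "a".
res <- repdat %>% group_by(r) %>% summarise(share_a = mean(code == "a"))
```

The list form is convenient when each replicate is processed separately. The stacked form matches the repdat layout used above.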