Morpheu5 - 16 days ago
R Question

How do I sample single (random) rows that can be grouped by a column's values?

Here is a sample of the data

p <- structure(list(name = structure(1:5, .Label = c("Alice", "Bob",
"Charlie", "Dennis", "Earl"), class = "factor"), cohort = structure(c(3L,
3L, 2L, 2L, 1L), .Label = c("X", "Y", "Z"), class = "factor"),
group = structure(c(1L, 1L, 2L, 2L, 1L), .Label = c("A",
"B"), class = "factor"), var = c(1L, 2L, 1L, 3L, 4L)), .Names = c("name",
"cohort", "group", "var"), class = "data.frame", row.names = c(NA,
-5L))


that looks like

     name cohort group var
1   Alice      Z     A   1
2     Bob      Z     A   2
3 Charlie      Y     B   1
4  Dennis      Y     B   3
5    Earl      X     A   4


and I need something like the following, based on the cohort column. I need to sample one row in each cohort (possibly randomly) so that I don't have multiple people belonging to the same cohort.

     name cohort group var
2     Bob      Z     A   2
3 Charlie      Y     B   1
5    Earl      X     A   4


I can group_by cohort, but then I'm not sure how to proceed to create a new data frame with only the rows that I need.

Answer

You can group by cohort and pipe the result to sample_n, where 1 indicates that you want one randomly chosen row per group:

library(dplyr)

p %>% group_by(cohort) %>% sample_n(1)

Source: local data frame [3 x 4]
Groups: cohort [3]

name cohort  group   var
(fctr) (fctr) (fctr) (int)
1   Earl      X      A     4
2 Dennis      Y      B     3
3  Alice      Z      A     1

Second run (the rows can differ, because the sampling is random):

 name cohort  group   var
 (fctr) (fctr) (fctr) (int)
 1    Earl      X      A     4
 2 Charlie      Y      B     1
 3     Bob      Z      A     2
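
Since the two runs above return different rows, it may help to fix the random seed when the result needs to be reproducible. The following is a minimal sketch, not part of the original answer: it rebuilds the question's data frame with plain character columns instead of factors, uses an arbitrary seed of 42, and assumes dplyr 1.0.0 or later, where slice_sample(n = 1) is the newer counterpart of sample_n(1). The names one_per_cohort and one_per_cohort_base are made up for illustration. A base R variant is included for the case where dplyr is not available.

```r
library(dplyr)

# Same toy data as in the question, kept as plain character vectors
p <- data.frame(
  name   = c("Alice", "Bob", "Charlie", "Dennis", "Earl"),
  cohort = c("Z", "Z", "Y", "Y", "X"),
  group  = c("A", "A", "B", "B", "A"),
  var    = c(1L, 2L, 1L, 3L, 4L),
  stringsAsFactors = FALSE
)

# Arbitrary seed (an assumption), only so that reruns pick the same rows
set.seed(42)

# dplyr: one random row per cohort, then drop the grouping
one_per_cohort <- p %>%
  group_by(cohort) %>%
  slice_sample(n = 1) %>%  # assumes dplyr >= 1.0.0; use sample_n(1) on older versions
  ungroup()

# Base R alternative: split by cohort and draw one row index from each piece
pieces <- split(p, p$cohort)
one_per_cohort_base <- do.call(
  rbind,
  lapply(pieces, function(d) d[sample(nrow(d), 1), , drop = FALSE])
)
```

Both results should contain exactly one row for each of the cohorts X, Y and Z. Calling ungroup() at the end returns an ordinary, ungrouped data frame, which avoids surprises if further dplyr verbs are applied later.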