ADV - 10 months ago 49

R Question

I have a dataframe of the form shown below. The cases have been pre-clustered into subgroups of varying populations, including singletons. I am trying to write some code that will sample (without replacement) any specified number of rows from the dataframe, but spread as evenly as possible across clusters.

    > testdata
       Cluster Name
    1        1    A
    2        1    B
    3        1    C
    4        2    D
    5        3    E
    6        3    F
    7        3    G
    8        3    H
    9        4    I
    10       5    J
    11       5    K
    12       5    L
    13       5    M
    14       5    N
    15       6    O
    16       7    P
    17       7    Q

For example, if I ask for a sample of 3 rows, I would like to pull one random row from each of 3 randomly chosen clusters (i.e. not the first rows of clusters 1-3 every time, though this is one valid outcome).

Acceptable examples:

    > testdata_subset
       Cluster Name
    1        1    A
    5        3    E
    12       5    L

    > testdata_subset
       Cluster Name
    6        3    F
    14       5    N
    15       6    O

Incorrect example:

    > testdata_subset
       Cluster Name
    6        3    F
    8        3    H
    13       5    M

The same idea applies up to a sample size of 7 in the example data shown (1 per cluster). For higher sample sizes, I would like to draw from each cluster evenly as far as possible, then evenly across the remaining clusters with unsampled rows, and so on, until the specified number of rows has been sampled.

I know how to sample N rows indiscriminately:

`testdata[sample(nrow(testdata), N),]`

But this pays no regard to the clusters. I used `ddply` (from the plyr package) to randomly sample N rows per cluster:

`ddply(testdata,"Cluster", function(z) z[sample(nrow(z), N),])`

But this fails as soon as you ask for more rows than there are in a cluster (i.e. if N > 1). I then added an if/else statement to begin to handle that:

`numsamp_per_cluster <- 2`

`ddply(testdata, "Cluster", function(z) if (numsamp_per_cluster > nrow(z)) {z[sample(nrow(z), nrow(z)),]} else {z[sample(nrow(z), numsamp_per_cluster),]})`

This effectively caps the sample size asked for to the size of each cluster. But in doing so, it loses control of the overall sample size. I am hoping (but starting to doubt) there is an elegant method using dplyr or similar package that can do this kind of semi-randomised sampling. Either way, I am struggling to tie these elements together and solve the problem.

Answer

The strategy: first, randomly assign an order to the rows inside each cluster; this value is stored in the `inside` variable below. Next, randomly order the first choices of all clusters, then the second choices, and so on (the `outside` variable). Finally, sort the data frame by `inside`, so that every cluster's first choice comes before any second choice, and break ties with `outside`. Something like this:

```
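set.seed(1)  # for reproducibility; the exact rows depend on the seed (and possibly the R version)
# Random rank of each row within its own cluster (1 = first pick from that cluster)
inside <- ave(seq_along(testdata$Cluster), testdata$Cluster, FUN = function(x) sample(length(x)))
# Random tie-break among rows that share the same within-cluster rank
outside <- ave(inside, inside, FUN = function(x) sample(seq_along(x)))
# Sort by rank first, then by the random tie-break
testdata[order(inside, outside), ]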
# Cluster Name
#10 5 J
#15 6 O
#4 2 D
#5 3 E
#9 4 I
#16 7 P
#1 1 A
#13 5 M
#3 1 C
#17 7 Q
#7 3 G
#6 3 F
#14 5 N
#2 1 B
#12 5 L
#8 3 H
#11 5 K
```

Now, selecting the first `n` rows of the resulting data.frame gives you the sample you are looking for.
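
If you need this more than once, the steps above could be wrapped in a small function. The following is only a sketch in base R: the name `sample_across_clusters` and its arguments are made up for illustration and are not part of any package, and `testdata` is rebuilt here from the question so that the snippet runs on its own.

```r
# Rebuild the example data from the question.
testdata <- data.frame(
  Cluster = rep(1:7, times = c(3, 1, 4, 1, 5, 1, 2)),
  Name = LETTERS[1:17]
)

# Hypothetical helper (an illustration, not an existing API): draw n rows
# from df, spread as evenly as possible across the groups in `cluster`.
sample_across_clusters <- function(df, cluster, n) {
  stopifnot(length(cluster) == nrow(df), n >= 0, n <= nrow(df))
  # Random rank of each row within its own cluster (1 = first pick).
  inside <- ave(seq_along(cluster), cluster,
                FUN = function(x) sample(length(x)))
  # Random tie-break among rows that share the same within-cluster rank.
  outside <- ave(inside, inside, FUN = function(x) sample(length(x)))
  # Keep the first n rows of the rank-then-tie-break ordering.
  df[head(order(inside, outside), n), , drop = FALSE]
}

set.seed(1)
picked <- sample_across_clusters(testdata, testdata$Cluster, 10)
table(picked$Cluster)  # rows drawn per cluster
```

With `n = 10` on the example data, every one of the 7 clusters contributes one row, and the remaining 3 rows come from three of the four clusters that have more than one row (1, 3, 5 and 7), so no cluster contributes more than two.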