Trotte Boman - 3 years ago 51
R Question

# Choosing different amount of elements from each group in R

I am working on the Kaggle Instacart competition, but I am quite new to R and have run into something I can not figure out.

I have a dataset with 4 columns. The first column is an order ID (id1). The second column is a product ID (id2). The third column is the probability that I want select the product id2 from the order id1 which we can consider just as a ranking, so a higher probability is always selected over a smaller probability. Finally, the fourth column is the amount of products I want to select from the given order (a feature of the order). So for example, I have here the first 12 rows of the dataframe df:

``````        id1        id2       prob       num
1        17      13107   0.4756982        3
2        17      21463   0.3724126        3
3        17      38777   0.3534422        3
4        17      21709   0.3364623        3
5        17      47766   0.3364623        3
6        17      39275   0.3165896        3
7        34      16083   0.4093785        4
8        34      39475   0.3892882        4
9        34      47766   0.3892882        4
10       34       2596   0.3837562        4
11       34      21137   0.3762758        4
12       34      47792   0.3737032        4
``````

We can see that from the id1 = 17 I want to choose 3 elements, and for id1 = 34 I want to choose 4 elements. The result should then be

``````ID1     ID2
17     13107, 21463, 38777
34     16083, 39475, 47766, 2596
``````

or something similar to this.

At the moment I have tried using

``````df %>% group_by(id1) %>% top_n(n = num)
``````

but I get the error

``````Selecting by num
``````

Anyone know how I would go about doing this?

Thanks

You can pipe the grouped data directly into a `summarise` statement:

``````df %>% group_by(id1) %>% summarise(id2 = toString(id2[seq_len(first(num))]))
## A tibble: 2 x 2
#    id1                       id2
#  <int>                     <chr>
#1    17       13107, 21463, 38777
#2    34 16083, 39475, 47766, 2596
``````

In this statement, the `id2[seq_len(first(num))]` is used to extract the first `num` per group, create a sequence from 1 to the `num` and that sequence is used to subset the first X `id2` values.

The `toString` creates a string per id1 group.

Here's another base R option using `aggregate`:

``````aggregate(id2 ~ id1, FUN=toString, subset(df, ave(id1, id1, FUN=seq_along) <= num))
#  id1                       id2
#1  17       13107, 21463, 38777
#2  34 16083, 39475, 47766, 2596
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download