Trotte Boman Trotte Boman - 3 years ago 51
R Question

Choosing different amount of elements from each group in R

I am working on the Kaggle Instacart competition, but I am quite new to R and have run into something I can not figure out.

I have a dataset with 4 columns. The first column is an order ID (id1). The second column is a product ID (id2). The third column is the probability that I want select the product id2 from the order id1 which we can consider just as a ranking, so a higher probability is always selected over a smaller probability. Finally, the fourth column is the amount of products I want to select from the given order (a feature of the order). So for example, I have here the first 12 rows of the dataframe df:

id1 id2 prob num
1 17 13107 0.4756982 3
2 17 21463 0.3724126 3
3 17 38777 0.3534422 3
4 17 21709 0.3364623 3
5 17 47766 0.3364623 3
6 17 39275 0.3165896 3
7 34 16083 0.4093785 4
8 34 39475 0.3892882 4
9 34 47766 0.3892882 4
10 34 2596 0.3837562 4
11 34 21137 0.3762758 4
12 34 47792 0.3737032 4


We can see that from the id1 = 17 I want to choose 3 elements, and for id1 = 34 I want to choose 4 elements. The result should then be

ID1 ID2
17 13107, 21463, 38777
34 16083, 39475, 47766, 2596


or something similar to this.

At the moment I have tried using

df %>% group_by(id1) %>% top_n(n = num)


but I get the error

Selecting by num
Error in is_scalar_integerish(n) : object 'num' not found


Anyone know how I would go about doing this?

Thanks

Answer Source

You can pipe the grouped data directly into a summarise statement:

df %>% group_by(id1) %>% summarise(id2 = toString(id2[seq_len(first(num))]))
## A tibble: 2 x 2
#    id1                       id2
#  <int>                     <chr>
#1    17       13107, 21463, 38777
#2    34 16083, 39475, 47766, 2596

In this statement, the id2[seq_len(first(num))] is used to extract the first num per group, create a sequence from 1 to the num and that sequence is used to subset the first X id2 values.

The toString creates a string per id1 group.


Here's another base R option using aggregate:

aggregate(id2 ~ id1, FUN=toString, subset(df, ave(id1, id1, FUN=seq_along) <= num))
#  id1                       id2
#1  17       13107, 21463, 38777
#2  34 16083, 39475, 47766, 2596
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download