Eran Moshe - 11 months ago
R Question

split data by "column" with an aggregated condition

Consider the following data.frame:

> head(dtrain)
  content_id item_age   item_ctr likes clicks no_clicks event
1   11201926   461540 0.02787456     1     24       837     0
2   11201926   462497 0.02784223     1     24       838     0
3   11201926   473215 0.02780997     1     24       839     0
4   11201926   532983 0.02777778     1     24       840     0
5   11201926   536696 0.02774566     1     24       841     0
6   11201926   545545 0.02771363     1     24       842     0


I want to split the data by content_id, which only requires the following command:

result <- split(dtrain , f = dtrain$content_id )


But then I want to keep only the rows of dtrain whose content_id appears at least 1000 times in dtrain.

In the end, I will have the data split by content_id, where each split has at least 1000 rows (that is the aggregated condition).

Answer Source

You can first filter your data frame using dplyr to retain only those content groups with 1000 or more records:

library(dplyr)

# %>% must end a line, not start one, or R stops parsing after "dtrain"
temp <- dtrain %>%
    group_by(content_id) %>%
    filter(n() >= 1000)

and then continue as you were:

result <- split(temp, f=temp$content_id)
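
If you would rather not load dplyr, the same idea can be sketched in base R: split first, then drop the pieces that are too small with Filter(). The toy dtrain and the min_rows threshold below are made up for illustration (the question uses 1000 and a data frame with more columns), so treat this as a sketch rather than a drop-in solution.

```r
# Hypothetical stand-in for dtrain; the real data has more columns
dtrain <- data.frame(
  content_id = c(rep(1L, 5), rep(2L, 2), rep(3L, 4)),
  clicks     = seq_len(11)
)
min_rows <- 3  # assumed small threshold for the toy data; the question uses 1000

# Split by content_id, then keep only the groups with enough rows
all_groups <- split(dtrain, f = dtrain$content_id)
result     <- Filter(function(d) nrow(d) >= min_rows, all_groups)

# Row count per surviving group
print(sapply(result, nrow))
```

Filter() keeps the list names, so result is still indexed by content_id. On a very large data frame, filtering before splitting (as in the dplyr version) avoids building the small groups that are thrown away afterwards.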