Makaroni Makaroni - 2 months ago 9
R Question

Select half of the dataframe given values in one column

I want to select half of the dataframe given values in one of the columns. In other words, from the dataframe given below I need to extract half of rows given values in column Y:

DF:
id1 column Y value
9830 A 6
7609 A 0
9925 B 0
9922 B 5
9916 B 6
9917 B 8
9914 C 2
9914 C 7
9914 C 7
9914 C 2
9914 C 9


New data frame should look like this:

NEW DF:
id1 column Y value
9830 A 6
9925 B 0
9922 B 5
9914 C 2
9914 C 7


Also, it would be helpful to know solution for selecting random half of all rows datefram DF given column Y (e.g. not selecting first 50%).

Any help is appreciated.
Thanks!

Answer

Assuming you want the first half of each group of rows with the same value for column Y where for odd number of rows we round down, we can use filter from dplyr:

library(dplyr)
df %>% group_by(`column Y`) %>% filter(row_number() <= floor(n()/2))
##Source: local data frame [5 x 3]
##Groups: column Y [3]
##
##    id1 column Y laclen
##  <int>   <fctr>  <int>
##1  9830        A      6
##2  9925        B      0
##3  9922        B      5
##4  9914        C      2
##5  9914        C      7

We first group_by column Y (note the back quotes since the column name contains space) and then use filter to keep only the rows for which the row_number is less than or equal to the total number of rows given by n() for the group divided by 2 (and rounded down with floor).


To select a random 50% of the rows in each group, use sample to generate the row numbers to keep and %in% to match those to keep:

set.seed(123)
result <- df %>% group_by(`column Y`) %>% filter(row_number() %in% sample(seq_len(n()),floor(n()/2)))
##Source: local data frame [5 x 3]
##Groups: column Y [3]
##
##    id1 column Y laclen
##  <int>   <fctr>  <int>
##1  9830        A      6
##2  9922        B      5
##3  9917        B      8
##4  9914        C      2
##5  9914        C      9