Makaroni - 8 months ago 35

R Question

I want to select half of the dataframe given values in one of the columns. In other words, from the dataframe given below I need to extract half of rows given values in column Y:

`DF:`

id1 column Y value

9830 A 6

7609 A 0

9925 B 0

9922 B 5

9916 B 6

9917 B 8

9914 C 2

9914 C 7

9914 C 7

9914 C 2

9914 C 9

New data frame should look like this:

`NEW DF:`

id1 column Y value

9830 A 6

9925 B 0

9922 B 5

9914 C 2

9914 C 7

Also, it would be helpful to know solution for selecting random half of all rows datefram DF given column Y (e.g. not selecting first 50%).

Any help is appreciated.

Thanks!

Answer

Assuming you want the first half of each group of rows with the same value for `column Y`

where for odd number of rows we round down, we can use `filter`

from `dplyr`

:

```
library(dplyr)
df %>% group_by(`column Y`) %>% filter(row_number() <= floor(n()/2))
##Source: local data frame [5 x 3]
##Groups: column Y [3]
##
## id1 column Y laclen
## <int> <fctr> <int>
##1 9830 A 6
##2 9925 B 0
##3 9922 B 5
##4 9914 C 2
##5 9914 C 7
```

We first `group_by`

`column Y`

(note the back quotes since the column name contains space) and then use `filter`

to keep only the rows for which the `row_number`

is less than or equal to the total number of rows given by `n()`

for the group divided by `2`

(and rounded down with `floor`

).

To select a random 50% of the rows in each group, use `sample`

to generate the row numbers to keep and `%in%`

to match those to keep:

```
set.seed(123)
result <- df %>% group_by(`column Y`) %>% filter(row_number() %in% sample(seq_len(n()),floor(n()/2)))
##Source: local data frame [5 x 3]
##Groups: column Y [3]
##
## id1 column Y laclen
## <int> <fctr> <int>
##1 9830 A 6
##2 9922 B 5
##3 9917 B 8
##4 9914 C 2
##5 9914 C 9
```

Source (Stackoverflow)