Denis Efimov - 2 months ago 5x
R Question

# R: deleting rows for factor levels with a predetermined frequency of occurrence

I have a data set containing several variables. One of the variables is a factor that holds the group labels: A, B, C, etc. The remaining variables are numeric.

``````
> data1
   Group Value
1      A    23
2      A    25
3      B     1
4      C    15
5      C    11
6      C    14
7      B     3
8      B     4
9      B     2
10     C    19
``````
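To make the example reproducible, the data frame printed above can be rebuilt roughly as follows. This is only a sketch: it assumes `Group` is stored as a factor and `Value` as numeric, as the question describes, since the actual column types are not shown.

```r
# Rebuild the example data. Storing Group as a factor is an assumption
# based on the description in the question.
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

# Frequency of each group
table(data1$Group)
```

Here `table(data1$Group)` counts 2 rows for A and 4 rows each for B and C, so A is the only group that occurs no more than twice.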

For further statistical calculations I want to exclude from the data set the rows that belong to a particular group (e.g., X), provided that the group occurs in the data frame a given number of times (e.g., 2 times or fewer).

The materials I have seen so far mainly cover deleting rows with specific values and do not deal with how often a group (factor level) occurs in the data frame. Maybe I'm wrong? Sorry!

To remove specific rows in the "manual" mode, I use the following code:

``````
# Drop all rows of group "A", then rebuild each factor so the unused level disappears
data1 <- as.data.frame(
  lapply(subset(data1, Group != "A"),
         function(x) if (is.factor(x)) factor(x) else x)
)
``````

I would like to automate this process and exclude every factor level (group) that occurs a predetermined number of times. The desired result:

``````
> data1
  Group Value
1     B     1
2     C    15
3     C    11
4     C    14
5     B     3
6     B     4
7     B     2
8     C    19
``````

We can use `data.table`. Convert the 'data.frame' to a 'data.table' with `setDT(data1)`, group by 'Group', and, `if` the number of rows in a group is greater than 2 (`.N > 2`), return the Subset of Data.table (`.SD`) for that group.

``````
library(data.table)
setDT(data1)[, if (.N > 2) .SD, by = Group]
``````

Or with `dplyr`: after grouping by 'Group', `filter` keeps the groups whose number of rows (`n()`) is greater than 2.

``````
library(dplyr)
data1 %>%
  group_by(Group) %>%
  filter(n() > 2)
``````

Or using `base R`: get the frequency of each 'Group' with `table`, then use `%in%` inside `subset` to keep only the groups that occur more than 2 times.

``````
tbl <- table(data1$Group)
subset(data1, Group %in% names(tbl)[tbl > 2])
``````
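Note that `subset` keeps the now-empty level "A" in the factor, whereas the manual code in the question rebuilds the factors to drop it. A minimal base R sketch that wraps the filtering, makes the threshold a parameter, and drops unused levels could look like this. The helper name `keep_frequent` and its arguments `group_col` and `min_n` are hypothetical and not part of the original answer; the data is rebuilt under the assumption that `Group` is a factor.

```r
# Hypothetical helper (not from the original answer): keep only groups that
# occur more than `min_n` times, then drop the factor levels left empty.
keep_frequent <- function(df, group_col = "Group", min_n = 2) {
  # ave() returns, for every row, the size of the group that row belongs to
  counts <- ave(seq_len(nrow(df)), df[[group_col]], FUN = length)
  out <- df[counts > min_n, , drop = FALSE]
  rownames(out) <- NULL  # renumber the rows from 1, as in the desired output
  droplevels(out)        # remove levels such as "A" that no longer occur
}

# Example data from the question (Group as a factor is an assumption)
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

result <- keep_frequent(data1, min_n = 2)
levels(result$Group)
```

Because the rows are filtered with a logical index rather than regrouped, the original row order is preserved, and changing `min_n` adjusts the cutoff without touching the rest of the code.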