Denis Efimov - 1 year ago
R Question

R: deleting rows whose factor level occurs with a given frequency, and automatically updating the factor levels

I have a data set containing several variables. One of the variables is a factor that holds the group labels (A, B, C, etc.). The remaining variables are numeric.

> data1
Group Value
1 A 23
2 A 25
3 B 1
4 C 15
5 C 11
6 C 14
7 B 3
8 B 4
9 B 2
10 C 19
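
For reproducibility, here is a minimal sketch of how the example data above could be rebuilt. It assumes that Group is stored as a factor and Value as a numeric column, which matches the description but is not shown in the printed output.

```r
# Rebuild the example data frame shown above (Group assumed to be a factor)
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

# Frequency of each group: A occurs 2 times, B and C occur 4 times each
table(data1$Group)
```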

For further statistical calculations I want to exclude from the data set the rows that belong to a particular group (e.g., X), provided that the group occurs in the data frame no more than n times (e.g., no more than 2 times).

The materials that I've seen so far mainly concern deleting rows with specific values and are not related to the frequency of occurrence of a group (factor level) in the data frame. Maybe I'm wrong? Sorry!

To remove specific rows manually, I use the following code:

# Drop group "A", then rebuild each factor so that unused levels disappear
data1 <- as.data.frame(
  lapply(subset(data1, Group != "A"),
         function(x) if (is.factor(x)) factor(x) else x))

I would like to automate this process and exclude all factor levels (groups) that occur with a predetermined frequency, so that the result looks like this:

> data1
Group Value
1 B 1
2 C 15
3 C 11
4 C 14
5 B 3
6 B 4
7 B 2
8 C 19


Mr. 'Akrun' suggested the following code:

tbl <- table(data1$Group)
data1 <- subset(data1, Group %in% names(tbl)[tbl>2])

This is exactly what I needed, and I thank him for it!
However, the factor levels in the result remain unchanged. To correct this, I have to add the line:

data1$Group = factor(data1$Group)

Surely there is a ready-made solution that handles this case?

Answer

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(data1); then, grouped by 'Group', if the number of rows in a group is greater than 2 (.N > 2), return the Subset of Data.table (.SD) for that group.

library(data.table)
setDT(data1)[, if (.N > 2) .SD, by = Group]

Or with dplyr: after grouping by 'Group', filter to keep only the groups whose number of rows (n()) is greater than 2.

library(dplyr)
data1 %>%
      group_by(Group) %>%
      filter(n() > 2)

Or using base R, we get the frequency of each 'Group' with table and use %in% inside subset to keep only the groups that occur more than 2 times.

tbl <- table(data1$Group)
subset(data1, Group %in% names(tbl)[tbl>2])
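
The question also asks for the factor levels to be updated automatically. One possible way to do this is to wrap the base R approach in a small helper and pass the result through droplevels(), which removes unused levels from every factor column of a data frame. The sketch below is only an illustration: the helper name drop_rare_groups and its threshold argument are hypothetical and not part of the original answer, and the example data is rebuilt on the assumption that Group is a factor.

```r
# Hypothetical helper (not from the original answer): keep only the rows whose
# group occurs more than `threshold` times, then drop unused factor levels.
drop_rare_groups <- function(df, group_col, threshold = 2) {
  counts <- table(df[[group_col]])
  keep <- df[[group_col]] %in% names(counts)[counts > threshold]
  out <- droplevels(df[keep, , drop = FALSE])
  rownames(out) <- NULL  # renumber rows 1..n, as in the desired output
  out
}

# Example data from the question (Group assumed to be a factor)
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

result <- drop_rare_groups(data1, "Group")
levels(result$Group)  # "B" "C"
```

The same droplevels() call should also work on the results of the data.table and dplyr variants, since both return objects that inherit from data.frame.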