Denis Efimov - 7 months ago 36

R Question

I have a data set with several variables. One of them is a factor that holds the group labels (A, B, C, etc.); the remaining variables are numeric.

```
> data1
   Group Value
1      A    23
2      A    25
3      B     1
4      C    15
5      C    11
6      C    14
7      B     3
8      B     4
9      B     2
10     C    19
```

For further statistical calculations I want to exclude from the data set the rows that belong to a particular group (e.g., X), provided that the group occurs in the data frame only a given number of times (e.g., fewer than 2 times).
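To make the condition concrete, here is a minimal sketch that rebuilds the example data and counts how often each group occurs. It assumes `Group` is stored as a factor and `Value` as numeric, which the post implies but does not show explicitly.

```r
# Rebuild the example data (assumption: Group is a factor, Value is numeric)
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

# Number of rows per group
counts <- table(data1$Group)
print(counts)
```

In this data `A` occurs twice, while `B` and `C` occur four times each, so `A` is the rare group here.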

The material I've seen so far mainly concerns deleting rows with specific values and does not deal with how often a group (factor level) occurs in the data frame. Maybe I'm wrong? Sorry!

To remove specific rows in the "manual" mode, I use the following code:

```
data1 <- as.data.frame(
  lapply(subset(data1, !Group == "A"),
         function(x) if (is.factor(x)) factor(x) else x
  )
)
```
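As a side note, the re-factoring step in the manual approach can probably be shortened with base R's `droplevels()`, which removes unused factor levels from a data frame. The sketch below rebuilds the example data under the assumption that `Group` is a factor; the name `manual` is only used for illustration.

```r
# Example data (assumption: Group is a factor, Value is numeric)
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

# Drop the rows of group "A", then remove the now-empty factor level
manual <- droplevels(subset(data1, Group != "A"))

# Only the levels that still occur remain
print(levels(manual$Group))
```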

I would like to automate this process and exclude all factor levels (groups) whose number of occurrences falls below a predetermined threshold, so that the result looks like this:

```
> data1
  Group Value
1     B     1
2     C    15
3     C    11
4     C    14
5     B     3
6     B     4
7     B     2
8     C    19
```

Answer

We can use `data.table`. Convert the 'data.frame' to a 'data.table' (`setDT(data1)`), group by 'Group', and `if` the number of rows is greater than 2 (`.N > 2`), return the Subset of Data.table (`.SD`).

```
library(data.table)
setDT(data1)[, if(.N >2) .SD, by = Group]
```

Or with `dplyr`: after grouping by 'Group', `filter` the groups whose number of rows (`n()`) is greater than 2.

```
library(dplyr)
data1 %>%
  group_by(Group) %>%
  filter(n() > 2)
```

Or using `base R`, we get the frequency of 'Group' with `table` and use `%in%` in `subset` to keep the groups that occur more than 2 times.

```
tbl <- table(data1$Group)
subset(data1, Group %in% names(tbl)[tbl>2])
```
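The base R approach can be wrapped in a small function so that the threshold becomes a parameter. This is only a sketch: `keep_frequent_groups`, `group_col`, and `min_n` are hypothetical names, not part of any package. It also calls `droplevels()` and resets the row names, on the assumption that the empty level `A` and the old row numbers are unwanted, as the expected output in the question suggests.

```r
# Hypothetical helper: keep only groups with more than `min_n` rows
keep_frequent_groups <- function(df, group_col = "Group", min_n = 2) {
  tbl  <- table(df[[group_col]])               # rows per group
  keep <- names(tbl)[tbl > min_n]              # groups frequent enough to keep
  out  <- df[df[[group_col]] %in% keep, , drop = FALSE]
  rownames(out) <- NULL                        # renumber rows from 1
  droplevels(out)                              # drop factor levels that no longer occur
}

# Example data (assumption: Group is a factor, Value is numeric)
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

result <- keep_frequent_groups(data1, "Group", min_n = 2)
print(result)
```

With `min_n = 2` the two rows of group `A` are removed and the eight rows of `B` and `C` are kept in their original order.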