Denis Efimov - 1 year ago 47

R Question

I have a data set containing several variables. One of the variables is a factor that holds the group labels (A, B, C, etc.). The remaining variables are numeric.

```
> data1
   Group Value
1      A    23
2      A    25
3      B     1
4      C    15
5      C    11
6      C    14
7      B     3
8      B     4
9      B     2
10     C    19
```

For further statistical calculations I want to exclude from the data set all rows belonging to a group (e.g., X) if that group occurs in the data frame fewer than a given number of times (in the example here, 2 times or fewer).

The material I have found so far mostly deals with deleting rows that contain specific values, not with filtering by how often a group (factor level) occurs in the data frame. Maybe I'm wrong? Sorry!

To remove specific rows in the "manual" mode, I use the following code:

```
data1 <- as.data.frame(
  lapply(subset(data1, !Group=="A"),
         function(x) if(is.factor(x)) factor(x) else x
  )
)
```

I would like to automate this process and exclude all factor levels (groups) that occur no more than a predetermined number of times, so that the result looks like this:

```
> data1
  Group Value
1     B     1
2     C    15
3     C    11
4     C    14
5     B     3
6     B     4
7     B     2
8     C    19
```

User 'akrun' suggested the following code:

```
tbl <- table(data1$Group)
data1 <- subset(data1, Group %in% names(tbl)[tbl>2])
```

This is exactly what I needed, and I thank him for it!

However, in the result the factor levels remain unchanged: the excluded groups are still listed as levels. To correct this, I have to add the following line:

`data1$Group = factor(data1$Group)`

Surely there is a ready-made solution that takes this case into account?

Answer Source

We can use `data.table`. Convert the 'data.frame' to 'data.table' (`setDT(data1)`), grouped by 'Group', `if` the number of rows is greater than 2 (`.N >2`), we get the Subset of Data.table (`.SD`).

```
library(data.table)
setDT(data1)[, if(.N >2) .SD, by = Group]
```
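As a hedged, self-contained sketch of the same idea, the snippet below rebuilds the sample data from the question, applies the `data.table` filter, and then drops the unused level with base R's `droplevels()`. The names `dt` and `res` are only illustrative, and `as.data.table()` is used instead of `setDT()` on the assumption that you do not want `data1` converted in place.

```r
library(data.table)

# Sample data from the question; Group is stored as a factor.
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

# as.data.table() makes a copy; setDT(data1) would convert data1 by reference.
dt <- as.data.table(data1)

# Keep only the groups that have more than 2 rows.
res <- dt[, if (.N > 2) .SD, by = Group]

# Remove factor levels that no longer occur (here "A").
res[, Group := droplevels(Group)]

levels(res$Group)
```

Note that a grouped `data.table` result is returned group by group (in order of first appearance), so the rows will not necessarily be in their original order.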

Or with `dplyr`, after grouping by 'Group', `filter` the groups that have nrows (`n()`) greater than 2.

```
library(dplyr)
data1 %>%
  group_by(Group) %>%
  filter(n() > 2)
```
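The `dplyr` pipeline returns a grouped tibble and, as far as I can tell, does not drop the unused factor level on its own. A possible extension, shown here on the sample data from the question, is to `ungroup()` and then call base R's `droplevels()` on the result. The name `res` is just for illustration.

```r
library(dplyr)

# Sample data from the question; Group is stored as a factor.
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

res <- data1 %>%
  group_by(Group) %>%
  filter(n() > 2) %>%   # keep groups with more than 2 rows
  ungroup() %>%         # remove the grouping so later steps act on the whole table
  droplevels()          # drop factor levels that no longer occur

levels(res$Group)
```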

Or using `base R`, we get the frequency of 'Group' with `table` and `%in%` in `subset` to keep the groups.

```
tbl <- table(data1$Group)
subset(data1, Group %in% names(tbl)[tbl>2])
```
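To automate the whole step, including the factor levels, one option is to wrap the base R approach in a small helper. This is only a sketch: `drop_rare_groups`, `group_col`, and `min_n` are hypothetical names, not part of any package, and the helper assumes the grouping column is a factor or character vector. It relies on `droplevels()`, which removes unused levels from every factor column of a data frame.

```r
# Hypothetical helper: keep rows whose group occurs at least `min_n` times,
# then drop factor levels that no longer occur.
drop_rare_groups <- function(df, group_col, min_n) {
  counts <- table(df[[group_col]])
  keep <- names(counts)[counts >= min_n]
  out <- df[df[[group_col]] %in% keep, , drop = FALSE]
  rownames(out) <- NULL   # renumber rows 1..n
  droplevels(out)
}

# Sample data from the question; Group is stored as a factor.
data1 <- data.frame(
  Group = factor(c("A", "A", "B", "C", "C", "C", "B", "B", "B", "C")),
  Value = c(23, 25, 1, 15, 11, 14, 3, 4, 2, 19)
)

# min_n = 3 corresponds to tbl > 2 in the code above.
res <- drop_rare_groups(data1, "Group", min_n = 3)
levels(res$Group)
```

Unlike the `data.table` variant, this keeps the remaining rows in their original order.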