Chirayu Chamoli - 1 year ago 57

R Question

I would like to keep only the top 2 factor levels by frequency and group all other levels into `Others`. I tried the following, but it doesn't work.

```
df=data.frame(a=as.factor(c(rep('D',3),rep('B',5),rep('C',2))),
              b=as.factor(c(rep('A',5),rep('B',5))),
              c=as.factor(c(rep('A',3),rep('B',5),rep('C',2))))

myfun=function(x){
  if(is.factor(x)){
    levels(x)[!levels(x) %in% names(sort(table(x),decreasing = T)[1:2])]='Others'
  }
}

df=as.data.frame(lapply(df, myfun))
```

Expected Output

```
a      b c
D      A A
D      A A
D      A A
B      A B
B      A B
B      B B
B      B B
B      B B
others B others
others B others
```

Answer

This might get a bit messy, but here is one approach in base R:

```
# First attempt: relabels the levels in their existing (alphabetical) order
fun1 <- function(x) {
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}
```

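However, the function above assigns the new labels positionally to the existing levels, which are in alphabetical order by default, so the levels first need to be re-ordered by frequency. As the OP states in a comment, the correct version is: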

```
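fun1 <- function(x) {
  # Re-order the levels from most to least frequent
  x <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
  # Keep the first two level names and relabel the rest as 'others'
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}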
```
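To apply this to every column of the example data frame, something like the sketch below might work. The helper name `keep_top` and its `n` and `other` parameters are illustrative and not part of the original answer. The guard for non-factor columns and for factors with `n` or fewer levels is an added assumption, since `rep()` would fail on a negative count.

```r
# Example data from the question.
df <- data.frame(
  a = factor(c(rep("D", 3), rep("B", 5), rep("C", 2))),
  b = factor(c(rep("A", 5), rep("B", 5))),
  c = factor(c(rep("A", 3), rep("B", 5), rep("C", 2)))
)

# keep_top is a hypothetical helper; n and other are assumed parameters.
keep_top <- function(x, n = 2, other = "others") {
  # Leave non-factors and factors with at most n levels untouched.
  if (!is.factor(x) || nlevels(x) <= n) return(x)
  # Level names ordered from most to least frequent.
  by_freq <- names(sort(table(x), decreasing = TRUE))
  # Re-order the levels by frequency, then relabel everything past the top n.
  x <- factor(x, levels = by_freq)
  levels(x) <- c(by_freq[seq_len(n)], rep(other, length(by_freq) - n))
  x
}

# df[] keeps the data frame structure while replacing each column.
df[] <- lapply(df, keep_top)
df
```

Column `b` has only two levels, so it passes through unchanged, while `C` in columns `a` and `c` becomes `others`.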