Doug Fir - 1 month ago
R Question

Replace values in factor based on frequency of levels

Here is a data frame:

vegetables <- c("carrots", "carrots", "carrots", "carrots", "carrots")
animals <- c("cats", "dogs", "dogs", "fish", "cats")
df <- data.frame(vegetables, animals)


Looks like:

> df
vegetables animals
1 carrots cats
2 carrots dogs
3 carrots dogs
4 carrots fish
5 carrots cats


If I want to remove the rows whose level occurs fewer than 2 times in a column (the fish row in the example df), I can do this:

for (i in names(df)) {
  df <- subset(df, df[, i] %in% names(which(table(df[, i]) >= 2)))
}

> df
vegetables animals
1 carrots cats
2 carrots dogs
3 carrots dogs
5 carrots cats


But what if I don't want to remove the observation, but instead replace fish with "bla"?

How would I do that?

Desired output:

> df
vegetables animals
1 carrots cats
2 carrots dogs
3 carrots dogs
4 carrots bla
5 carrots cats

Answer

We can use data.table. To drop the rows whose combination of values occurs only once:

library(data.table)
setDT(df)[df[,  .I[.N > 1], by = .(vegetables, animals)]$V1]

To instead replace the low-frequency items in each column with 'bla':

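threshold <- 1                     # replace values that occur this many times or fewer
df[] <- lapply(df, as.character)   # factors to character, so "bla" can be assigned
setDT(df)
for (j in seq_along(df)) {
  # count rows per value of column j, overwrite the rare ones, then drop the helper column
  df[, N := .N, by = c(names(df)[j])][N <= threshold, names(df)[j] := "bla"][, N := NULL]
}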
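
If you would rather stay in base R and keep the columns as factors, the same idea can probably be sketched by renaming the rare levels directly. This is only an illustration under a few assumptions: `replace_rare`, `min_count` and `replacement` are names made up here (they are not part of any package), and "rare" is taken to mean "occurs fewer than `min_count` times within its own column", as in the question.

```r
# Hypothetical helper: rename every rare level of each factor/character
# column to a single replacement label, using base R only.
replace_rare <- function(data, min_count = 2, replacement = "bla") {
  data[] <- lapply(data, function(col) {
    # Leave numeric, logical, etc. columns untouched
    if (!is.factor(col) && !is.character(col)) return(col)
    f <- as.factor(col)
    counts <- table(f)                                  # frequency of each level
    rare <- names(counts)[counts > 0 & counts < min_count]
    # Assigning the same name to several levels merges them into one
    levels(f)[levels(f) %in% rare] <- replacement
    # Return the same type that came in
    if (is.factor(col)) f else as.character(f)
  })
  data
}

df <- data.frame(
  vegetables = c("carrots", "carrots", "carrots", "carrots", "carrots"),
  animals = c("cats", "dogs", "dogs", "fish", "cats"),
  stringsAsFactors = TRUE
)

result <- replace_rare(df)
print(result)
```

With the example data, only fish falls below the default `min_count` of 2, so row 4 of animals becomes "bla" and everything else is left alone. Because the levels themselves are renamed, there is no need to convert the factors to character first.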