Doug Fir - 1 month ago 12

# Replace values in factor based on frequency of levels

Here is a data frame:

``````
vegetables <- c("carrots", "carrots", "carrots", "carrots", "carrots")
animals <- c("cats", "dogs", "dogs", "fish", "cats")
df <- data.frame(vegetables, animals)
``````

It looks like this:

``````
> df
  vegetables animals
1    carrots    cats
2    carrots    dogs
3    carrots    dogs
4    carrots    fish
5    carrots    cats
``````

If I wanted to remove rows containing a level whose frequency is below some threshold, e.g. 2 (so the fish row in the example df), I could do this:

``````
# Keep only the rows whose value in each column occurs at least twice
for (i in names(df)) {
  df <- subset(df, df[, i] %in% names(which(table(df[, i]) >= 2)))
}

> df
  vegetables animals
1    carrots    cats
2    carrots    dogs
3    carrots    dogs
5    carrots    cats
``````

But what if I don't want to remove the observation, and instead want to replace fish with "bla"?

How would I do that?

Desired output:

``````
> df
  vegetables animals
1    carrots    cats
2    carrots    dogs
3    carrots    dogs
4    carrots    bla
5    carrots    cats
``````

We can use `data.table`. This keeps only the rows whose combination of `vegetables` and `animals` occurs more than once:

``````
library(data.table)
# .I holds row indices; keep those belonging to groups with more than one row
setDT(df)[df[, .I[.N > 1], by = .(vegetables, animals)]$V1]
``````

If we want to replace the low-frequency items in each column with 'bla' instead:

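``````
threshold <- 1                     # values occurring this many times or fewer are replaced
df[] <- lapply(df, as.character)   # factor -> character so "bla" can be assigned
setDT(df)
for (j in seq_along(df)) {
  col <- names(df)[j]
  # count rows per value of this column, overwrite the rare ones, drop the helper column
  df[, N := .N, by = col][N <= threshold, (col) := "bla"][, N := NULL]
}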
``````
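
For comparison, here is a minimal base R sketch of the same replacement, in case `data.table` is not available. The helper `replace_rare` and its arguments `min_count` and `replacement` are hypothetical names introduced for illustration; they are not part of the original answer. It assumes "low frequency" means fewer than `min_count` occurrences within a column.

```r
# Hypothetical helper (not from the original answer): replace values that
# occur fewer than `min_count` times in a vector with `replacement`.
replace_rare <- function(x, min_count = 2, replacement = "bla") {
  x <- as.character(x)                       # work on character values, not factor levels
  counts <- table(x)                         # frequency of each distinct value
  rare <- names(counts)[counts < min_count]  # values below the cutoff
  x[x %in% rare] <- replacement
  x
}

vegetables <- c("carrots", "carrots", "carrots", "carrots", "carrots")
animals <- c("cats", "dogs", "dogs", "fish", "cats")
df <- data.frame(vegetables, animals)

# Apply the helper to every column while keeping the data frame structure
df[] <- lapply(df, replace_rare)
df
```

The columns come back as character vectors; wrap them in `factor()` afterwards if factors are needed.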