Moderat - 1 year ago 60

# Combining factor levels in R 3.2.1

In previous versions of R, I could combine factor levels that fell below a "significant" volume threshold using the following little function:

``````
whittle = function(data, cutoff_val){
  # convert to a data frame
  tab = as.data.frame.table(table(data))
  # returns vector of indices where value is below cutoff_val
  idx = which(tab$Freq < cutoff_val)
  levels(data)[idx] = "Other"
  return(data)
}
``````

This takes in a factor vector, looks for levels that don't appear "often enough" and combines all of those levels into one "Other" factor level. An example of this is as follows:

``````
> sort(table(data$State))

05    27    35    40    54    84     9    AP    AU    BE    BI    DI     G    GP    GU    GZ    HN    HR    JA    JM    KE    KU     L    LD    LI    MH    NA
1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1
OU     P    PL    RM    SR    TB    TP    TW     U    VD    VI    VS    WS     X    ZH    47    BL    BS    DL     M    MB    NB    RP    TU    11    DU    KA
1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     2     2     2     2     2     2     2     2     2     3     3     3
BW    ND    NS    WY    AK    SD    13    QC    01    BC    MT    AB    HE    ID     J    NO    LN    NM    ON    NE    VT    UT    IA    MS    AO    AR    ME
4     4     4     4     5     5     6     6     7     7     7     8     8     8     9    10    11    17    23    26    26    30    31    31    38    40    44
OR    KS    HI    NV    WI    OK    KY    IN    WV    AL    CO    WA    MN    NH    MO    SC    LA    TN    AZ    IL    NC    MI    GA    OH    **    CT    DE
45    47    48    57    57    64   106   108   112   113   120   125   131   131   135   138   198   200   233   492   511   579   645   646   840   873  1432
RI    DC    TX    MA    FL    VA    MD    CA    NJ    PA    NY
1782  2513  6992  7027 10527 11016 11836 12221 15485 16359 34045
``````

Now when I use `whittle`, it returns the following warning:

``````
> delete = whittle(data$State, 1000)
Warning message:
In `levels<-`(`*tmp*`, value = c("Other", "Other", "Other", "Other",  :
duplicated levels in factors are deprecated
``````

How can I modify my function so that it has the same effect but doesn't produce these "deprecated" duplicated factor levels? Should I convert to character, tabulate, and then replace the rare values with "Other"?

I've always found it easiest (less typing and less headache) to convert to character and back for these sorts of operations. The version below keeps your `as.data.frame.table` approach and uses `replace` to substitute the low-frequency levels:

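``````
whittle <- function(data, cutoff_val) {
  # count how often each level occurs
  tab = as.data.frame.table(table(data))
  # replace the rare values as character strings, then rebuild the factor
  factor(replace(as.character(data), data %in% tab$data[tab$Freq < cutoff_val], "Other"))
}
``````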

Testing on some sample data:

``````
state <- factor(c("MD", "MD", "MD", "VA", "TX"))
whittle(state, 2)
# [1] MD    MD    MD    Other Other
# Levels: MD Other
``````
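
As a further sanity check, the sketch below runs the same function on a slightly larger, made-up vector and confirms that the result has no duplicated levels and that the rare values were pooled. The sample data and the `rare` helper variable are assumptions for illustration only; they are not the `State` column from the question.

```r
# Same logic as the function above, repeated so this snippet runs on its own.
whittle <- function(data, cutoff_val) {
  tab <- as.data.frame.table(table(data))
  # levels whose count falls below the cutoff (hypothetical helper variable)
  rare <- tab$data[tab$Freq < cutoff_val]
  factor(replace(as.character(data), data %in% rare, "Other"))
}

# Hypothetical sample data: NY x4, PA x3, WY x2, HI x1.
state <- factor(c("NY", "NY", "NY", "NY", "PA", "PA", "PA", "HI", "WY", "WY"))
result <- whittle(state, 3)

levels(result)                 # the surviving levels plus "Other"
table(result)                  # HI and WY are pooled into "Other"
anyDuplicated(levels(result))  # 0 means no duplicated levels
```

Because the factor is rebuilt from a character vector, its levels are derived from the unique values, so duplicates cannot arise by construction.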