Lianzinho Lianzinho - 2 months ago 8
R Question

Add a new level to a factor and substitute existing one

I'm having a lot of trouble dealing with the level names of a factor in a data frame.

I have a big data frame in which one of the columns is a factor with a LOT of levels.

The problem is that some of these values are duplicated, and the next step in my analysis does not accept duplicated data. So I need to rename the duplicated entries before I can move on to my next step.

Let me give you a little example:

Say we have this simple data frame with one column:

> df
col_foo
1 bar1
2 bar2
3 bar3
4 bar2
5 bar4
6 bar5
7 bar3


If we look at the column, we see that it is a factor with 5 distinct levels.

> df$col_foo
[1] bar1 bar2 bar3 bar2 bar4 bar5 bar3
Levels: bar1 bar2 bar3 bar4 bar5


OK, here is the problem. Notice that levels bar2 and bar3 are duplicated. What I want to know is how I can add a new level name, something like bar2_X, and substitute it for only the duplicated entry. So the data frame should become this:

> df
col_foo
1 bar1
2 bar2
3 bar3
4 bar2_X
5 bar4
6 bar5
7 bar3_X


Is that possible? I cannot change the class of the column: it must remain a factor, so solutions that convert it to another class will not solve my problem unless the result can be coerced back to a factor.

Thanks

Answer

If you want all the entries to be unique then a factor does not gain you much over just using a character variable.

Probably the simplest way to do what you want is to coerce to a character vector, use the duplicated function to find the duplicates, paste something onto the end of them, and then, if you want, use factor to coerce the result back to a factor. Possibly something like:

df$col_foo <- factor( ifelse( duplicated(df$col_foo), 
                    paste(df$col_foo, '_x', sep=''), as.character(df$col_foo)))
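
For reference, here is a minimal sketch of the same idea applied step by step to the example data from the question. The helper names vals and dup are only illustrative, and the "_X" suffix is taken from the desired output in the question. The last line is a hypothetical extra case, not part of the original data: if a value could occur three or more times, a fixed suffix would produce duplicates again, and base R's make.unique might be a better fit.

```r
# Rebuild the example data frame from the question. stringsAsFactors is set
# explicitly because its default changed in R 4.0.0.
df <- data.frame(
  col_foo = c("bar1", "bar2", "bar3", "bar2", "bar4", "bar5", "bar3"),
  stringsAsFactors = TRUE
)

# Work on a character copy, suffix only the repeated entries,
# then rebuild the factor from the modified values.
vals <- as.character(df$col_foo)
dup  <- duplicated(vals)               # TRUE for the 2nd and later occurrences
vals[dup] <- paste0(vals[dup], "_X")
df$col_foo <- factor(vals)

is.factor(df$col_foo)                  # TRUE: the column is a factor again
anyDuplicated(vals)                    # 0: no duplicated entries remain

# Hypothetical case with a value occurring three times: a fixed suffix would
# give "bar2_X" twice, whereas make.unique() numbers each repeat.
make.unique(c("bar2", "bar2", "bar2"), sep = "_")   # "bar2" "bar2_1" "bar2_2"
```

With the example data, rows 4 and 7 become bar2_X and bar3_X, and the other rows are unchanged.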