Nestorghh - 1 month ago
R Question

How can I remove repeated characters in a string with R?

I would like to implement a function in R that removes repeated characters in a string. For instance, say my function is named removeRS, so it is supposed to work this way:

removeRS('Buenaaaaaaaaa Suerrrrte')
Buena Suerte
removeRS('Hoy estoy tristeeeeeee')
Hoy estoy triste


My function is going to be used with strings written in Spanish, so it is not that common (or at least not correct) to find words that have more than three successive vowels. I am not concerned with the sentiment the repetition might express. There are words with two successive identical consonants (especially ll and rr), but the function does not need to handle that case.

So, to sum up, this function should replace any letter that appears at least three times in a row with a single copy of that letter. In one of the examples above, aaaaaaaaa is replaced with a.

Could you give me any hints to carry out this task with R?

Answer

I did not think very carefully about this, but here is my quick solution using backreferences in regular expressions:

gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"

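The parentheses () capture a single letter, \\1 refers back to that captured letter, and + means to match it one or more times. Putting these pieces together, the pattern matches a letter that occurs two or more times in a row, and the replacement \\1 keeps a single copy of it.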
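Because that pattern fires on any run of two or more, it would also collapse legitimate doubles such as ll and rr. A minimal sketch of a variant that only collapses runs of at least three, as the question asks, might look like the following. The wrapper is named after the question's removeRS, and the min_run parameter is an assumption introduced here for illustration.

```r
# Hypothetical wrapper named after the question's removeRS.
# min_run is an assumed parameter: the shortest run that gets collapsed.
removeRS <- function(x, min_run = 3) {
  # \\1{n,} requires the captured letter to repeat at least n more times,
  # so the whole match is a run of at least n + 1 identical letters.
  pattern <- sprintf('([[:alpha:]])\\1{%d,}', min_run - 1)
  gsub(pattern, '\\1', x)
}

removeRS('Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"
removeRS('Hoy estoy tristeeeeeee')
# [1] "Hoy estoy triste"
removeRS('calle')   # the double l is left alone
# [1] "calle"
```

One limitation of this sketch: a stretched double letter loses its second copy, so an input like perrrro comes back as pero rather than perro.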

To include other characters besides letters, replace [[:alpha:]] with a regex matching whatever you wish to include.
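As a rough illustration of swapping the character class, the two calls below are assumptions about what one might want to include, not part of the original answer: the first adds punctuation to the class, the second uses . to collapse runs of any character, spaces included.

```r
# Letters and punctuation: POSIX classes can be combined in one bracket expression.
with_punct <- gsub('([[:alpha:][:punct:]])\\1+', '\\1', 'Holaaa!!! Que taaal???')
with_punct
# [1] "Hola! Que tal?"

# Any character at all, including repeated spaces.
any_char <- gsub('(.)\\1+', '\\1', 'Hola    mundooo')
any_char
# [1] "Hola mundo"
```

The broader the class, the more likely it is to damage text where repetition is meaningful, for example digits in a number such as 1000.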