Rambo - 1 month ago
R Question

R - gsub function

EDIT I have an input dataframe like this:

[image not shown: input data frame]

I want the output to be like this:

[image not shown: desired output]

Please find my explanation below. I literally don't know how to give a more detailed explanation than this :(

[image not shown: illustration of the explanation]

Let me explain. In the input dataset, for the rows whose COL1 value is 10, I want to scan the COL2 values and replace any text pattern that repeats across those rows with "*". The same logic applies to every other set of COL2 values that share a COL1 value.
I want to use the gsub function for this.

I tried gsub together with paste several times, but I am not getting the desired output because I do not know how to match all the patterns within each group of duplicates.

I have already asked this question, but since I did not receive an answer, I'm re-posting it.

Attaching the dput of input dataframe below:

structure(list(COL1 = c(10L, 10L, 10L, 20L, 20L, 30L, 30L, 40L,
40L, 40L, 50L, 50L, 50L), COL2 = c("mary has life", "Don mary has life",
"Britto mary has life", "push them fur", "push them ", "yell at this",
"this is yell at this", "Year", "Doggy", "Horse", "This is great job",
"great job", "Donkey")), .Names = c("COL1", "COL2"), row.names = c(NA,
-13L), class = "data.frame")

Answer

You can write a function to run gsub for each item in a group and select the shortest replacement (aside from itself, of course):

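fun <- function(col){
    # Column j holds every string in the group with matches of col[j] replaced by '*'
    matches <- sapply(col, function(x){gsub(x, '\\*', col)})
    # A string always matches itself, so ignore the diagonal
    diag(matches) <- NA
    # For each string (row), keep the shortest remaining candidate
    apply(matches, 1, function(x){x[which.min(nchar(x))]})
}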
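
To see what the function does internally, it may help to look at the intermediate matrix for a single group. The sketch below repeats fun so that it runs on its own and uses the three COL2 strings from the COL1 = 10 rows of the question; grp and m are illustrative names only.

```r
# Same logic as the answer's helper, repeated so this snippet is self-contained.
fun <- function(col){
    matches <- sapply(col, function(x){gsub(x, '\\*', col)})
    diag(matches) <- NA
    apply(matches, 1, function(x){x[which.min(nchar(x))]})
}

# COL2 values of the COL1 == 10 group from the question
grp <- c("mary has life", "Don mary has life", "Britto mary has life")

# Row i is grp[i]; column j shows grp[i] after replacing matches of grp[j]
m <- sapply(grp, function(x) gsub(x, '\\*', grp))
diag(m) <- NA      # drop the trivial self-matches
print(unname(m))   # inspect the candidates for each string

# Picking the shortest candidate per row gives the final column
res <- unname(fun(grp))
print(res)
# "mary has life" "Don *" "Britto *"
```

The first string stays unchanged because neither of the longer strings occurs inside it, while the other two contain "mary has life" and are therefore shortened.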

Now apply it per group in whichever grammar you prefer, for example dplyr:

library(dplyr)

df %>% group_by(COL1) %>% mutate(COL3 = fun(COL2))

## Source: local data frame [13 x 3]
## Groups: COL1 [5]
## 
##     COL1                 COL2          COL3
##    <int>                <chr>         <chr>
## 1     10        mary has life mary has life
## 2     10    Don mary has life         Don *
## 3     10 Britto mary has life      Britto *
## 4     20        push them fur          *fur
## 5     20           push them     push them 
## 6     30         yell at this  yell at this
## 7     30 this is yell at this     this is *
## 8     40                 Year          Year
## 9     40                Doggy         Doggy
## 10    40                Horse         Horse
## 11    50    This is great job     This is *
## 12    50            great job     great job
## 13    50               Donkey        Donkey

Or keep it all in base R:

df$COL3 <- ave(df$COL2, df$COL1, FUN = fun)

df

##    COL1                 COL2          COL3
## 1    10        mary has life mary has life
## 2    10    Don mary has life         Don *
## 3    10 Britto mary has life      Britto *
## 4    20        push them fur          *fur
## 5    20           push them     push them 
## 6    30         yell at this  yell at this
## 7    30 this is yell at this     this is *
## 8    40                 Year          Year
## 9    40                Doggy         Doggy
## 10   40                Horse         Horse
## 11   50    This is great job     This is *
## 12   50            great job     great job
## 13   50               Donkey        Donkey
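
Two caveats seem worth noting. Each string is passed to gsub as a regular expression, so text containing characters such as "." or "(" may match more than intended. Also, a group with only one row makes sapply return a plain vector, on which diag<- fails. The following is a minimal sketch of a variant that addresses both, assuming literal matching is what is wanted; fun_fixed is a hypothetical name, not part of the answer above.

```r
# fun_fixed is a hypothetical, hardened variant of the answer's helper.
fun_fixed <- function(col){
    n <- length(col)
    # A group with fewer than two strings has nothing to compare against
    if (n < 2) return(col)
    # fixed = TRUE treats each string as literal text, not as a regex;
    # vapply guarantees an n x n character matrix
    matches <- vapply(col, function(x) gsub(x, '*', col, fixed = TRUE),
                      character(n), USE.NAMES = FALSE)
    # Ignore each string's match against itself
    diag(matches) <- NA
    # Keep the shortest candidate for each string
    apply(matches, 1, function(x) x[which.min(nchar(x))])
}

# "a.b" as a regex would also match "aab"; as literal text it does not
out <- fun_fixed(c("a.b", "x a.b", "xaab"))
print(out)
# "a.b" "x *" "xaab"

# A single-row group is returned unchanged
single <- fun_fixed("only one")
print(single)
# "only one"
```

It can be dropped into either call shown above, for instance ave(df$COL2, df$COL1, FUN = fun_fixed).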