Léo Joubert - 2 months ago
R Question

Comparing two versions of the same string

I would like to write a function that compares two strings in R. More precisely, if I have this data:

data <- list(
"First sentence.",
"Very first sentence.",
"Very first and only one sentences."
)

I would like the output to be:

[1] "Very" " and only one sentences"

The output consists of the part of each string that does not match the previous string. For example:

Compare the 2nd with the 1st: remove the matching string, "first sentence.", from the 2nd, so the result is "Very".

# "First sentence."
# "Very first sentence."
# match: ^^^^^^^^^^^^^^^

Now compare the 3rd with the 2nd: remove the matching string, "very first", from the 3rd, so the result is " and only one sentences".

# "Very first sentence."
# "Very first and only one sentences."
# match: ^^^^^^^^^^

Then compare the 4th with the 3rd, and so on.

So based on this example my output should be:

c("Very", " and only one sentences")
# [1] "Very" " and only one sentences"
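The rule described above can be sketched in base R as "find the longest substring shared with the previous string, ignoring case, and cut it out of the current one". This is only a minimal sketch: the helper names `longest_common` and `remove_common` are hypothetical, and trimming the leftover whitespace is an assumption, so the result keeps the trailing period and drops the leading space rather than matching the expected output above character for character.

```r
# Hypothetical helper: longest common substring of a and b, case-insensitive.
# Returns the start position within b and the length of the match.
longest_common <- function(a, b) {
  x <- strsplit(tolower(a), "")[[1]]
  y <- strsplit(tolower(b), "")[[1]]
  best_len <- 0L
  best_end <- 0L
  prev <- integer(length(y) + 1L)   # match lengths for the previous row
  for (i in seq_along(x)) {
    cur <- integer(length(y) + 1L)
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        cur[j + 1L] <- prev[j] + 1L  # extend the match ending at (i-1, j-1)
        if (cur[j + 1L] > best_len) {
          best_len <- cur[j + 1L]
          best_end <- j
        }
      }
    }
    prev <- cur
  }
  c(start = best_end - best_len + 1L, len = best_len)
}

# Hypothetical helper: drop the longest shared substring from cur_str.
remove_common <- function(prev_str, cur_str) {
  m <- longest_common(prev_str, cur_str)
  if (m[["len"]] == 0L) return(cur_str)
  rest <- paste0(
    substr(cur_str, 1L, m[["start"]] - 1L),
    substr(cur_str, m[["start"]] + m[["len"]], nchar(cur_str))
  )
  trimws(rest)  # assumption: surrounding whitespace is not wanted
}

data <- list(
  "First sentence.",
  "Very first sentence.",
  "Very first and only one sentences."
)
strings <- unlist(data)

# Compare each string with the one before it.
result <- mapply(remove_common, head(strings, -1), tail(strings, -1),
                 USE.NAMES = FALSE)
print(result)
```

Only one shared substring is removed per comparison, so this differs from a word-by-word filter when the two strings share several separate fragments.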


Here's a tidyverse approach:


library(tidyverse)

# put data in a data.frame
# (data_frame() and add_rownames() are deprecated in recent tibble/dplyr
# releases; tibble() and rownames_to_column() are their replacements)
data_frame(string = unlist(data)) %>% 
    # add ID column so we can recombine later
    add_rownames('id') %>% 
    # add a lagged column to compare against
    mutate(string2 = lag(string)) %>% 
    # break strings into words
    separate_rows(string) %>% 
    # evaluate the following calls rowwise (until regrouped)
    rowwise() %>% 
    # chop to rows with a string to compare against,
    filter(!is.na(string2), 
           # where the word is not in the comparison string
           !grepl(string, string2, ignore.case = TRUE)) %>% 
    # regroup by ID
    group_by(id) %>%
    # reassemble strings
    summarise(string = paste(string, collapse = ' '))

## # A tibble: 2 x 2
##      id                  string
##   <chr>                   <chr>
## 1     2                    Very
## 2     3 and only one sentences.

If you'd like just a vector, extract the string column by appending

    %>% `[[`('string')

## [1] "Very"                    "and only one sentences."
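For comparison, the same word-level filter can be sketched in base R without the pipeline. This is a rough equivalent, not the code above: the function name `word_diff` is hypothetical, words are split on whitespace only, and each word is matched literally (`fixed = TRUE` on lower-cased strings) rather than as a regular expression, so a `.` inside a word matches only a literal period.

```r
# Hypothetical helper: keep the words of cur_str that do not occur
# (as a case-insensitive literal substring) in prev_str.
word_diff <- function(prev_str, cur_str) {
  words <- strsplit(cur_str, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]  # drop empty pieces from leading whitespace
  found <- vapply(
    words,
    function(w) grepl(tolower(w), tolower(prev_str), fixed = TRUE),
    logical(1)
  )
  paste(words[!found], collapse = " ")
}

data <- list(
  "First sentence.",
  "Very first sentence.",
  "Very first and only one sentences."
)
strings <- unlist(data)

# Compare each string with the one before it.
word_result <- mapply(word_diff, head(strings, -1), tail(strings, -1),
                      USE.NAMES = FALSE)
print(word_result)
```

Because the check is a substring test, a short word such as "a" is dropped whenever that letter appears anywhere in the previous string; matching on whole words would need a stricter comparison.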