
R strsplit() with multiple criteria

I am trying to split sentences based on different criteria. I am looking to split some sentences after " is" and some after " never". I was able to split sentences based on either of these conditions but not both.

str <- matrix(c("This is line one", "This is not line one",
"This can never be line one"), nrow = 3, ncol = 1)

[1,] "This is line one"
[2,] "This is not line one"
[3,] "This can never be line one"

str2 <- apply(str, 1, function (x) strsplit(x, " is", fixed = TRUE))

> str2
[1] "This" " line one"

[1] "This" " not line one"

[1] "This can never be line one"

I would like to split the last sentence after " never". I am not sure how to do that.

Answer

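We can use a regex lookbehind to split each line at the whitespace that follows 'is' or 'never'. In the pattern, (?<=\\bis)\\s+ matches one or more whitespace characters (\\s+) that are preceded by the word 'is'; the \\b word boundary keeps it from matching the 'is' inside 'This'. The alternative after the | does the same for 'never'. Because only the whitespace is matched, both words stay in the output. Lookbehinds require perl = TRUE.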

strsplit(str[,1], "(?<=\\bis)\\s+|(?<=\\bnever)\\s+", perl = TRUE)
#[1] "This is"  "line one"

#[1] "This is"      "not line one"

#[1] "This can never" "be line one"   
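
If the list of words is likely to grow, the same lookbehind pattern can be assembled from a character vector instead of being written out by hand. The following is only a sketch: split_after() and sentences are hypothetical names introduced for illustration, and it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper (not part of base R): split each string at the
# whitespace that follows any of the given words, keeping the words.
split_after <- function(x, words) {
  # Build one lookbehind per word, e.g. (?<=\\bis)\\s+, and join them with |
  pattern <- paste0("(?<=\\b", words, ")\\s+", collapse = "|")
  strsplit(x, pattern, perl = TRUE)
}

sentences <- c("This is line one",
               "This is not line one",
               "This can never be line one")

res_keep <- split_after(sentences, c("is", "never"))
print(res_keep)
```

Each word gets its own lookbehind, so words of different lengths can be mixed freely.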

If we also want to drop 'is' and 'never' from the result, match the word together with the whitespace on both sides of it:

strsplit(str[,1], "(?:\\s+(is|never)\\s+)")
#[1] "This"     "line one"

#[1] "This"         "not line one"

#[1] "This can"    "be line one"
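
The word-dropping variant can be built from a vector of words in the same way. Again this is a sketch under assumptions: split_dropping() and sentences are hypothetical names, the words are assumed to contain no regex metacharacters, and a word is only treated as a delimiter when it has whitespace on both sides.

```r
# Hypothetical helper (not part of base R): split each string at any of the
# given words, discarding the word and the whitespace around it.
split_dropping <- function(x, words) {
  # Produces a pattern such as \\s+(is|never)\\s+
  pattern <- paste0("\\s+(", paste(words, collapse = "|"), ")\\s+")
  strsplit(x, pattern, perl = TRUE)
}

sentences <- c("This is line one",
               "This is not line one",
               "This can never be line one")

res_drop <- split_dropping(sentences, c("is", "never"))
print(res_drop)
```

Requiring whitespace on both sides is what stops the 'is' inside 'This' from being treated as a delimiter.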
