RDPD RDPD - 2 months ago 22
R Question

R strsplit() with multiple criteria

I am trying to split sentences based on different criteria. I am looking to split some sentences after " is" and some after " never". I was able to split sentences based on either of these conditions but not both.

str <- matrix(c("This is line one", "This is not line one",
"This can never be line one"), nrow = 3, ncol = 1)

> str
     [,1]
[1,] "This is line one"
[2,] "This is not line one"
[3,] "This can never be line one"

str2 <- apply(str, 1, function (x) strsplit(x, " is", fixed = TRUE))

> str2
[[1]]
[[1]][[1]]
[1] "This" " line one"


[[2]]
[[2]][[1]]
[1] "This" " not line one"


[[3]]
[[3]][[1]]
[1] "This can never be line one"


I would like to split the last sentence after " never". I am not sure how to do that.

Answer

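We can use regex lookarounds to split each line at the whitespace that follows 'is' or 'never'. Here, (?<=\\bis)\\s+ matches one or more whitespace characters (\\s+) preceded by the word 'is', and the alternative after | does the same for whitespace preceded by 'never'. The \\b word boundary keeps the 'is' inside 'This' from matching, and perl = TRUE is required because lookbehind is a Perl-style regex feature.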

strsplit(str[,1], "(?<=\\bis)\\s+|(?<=\\bnever)\\s+", perl = TRUE)
#[[1]]
#[1] "This is"  "line one"

#[[2]]
#[1] "This is"      "not line one"

#[[3]]
#[1] "This can never" "be line one"   
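
The pattern above can also be assembled from a vector of keywords, which may be easier to extend when more split words are needed. This is a minimal sketch, assuming the keywords are plain words with no regex metacharacters; split_after is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper (not part of base R): split each string at the
# whitespace that follows any of the given keywords.
split_after <- function(x, words) {
  # Build one lookbehind alternative per keyword, e.g.
  # "(?<=\\bis)\\s+|(?<=\\bnever)\\s+" for c("is", "never").
  pattern <- paste0("(?<=\\b", words, ")\\s+", collapse = "|")
  # perl = TRUE is needed because lookbehind is a Perl-style feature.
  strsplit(x, pattern, perl = TRUE)
}

# Same data as in the question.
str <- matrix(c("This is line one", "This is not line one",
                "This can never be line one"), nrow = 3, ncol = 1)

split_after(str[, 1], c("is", "never"))
```

For c("is", "never") the generated pattern is identical to the one written by hand above, so the result should match that output.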

If we also want to remove 'is' and 'never' from the result, we can split on the word together with its surrounding whitespace:

strsplit(str[,1], "(?:\\s+(is|never)\\s+)")
#[[1]]
#[1] "This"     "line one"

#[[2]]
#[1] "This"         "not line one"

#[[3]]
#[1] "This can"    "be line one"
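
Since the data in the question is a one-column matrix, it may be convenient to bind the pieces back into a matrix. This is a minimal sketch, assuming every sentence splits into exactly two pieces; a sentence containing both split words would yield three pieces, and rbind() would then recycle values with a warning. The pattern here drops the non-capturing group and sets perl = TRUE, which is a choice made for this sketch rather than part of the answer above.

```r
# Same data as in the question.
str <- matrix(c("This is line one", "This is not line one",
                "This can never be line one"), nrow = 3, ncol = 1)

# Split on " is " or " never ", discarding the keyword and its whitespace.
parts <- strsplit(str[, 1], "\\s+(is|never)\\s+", perl = TRUE)

# Stack the two-element vectors into a 3 x 2 character matrix:
# column 1 holds the text before the keyword, column 2 the text after it.
halves <- do.call(rbind, parts)
halves
```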