user3357059 - 3 months ago
R Question

Using R, how to separate a string based on characters

I have a set of strings and I need to search for words that have a period in the middle. Some of the strings are concatenated, so I need to break them apart into words so that I can then filter for words with dots.

Below is a sample of what I have and what I get so far

punctToRemove <- c("[^[:alnum:][:space:]._]")

s <- c("get_degree('TITLE',PERS.ID)",
"CLIENT_NEED.TYPE_CODe=21",
"2.1.1Report Field Level Definition",
"The user defined field. The user will validate")


This is what I currently get

gsub(punctToRemove, " ", s)

[1] "get_degree TITLE PERS.ID "
[2] "CLIENT_NEED.TYPE_CODe 21"
[3] "2.1.1Report Field Level Definition"
[4] "The user defined field. The user will validate"


Sample of what I want is below

[1] "get_degree ( ' TITLE ' , PERS.ID ) " # spaces before and after the "(", "'", ",", and ")"
[2] "CLIENT_NEED.TYPE_CODe = 21" # spaces before and after the "=" sign. Dot and underscore remain untouched.
[3] "2.1.1Report Field Level Definition" # no changes
[4] "The user defined field. The user will validate" # no changes

Answer

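We can use regex lookarounds. The pattern matches the zero-width positions immediately after or immediately before any of the characters ' = ( ) , and gsub inserts a single space at each one, so two adjacent punctuation marks end up separated by one space rather than two. perl = TRUE is needed because lookbehind is only supported by the PCRE engine.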

s1 <- gsub("(?<=['=(),])|(?=['(),=])", " ", s, perl = TRUE)
s1
#[1] "get_degree ( ' TITLE ' , PERS.ID ) "           
#[2] "CLIENT_NEED.TYPE_CODe = 21"                    
#[3] "2.1.1Report Field Level Definition"            
#[4] "The user defined field. The user will validate"

nchar(s1)
#[1] 35 26 34 46

These counts are equal to the number of characters shown in the OP's expected output.
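
The question's end goal is to filter for words that have a period in the middle. A minimal sketch of that follow-up step, assuming that "word" means a whitespace-delimited token and that "in the middle" means the period has a letter, digit, or underscore on both sides (the names words and dotted are only illustrative):

```r
s <- c("get_degree('TITLE',PERS.ID)",
       "CLIENT_NEED.TYPE_CODe=21",
       "2.1.1Report Field Level Definition",
       "The user defined field. The user will validate")

# Insert a space before and after each of ' = ( ) , using lookarounds
s1 <- gsub("(?<=['=(),])|(?=['(),=])", " ", s, perl = TRUE)

# Split every string on runs of whitespace and flatten into one vector;
# trimws() drops the trailing space so no empty token is produced
words <- unlist(strsplit(trimws(s1), "[[:space:]]+"))

# Assumption: a "word with a dot" has an alphanumeric or underscore
# character on both sides of the period, so "field." is excluded
dotted <- grep("[[:alnum:]_]\\.[[:alnum:]_]", words, value = TRUE)
dotted
```

With the sample data this should keep PERS.ID, CLIENT_NEED.TYPE_CODe, and 2.1.1Report, while the sentence-ending "field." is dropped because nothing follows its period. If a trailing period should count as well, the second character class in the grep pattern can be removed.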