mike805 - 10 days ago
R Question

text manipulation in R

I am trying to add parentheses around certain parts of book-title character strings, ideally with the paste0 function. I want to take this vector:

a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)", "My Life 1993 07e pdfDrama (amazon.com)")


and wrap certain substrings in parentheses:

a
[1] "I Like What I Know (1959) (02e) (pdfDrama) (amazon.com)"
[2] "My Life (1993) (07e) (pdfDrama) (amazon.com)"


I have tried the following, which builds the parenthesized pieces, but I can't figure out how to put them back into the original strings:

library(stringr)
paste0("(", str_extract(a, "\\d{4}"), ")")
paste0("(", str_extract(a, "[0-9]+.e"), ")")


Help?

Answer

I can suggest a regex for a fixed number of words, each of a specific type:

a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)","My Life 1993 07e pdfDrama (amazon.com)")
sub("\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))", "(\\1)\\2(\\3)\\4(\\5)\\6", a)

In short, the pattern breaks down as follows:

  • \\b(\\d{4}) - captures 4 digits as a whole word into Group 1
  • (\\s+) - Group 2: one or more whitespaces
  • (\\d+e) - Group 3: one or more digits and e
  • (\\s+) - Group 4: one or more whitespaces
  • ([a-zA-Z]+) - Group 5: one or more letters
  • (\\s+\\([^()]*\\)) - Group 6: one or more whitespaces, a literal (, zero or more characters other than ( and ), and a literal ).

The contents of the groups are inserted back into the result with the help of backreferences.
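
To see what each group captures before the backreferences put it back, the groups can be inspected directly. This is only a sketch using base R's regexec and regmatches; the variable names pattern, groups and result are introduced here for illustration and are not part of the original answer.

```r
a <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)",
       "My Life 1993 07e pdfDrama (amazon.com)")

# Same pattern as in the sub() call above, stored in a (hypothetical) variable
pattern <- "\\b(\\d{4})(\\s+)(\\d+e)(\\s+)([a-zA-Z]+)(\\s+\\([^()]*\\))"

# regexec() + regmatches() return, per string, the full match followed by Groups 1-6
groups <- regmatches(a, regexec(pattern, a))
groups[[1]][-1]   # drop the full match, keep the six captured groups

# The replacement wraps Groups 1, 3 and 5 in parentheses and
# re-inserts Groups 2, 4 and 6 (the whitespace and the last part) unchanged
result <- sub(pattern, "(\\1)\\2(\\3)\\4(\\5)\\6", a)
result
```

Because the whitespace groups are captured and re-inserted as they were, the original spacing between the words is preserved in the result.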

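If there can be more words, and you need to wrap every whitespace-separated chunk that starts with a letter, digit, or underscore, beginning at the first 4-digit word and continuing until a chunk that starts with some other character, use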

gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)", "(\\1)", a, perl=TRUE)


Details:

  • (?:(?=\\b\\d{4}\\b)|\\G(?!\\A)) - either the location before a 4-digit whole word (see the positive lookahead (?=\\b\\d{4}\\b)) or the end of the previous successful match
  • \\s* - 0+ whitespaces
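  • \\K - the match reset operator: the text matched so far (the whitespace) is dropped from the match, so it is kept in the output rather than replaced
  • \\b(\\S+) - Group 1 capturing 1 or more non-whitespace symbols that are preceded by a word boundary.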
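
As a quick check of the \\G-based pattern, here is a sketch that runs it on the two strings from the question plus one made-up string with an extra word after the year. The third element and the names b and wrapped are assumptions added for illustration.

```r
# The two strings from the question plus a hypothetical one with an extra word
b <- c("I Like What I Know 1959 02e pdfDrama (amazon.com)",
       "My Life 1993 07e pdfDrama (amazon.com)",
       "My Life 1993 07e extra pdfDrama (amazon.com)")

# perl = TRUE is required: \\G, \\K and lookaheads are PCRE features
wrapped <- gsub("(?:(?=\\b\\d{4}\\b)|\\G(?!\\A))\\s*\\K\\b(\\S+)",
                "(\\1)", b, perl = TRUE)
wrapped
```

The chain of matches stops at (amazon.com) because there is no word boundary between the preceding space and the opening parenthesis, so the existing parentheses are left alone. Note that wrapping starts at the first 4-digit whole word in the string, so a title that itself contains such a word would likely be wrapped from that point on.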