unitedsaga unitedsaga - 6 months ago 54
R Question

extracting a word from a sentence in R

I am trying to extract the word that starts with certain letters. For instance, in this example I am trying to extract the words that start with 'AB'.

x = c("So much fun - AB22148",
"AC33648 does whatever",
"I know -AB11025 Failed",
"Nothing stalled - AB16228",
"Unable to do fdS2083D - Ab26604")

Num = character(0)
for (i in 1:length(x)) {
  # split the sentence into words, then keep the words containing "AB"
  y = unlist(strsplit(x[i], " "))
  Num[i] = grep("AB", y, perl = T, value = T, ignore.case = T)
}

There are a couple of issues (as you can probably tell): 1. If 'AB' is not present, I get an error because Num[i] cannot be assigned a zero-length value. 2. If I work around that (e.g. by replacing AC with AB), the 5th entry gives me 'Unable' instead of "Ab26604".

What I am looking for: 1. Can this be done without the loop (perhaps using one of the apply functions)? 2. How can I handle the 3rd and 5th cases? [I would like to remove the '-' sign; I can take care of this in a later step, but was wondering whether it can be done at the same time.]

Num (current output)
[1] "AB22148" " " "-AB11025" "AB16228" "Unable"

Num (required output)
[1] "AB22148" " " "AB11025" "AB16228" "Ab26604"

Thanks for all the help. I really appreciate it. Kindly let me know if you need additional clarification.


You can do something like the following:

library(stringr)  # provides str_extract(), regex() and str_replace_na()
str_extract(x, regex("AB[:alnum:]{5}", ignore_case = TRUE))

Which gives you:

"AB22148" NA        "AB11025" "AB16228" "Ab26604"

If you want to replace the NA by " " you can do:

str_replace_na(tmp, " ") # assuming tmp is the result from above

Which gives you:

"AB22148" " "       "AB11025" "AB16228" "Ab26604"