ash bounty - 21 days ago
R Question

Easy regex in gsub() function is not working

I have just started with programming in R. Currently, I am practicing feature engineering on the famous Titanic dataset.

Among other things, I want to extract each person's title from the names in my dataset.

I have these:

Montvila, Rev. Juozas
Johnston, Miss. Catherine Helen


And want to get these:

Rev.
Miss.


My own approach is not working, and I can't figure out what exactly the problem is:

gsub("([A-Za-z:space:]+, )|(\.[A-Za-z:space:]+)", "", data_raw$Name)


I hope somebody can help me! That would be great.

Kind regards,
Marcus

Answer

We can match one or more non-whitespace characters (\\S+) from the start (^) of the string, followed by one or more whitespace characters (\\s+), or (|) use a lookbehind ((?<=\\.)) to match everything after the . up to the end of the string, and replace either match with an empty string (""). The lookbehind requires perl = TRUE.

gsub("^\\S+\\s+|(?<=\\.).*$", "", str1, perl = TRUE)
#[1] "Rev."  "Miss."
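To see what each side of the | contributes, the two alternatives can be applied one at a time. This is only an illustrative sketch: the names step1 and step2 are made up here, and perl = TRUE is passed to both calls for consistency, although only the lookbehind strictly needs it.

```r
str1 <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")

# Alternative 1: drop the leading run of non-whitespace characters
# (the surname with its comma) and the whitespace that follows it.
step1 <- sub("^\\S+\\s+", "", str1, perl = TRUE)
# step1 is now c("Rev. Juozas", "Miss. Catherine Helen")

# Alternative 2: drop everything that comes after a literal dot.
# (?<=\\.) is a lookbehind, so the dot itself is not part of the match.
step2 <- sub("(?<=\\.).*$", "", step1, perl = TRUE)
# step2 is now c("Rev.", "Miss.")
```

Note that ^\\S+\\s+ only removes the first whitespace-delimited word, so a surname that itself contains a space would not be removed completely by this pattern.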

Another option is to capture the title as a group (([^.]+\\.)), match the rest of the string around it, and use the backreference (\\1) to that capture group as the replacement.

sub("^[^,]+,\\s+([^.]+\\.).*", "\\1", str1)
#[1] "Rev."  "Miss."
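As for why the pattern in the question fails, two things stand out. Inside a bracket expression a POSIX class has to keep its own brackets, so [A-Za-z:space:] matches letters plus the literal characters : s p a c e rather than whitespace; the class should be written [A-Za-z[:space:]]. In addition, "\." is not a valid escape in an R string literal, so the dot has to be written as "\\.". The sketch below (the names fixed and with_dot are made up for illustration) shows a repaired version of that attempt; it still loses the dot, because the second alternative consumes it.

```r
str1 <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")

# [[:space:]] is nested inside the bracket expression, and the dot is
# escaped with a double backslash.
fixed <- gsub("([A-Za-z[:space:]]+, )|(\\.[A-Za-z[:space:]]+)", "", str1)
# fixed is c("Rev", "Miss"): the dot is removed together with the first name

# One way to get the dot back is simply to append it again.
with_dot <- paste0(fixed, ".")
# with_dot is c("Rev.", "Miss.")
```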

data

str1 <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")
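The same call can be applied directly to the Name column from the question. The two-row data_raw below is a made-up stand-in for the real dataset, and the column name Title is an assumption chosen for illustration.

```r
# Hypothetical miniature version of the data frame from the question.
data_raw <- data.frame(
  Name = c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen"),
  stringsAsFactors = FALSE
)

# Keep only the captured title (text after ", " up to and including the
# first dot) and store it in a new column.
data_raw$Title <- sub("^[^,]+,\\s+([^.]+\\.).*", "\\1", data_raw$Name)
# data_raw$Title is c("Rev.", "Miss.")
```

sub() returns an element unchanged when the pattern does not match it, so names that do not follow the "Surname, Title. Given names" layout would come back as the full name and may need a separate check.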