I have just started programming in R and am currently practicing feature engineering on the well-known Titanic dataset.
Among other things, I want to extract each person's title from the names in my dataset.
The names look like this:
Montvila, Rev. Juozas
Johnston, Miss. Catherine Helen
This is what I have tried so far:
gsub("([A-Za-z:space:]+, )|(\.[A-Za-z:space:]+)", "", data_raw$Name)
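A likely reason the attempt above misbehaves, offered as a sketch rather than a definitive diagnosis: inside a bracket expression, `:space:` is read as the literal characters `:`, `s`, `p`, `a`, `c` and `e`, so whitespace is never matched. The POSIX class has to be written as `[:space:]` within the outer brackets, i.e. `[A-Za-z[:space:]]`, and in an R string the escaped dot needs a doubled backslash (`\\.`). The vector name `nms` is only for illustration.

```r
# Sample names copied from the question
nms <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")

# [[:space:]] is the POSIX class nested in a bracket expression;
# "\\." is a literal dot written as an R string
titles <- gsub("([A-Za-z[:space:]]+, )|(\\.[A-Za-z[:space:]]+)", "", nms)
titles  # "Rev" "Miss"
```

Note that this version still assumes names contain only ASCII letters and spaces; a hyphen or apostrophe in a name would stop the match.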
We can match one or more non-whitespace characters (`\\S+`) from the start (`^`) of the string, followed by one or more whitespace characters (`\\s+`), or (`|`) use a lookaround to match everything that follows the `.` until the end of the string, and replace it with blank (`""`).
gsub("^\\S+\\s+|(?<=\\.).*$", "", str1, perl = TRUE) # "Rev." "Miss."
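To see what each side of the alternation removes, the two halves of the pattern can be run separately on a single name. This is only an illustrative breakdown; the variable names `x`, `left_removed` and `right_removed` are not part of the original answer.

```r
x <- "Montvila, Rev. Juozas"

# Left alternative: strips the leading run of non-whitespace plus the space after it
left_removed <- sub("^\\S+\\s+", "", x)
left_removed   # "Rev. Juozas"

# Right alternative: the lookbehind (?<=\\.) requires a preceding dot,
# so everything after the first "." is removed (needs perl = TRUE)
right_removed <- sub("(?<=\\.).*$", "", x, perl = TRUE)
right_removed  # "Montvila, Rev."
```

Because `gsub` applies both alternatives across the whole string, only the title and its period remain.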
Another option is to capture the characters as a group (`([^.]+\\.)`) and, in the replacement, use the backreference (`\\1`) of that capture group.
sub("^[^,]+,\\s+([^.]+\\.).*", "\\1", str1) # "Rev." "Miss."
The data used in the examples above:
str1 <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")
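As a rough sketch of how either pattern could be used to build a feature column, the example below applies both to a small stand-in data frame. The data frame `data_raw` here is a hypothetical miniature, not the real dataset, and the column names `Title_gsub` and `Title_sub` are made up for illustration. The third name is added to show a surname containing a space, where the two patterns behave differently.

```r
# Hypothetical miniature of the data; only the Name column matters here
data_raw <- data.frame(
  Name = c("Montvila, Rev. Juozas",
           "Johnston, Miss. Catherine Helen",
           "Vander Planke, Mrs. Julius"),
  stringsAsFactors = FALSE
)

# Lookaround version: implicitly assumes the surname has no whitespace
data_raw$Title_gsub <- gsub("^\\S+\\s+|(?<=\\.).*$", "", data_raw$Name, perl = TRUE)

# Capture-group version: anchors on the comma after the surname instead
data_raw$Title_sub <- sub("^[^,]+,\\s+([^.]+\\.).*", "\\1", data_raw$Name)

data_raw[, c("Title_gsub", "Title_sub")]
```

On the third row the lookaround version leaves `Planke, Mrs.`, whereas the capture-group version returns `Mrs.`, so the comma-anchored pattern is probably the safer choice when surnames may contain spaces.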