hejseb - 1 year ago
R Question

regex for pattern between comma and period

After hours of googling and fruitless attempts, I'm hoping someone can help with this admittedly easy question (regexps are evidently fairly unfamiliar to me).

I have the following type of data:

name <- c("Doe, Mr. John")

and I want "Mr" from this, but the actual title varies. My main question is: how do I write a regular expression that captures just the "Mr" part, without anything else?

My current approach is as follows:

str_split(name, "[,\\s.]")[[1]][[3]]

and the best I managed to do using extraction was this:

str_extract(name, ", .*\\.")

I'm sure there's a simpler way. Can anyone help me?

Answer Source

You may match all letters before a dot:

> library(stringr)
> name <- c("Doe, Mr. John")
> str_extract(name, "\\p{L}+(?=\\.)")
[1] "Mr"

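Here, \\p{L}+ matches one or more letters, and (?=\\.) is a positive lookahead that requires a dot immediately after them without making the dot part of the match. Note that str_extract returns only the first such match in each string.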
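To see what the lookahead contributes, here is a small sketch that compares it with a pattern that simply consumes the dot. The variable names with_dot and without_dot are only illustrative:

```r
library(stringr)

name <- c("Doe, Mr. John")

# Consuming the dot makes it part of the returned match.
with_dot <- str_extract(name, "\\p{L}+\\.")

# The lookahead only checks that a dot follows, so the dot is left out.
without_dot <- str_extract(name, "\\p{L}+(?=\\.)")

print(with_dot)     # "Mr."
print(without_dot)  # "Mr"
```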

The same can be done with base R regmatches / regexpr using a PCRE regex (by passing a perl=TRUE argument to regexpr):

> regmatches(name, regexpr("\\p{L}+(?=\\.)", name, perl=TRUE))
[1] "Mr"
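One thing to watch for when the input is a vector: regmatches returns only the elements that matched, so the result can be shorter than the input. A minimal sketch, assuming a made-up vector names_vec in which one entry has no title:

```r
# Hypothetical input; the last element contains no "letters before a dot".
names_vec <- c("Doe, Mr. John", "Smith, Dr. Anna", "Madonna")

# regexpr() returns -1 for elements without a match.
m <- regexpr("\\p{L}+(?=\\.)", names_vec, perl = TRUE)

# Only the matching elements come back, so this has length 2, not 3.
matched_only <- regmatches(names_vec, m)

# To keep one result per input, fill an NA vector at the matching positions.
titles <- rep(NA_character_, length(names_vec))
titles[m != -1] <- matched_only

print(matched_only)  # "Mr" "Dr"
print(titles)        # "Mr" "Dr" NA
```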

A similar regex can be used with str_match to ensure we only match the word that follows a comma and optional whitespace and sits right before a dot. The [,2] index picks the column holding the first capture group:

> str_match(name, ",\\s*(\\p{L}+)\\.")[,2]
[1] "Mr"
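The stricter pattern matters when a surname itself contains a dot. The following sketch uses an invented vector, people, to compare the two approaches. With stringr, both functions return NA where nothing matches, so the output lines up with the input:

```r
library(stringr)

# Hypothetical names: one ordinary, one with a dotted surname, one without a title.
people <- c("Doe, Mr. John", "St. Clair, Mrs. Jane", "Madonna")

# Lookahead only: takes the first run of letters followed by a dot.
loose <- str_extract(people, "\\p{L}+(?=\\.)")

# Anchored on the comma: the capture group must come after ", ".
strict <- str_match(people, ",\\s*(\\p{L}+)\\.")[, 2]

print(loose)   # "Mr" "St"  NA
print(strict)  # "Mr" "Mrs" NA
```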
