darh78 - 3 months ago
R Question

Extract just the part of string that matches a regex pattern in R

I built a data frame, scraped automatically from a web page, in which one of the variables is a date in text form, such as “May 12”.

However, some observations come with extra characters (in some cases weird ones) attached after the date, for example: “May 20 õ”, “Dez 1”, “Oct 12ABCdáé”.
For those cases, I want to keep only the valid part of the value, for example “May 20” and “Oct 12”.

After searching for a solution several times and trying functions such as sub, gsub and grep, I could not get any of them to work.

I see that regular expressions have a steep learning curve, but with the tool http://regexr.com/ I was able to define a regular expression that matches the observations where the problem appears: ([A-Z]{1}[a-z]{2})\s\d+.*

At this moment, I have the following example:

vector = c("May 20", "Dez 1", "Oct 12ABCdáé")


And the last solution I tried is:

dateformat = gsub(pattern = "([A-Z]{1}[a-z]{2})\\s\\d+.*", replacement = "([A-Z]{1}[a-z]{2})\\s\\d+", x = vector)


But of course this replaces each element with the replacement text itself, "([A-Z]{1}[a-z]{2})\s\d+", with the backslashes dropped:

> dateformat
[1] "([A-Z]{1}[a-z]{2})sd+" "([A-Z]{1}[a-z]{2})sd+"
[3] "([A-Z]{1}[a-z]{2})sd+"
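
A minimal sketch of why this happens, using a made-up one-element example: the replacement argument is not parsed as a regex, so "\\d+" there is inserted as plain text, and the backslash before a non-digit is simply dropped.

```r
# Hypothetical one-element input, just for illustration
x <- "May 20"

# The pattern "\\d+" matches the digits "20", but the replacement "\\d+"
# is NOT a regex: it is inserted as text, with the backslash removed.
literal <- sub("\\d+", "\\d+", x)

print(literal)
```

Here the digits are replaced by the characters d and +, which mirrors the "sd+" seen in the output above.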


I really do not understand what I have to put in the replacement argument to remove the bad characters when they exist.

Answer

I added a capture group and a back-reference "\\1":

sub("^([A-Z]{1}[a-z]{2}\\s\\d+).*", "\\1", vector)
[1] "May 20" "Dez 1"  "Oct 12"

The replacement argument accepts back-references like "\\1", but not regex patterns like the one you used. A back-reference refers back to a capture group defined in the pattern. In this case the capture group is the abbreviated month and the day, which we enclosed in parentheses (..). Any text captured within those parentheses is returned when "\\1" is placed in the replacement argument, and the trailing .* outside the group matches the unwanted characters, so they are dropped.
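
To make the numbering of back-references concrete, here is a small sketch that splits the same pattern into two capture groups. The variable names (pattern, cleaned, swapped) are only illustrative, and the accented characters are written as Unicode escapes so the snippet does not depend on the file encoding.

```r
# Example vector from the question; \u00e1\u00e9 are the accented a and e
vector <- c("May 20", "Dez 1", "Oct 12ABCd\u00e1\u00e9")

# Two capture groups: \\1 is the month abbreviation, \\2 is the day number.
# The trailing .* sits outside both groups, so whatever it matches is dropped.
pattern <- "^([A-Z][a-z]{2})\\s(\\d+).*"

# Rebuild each value from the captured pieces only
cleaned <- sub(pattern, "\\1 \\2", vector)

# Back-references can be reordered freely in the replacement
swapped <- sub(pattern, "\\2 \\1", vector)

print(cleaned)
print(swapped)
```

With this input, cleaned holds "May 20", "Dez 1", "Oct 12" and swapped holds the same pieces in day-month order. Note that sub leaves an element unchanged when the pattern does not match it at all.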

This quick-start guide may help
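
As an alternative to replacing, base R can also extract the matched part directly with regexpr and regmatches. This is a different technique from the sub call above, shown only as a sketch under the assumption that every element contains a valid date at the start.

```r
# Example vector from the question; \u00e1\u00e9 are the accented a and e
vector <- c("May 20", "Dez 1", "Oct 12ABCd\u00e1\u00e9")

# regexpr() locates the first match in each element;
# regmatches() then pulls out exactly the matched text.
m <- regexpr("^[A-Z][a-z]{2}\\s\\d+", vector)
dates <- regmatches(vector, m)

print(dates)
```

One design point to keep in mind: regmatches returns only the elements that matched, so the result can be shorter than the input if some values contain no date, whereas sub always returns a vector of the same length.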
