gagandeep91 - 4 months ago
R Question

Regular expression: matching multiple words

I am using regular expressions in R to extract strings from a variable. The variable contains distinct values that look like:

MEDIUM /REGULAR INSEAM

XX LARGE /SHORT INSEAM

SMALL /32" INSM

X LARGE /30" INSM

I have to capture two things: the value before the / as a whole (SMALL, XX LARGE) and the string (alphabetic or numeric) after it. I don't want the " INSM or the INSEAM part.

The regular expression I am using for the first two is
([A-Z]\w+) \/([A-Z]\w+) INSEAM
and for the last two I am using
([A-Z]\w+) \/([0-9][0-9])[" INSM]
The part ([A-Z]\w+) only captures one word, so it works fine for MEDIUM and SMALL but fails for X LARGE, XX LARGE, etc. Is there a way to modify it to capture two words before the / character? Or is there a better way to do it?

Thanks in advance!

Answer

It seems you can use

(\w+(?: \w+)?) */ *(\w+)

See the regex demo

Pattern details:

  • (\w+(?: \w+)?) - Group 1, capturing one or more word chars followed by an optional sequence of a space + one or more word chars
  • " */ *" (space, *, /, space, *) - a / with zero or more spaces on either side
  • (\w+) - Group 2, capturing one or more word chars
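
To make the effect of the optional group concrete, here is a small sketch that runs the single-word pattern from the question and the pattern above against one of the two-word values. It uses base R's regexec() and regmatches() with perl = TRUE; the variable names (x, one_word, two_words, old, new) are made up for this illustration.

```r
x <- "XX LARGE /SHORT INSEAM"

# Pattern from the question: group 1 is a single word.
one_word  <- "([A-Z]\\w+) \\/([A-Z]\\w+) INSEAM"
# Pattern from this answer: group 1 may include a second word.
two_words <- "(\\w+(?: \\w+)?) */ *(\\w+)"

# Each result is a character vector: full match, group 1, group 2.
old <- regmatches(x, regexec(one_word,  x, perl = TRUE))[[1]]
new <- regmatches(x, regexec(two_words, x, perl = TRUE))[[1]]

old[2]  # "LARGE"    (the match starts at the last word before the slash)
new[2]  # "XX LARGE"
new[3]  # "SHORT"
```

Note that the question's pattern still matches; it just starts the match at LARGE, so the XX prefix is silently lost rather than reported as a failure.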

R code with stringr:

> library(stringr)
> v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM", "SMALL /32\" INSM", "X LARGE /30\" INSM")
> str_match(v, "(\\w+(?: \\w+)?) */ *(\\w+)")
     [,1]              [,2]       [,3]     
[1,] "MEDIUM /REGULAR" "MEDIUM"   "REGULAR"
[2,] "XX LARGE /SHORT" "XX LARGE" "SHORT"  
[3,] "SMALL /32"       "SMALL"    "32"     
[4,] "X LARGE /30"     "X LARGE"  "30"
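
If adding stringr is not an option, roughly the same result can be obtained with base R. The following is a minimal sketch, assuming every element of v matches the pattern; the names pattern, m, res and sizes are chosen here for illustration only.

```r
v <- c("MEDIUM /REGULAR INSEAM", "XX LARGE /SHORT INSEAM",
       "SMALL /32\" INSM", "X LARGE /30\" INSM")
pattern <- "(\\w+(?: \\w+)?) */ *(\\w+)"

# regexec() finds the full match and each capture group;
# regmatches() turns those positions into one character vector per input.
m <- regmatches(v, regexec(pattern, v, perl = TRUE))

# Stack the per-string vectors into a matrix: full match, group 1, group 2.
# Assumption: every string matched, so every vector has length 3.
res <- do.call(rbind, m)

sizes <- data.frame(size = res[, 2], inseam = res[, 3],
                    stringsAsFactors = FALSE)
sizes
```

Unlike str_match, which returns a row of NA for a string that does not match, regmatches returns an empty vector for it, and rbind would then drop that row. Real data with non-matching values would need a check on the lengths of m before stacking.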