Nitin Nitin - 2 months ago 10
R Question

Need regular expression logic

I have a vector a as follows:

a <- c("Rs. 360 Rs. 540 [-33% ]", "Rs. 213 Rs. 250 [-15% ]", "Rs. 430 Rs. 1030 [-58% ]")


The result I need is:

Rs.360, Rs.213, Rs.430


I have tried:

a <- gsub(" Rs*", "", a)

Answer

You may use a regex with capturing groups to grab the parts you need, then use backreferences in the replacement pattern to insert them back into the result:

sub("^\\s*(Rs\\.)\\s*(\\d+).*", "\\1\\2", a)

The regex matches:

  • ^ - start of string
  • \\s* - zero or more whitespace characters
  • (Rs\\.) - Group 1, capturing the literal Rs. sequence
  • \\s* - zero or more whitespace characters
  • (\\d+) - Group 2, capturing one or more digits
  • .* - the rest of the string to its end
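
To see what each group in the breakdown above captures, here is a minimal sketch using base R's regexec() and regmatches() on the first element only. The variable names pattern and parts are made up for illustration.

```r
a <- c("Rs. 360 Rs. 540 [-33% ]", "Rs. 213 Rs. 250 [-15% ]", "Rs. 430 Rs. 1030 [-58% ]")

# Same pattern as in the sub() call (name "pattern" is illustrative)
pattern <- "^\\s*(Rs\\.)\\s*(\\d+).*"

# regexec() records the positions of the full match and of each capturing group;
# regmatches() turns those positions into the matched text
parts <- regmatches(a[1], regexec(pattern, a[1]))[[1]]

# parts[1] is the whole match, parts[2] is Group 1, parts[3] is Group 2
print(parts)

# Pasting Group 1 and Group 2 together mirrors the "\\1\\2" replacement
print(paste0(parts[2], parts[3]))
```

Because the pattern ends in .*, the whole match covers the entire string, which is why sub() replaces everything with just the two groups.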

Tested code:

> a <- c("Rs. 360 Rs. 540 [-33% ]", "Rs. 213 Rs. 250 [-15% ]", "Rs. 430 Rs. 1030 [-58% ]")
> sub("^\\s*(Rs\\.)\\s*(\\d+).*", "\\1\\2", a)
[1] "Rs.360" "Rs.213" "Rs.430"
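
As an alternative sketch (not the approach used above), the first price can be extracted instead of replacing the rest of the string. This assumes every element contains at least one "Rs." followed by digits; regmatches() silently drops elements with no match, so the result could be shorter than the input otherwise. The names m and first_price are illustrative.

```r
a <- c("Rs. 360 Rs. 540 [-33% ]", "Rs. 213 Rs. 250 [-15% ]", "Rs. 430 Rs. 1030 [-58% ]")

# regexpr() locates only the first match in each element,
# so the second "Rs. ..." in each string is ignored
m <- regexpr("Rs\\.\\s*\\d+", a)
first_price <- regmatches(a, m)

# Remove the whitespace between "Rs." and the number
first_price <- gsub("\\s+", "", first_price)
print(first_price)
```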

Update

For an input like a <- c(" 360 540", " 213 250"), use sub("^\\D*(\\d+).*", "\\1", a).

> a <- c(" 360 540", " 213 250")
> sub("^\\D*(\\d+).*", "\\1", a)
[1] "360" "213"

The pattern ^\\D*(\\d+).* matches zero or more non-digit characters at the start of the string, then captures 1+ digits into Group 1, and then .* matches the rest of the string, so the whole string is replaced with the contents of Group 1.
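
One thing to keep in mind: when a string contains no digits, the pattern does not match and sub() returns that string unchanged. The following is a minimal sketch of a wrapper that returns NA in that case and converts the result to numeric. The helper name first_number and the sample input "no price" are hypothetical and not part of the original question.

```r
# Hypothetical helper: first run of digits in each string, NA when there is none
first_number <- function(x) {
  out <- sub("^\\D*(\\d+).*", "\\1", x)
  # sub() leaves non-matching strings untouched, so blank them out explicitly
  out[!grepl("\\d", x)] <- NA_character_
  as.numeric(out)
}

result <- first_number(c(" 360 540", " 213 250", "no price"))
print(result)
```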