jacky_learns_to_code - 2 months ago
R Question

Extract only 5-digit-number in a string

I have an address in which 81000 is the postal code (always a 5-digit number).

address <- "F47, First Floor, PTD 106273, Persiaran Indahpura Utama, Bandar Indahpura, 81000 Kulaijaya, Johor"


I am trying to extract the postal code using regex, and I have tried the following:

## postal code pattern
postal_pattern <- '\\d{5}'
## extract postal code
postal_code <- stringr::str_extract_all(address, postal_pattern)


However, I got the following output, which is partially correct:

> postal_code
[[1]]
[1] "10627" "81000"


How can I extract only 81000 using regex or any libraries?

Answer

I suggest extracting the last 5-digit number from the string:

> stringr::str_replace(address, ".*\\b(\\d{5})\\b.*", "\\1")
[1] "81000"

Or with base R sub:

> sub(".*\\b(\\d{5})\\b.*", "\\1", address)
[1] "81000"

This works because .* is greedy: it first matches the whole string (line) and then backtracks to accommodate the subsequent patterns, so \d{5} matches the last 5-digit number (as a whole word).
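To make the backtracking behaviour concrete, here is a small sketch that compares the greedy .* with its lazy counterpart .*?. The input string x is made up for illustration (it is not from the question), and perl = TRUE is used only so that the lazy quantifier is handled by PCRE.

```r
# Hypothetical input containing two whole-word 5-digit numbers.
x <- "Lot 12345, Jalan 5, 81000 Kulaijaya"

# Greedy .* consumes as much as possible before backtracking,
# so the capture group lands on the LAST 5-digit number.
last_match <- sub(".*\\b(\\d{5})\\b.*", "\\1", x, perl = TRUE)

# Lazy .*? consumes as little as possible,
# so the capture group lands on the FIRST 5-digit number.
first_match <- sub(".*?\\b(\\d{5})\\b.*", "\\1", x, perl = TRUE)

print(last_match)   # "81000"
print(first_match)  # "12345"
```

For the address in the question both variants happen to return 81000, because 106273 has six digits and is rejected by the word boundaries.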

Details

  • .* - any 0 or more chars, as many as possible, up to the last occurrence of the subsequent subpatterns (in the stringr version, . does not match line breaks; prepend the pattern with (?s) if you need to match them, too)
  • \\b - a leading word boundary (leading, because the following expected char is a digit)
  • (\\d{5}) - Group 1: five digits
  • \\b - a trailing word boundary
  • .* - the rest of the string (in the stringr version, prepend the pattern with (?s) if you need to match line breaks, too)
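
One caveat with the replacement approach: when the string contains no whole-word 5-digit number, sub() returns the input unchanged rather than a missing value. If that matters, a possible alternative is to collect all whole-word matches and keep the last one. The sketch below uses base R only; the helper name extract_postal_code is hypothetical, not part of any library.

```r
# Hypothetical helper: return the last whole-word 5-digit number in each
# element of x, or NA when an element contains none.
extract_postal_code <- function(x) {
  # \\b on both sides rejects 5-digit runs inside longer numbers (e.g. 106273).
  hits <- regmatches(x, gregexpr("\\b\\d{5}\\b", x, perl = TRUE))
  vapply(
    hits,
    function(h) if (length(h) > 0) h[length(h)] else NA_character_,
    character(1),
    USE.NAMES = FALSE
  )
}

address <- "F47, First Floor, PTD 106273, Persiaran Indahpura Utama, Bandar Indahpura, 81000 Kulaijaya, Johor"

print(extract_postal_code(address))
print(extract_postal_code(c(address, "no postal code here")))
```

Because the function is vectorised, it can be applied directly to a column of addresses; elements without a match come back as NA instead of the original text.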