Vas Vas - 2 months ago 12
Ruby Question

Fix regex to extract specific number formats

Ideally my regex should capture/extract all the following number formats:

500 /
500.55 /
500k /
500.55k /
500 to 600 /
500k to 600k /
500 to 600k /
500.55 to 600.55 /
500.55 to 600.55 k

I have a problem with my current regex, because if numbers like "700,000" or "800,000" or "8.54" are in the text then it splits up the numbers and captures:

700,000 => "700","000"
800,000. => "800" , "000." , "8.", "54"
8.54 => "8.", "54"


Any ideas what to change? Current regex:

(\d+(?:\.?\d*)?\s*k?(?:\-|to)\s*\d+(?:\.?\d*)\s*k?|\d+(?:\.?\d*)\s*k?)

Answer

I suggest using a few more optional groups instead of consecutive optional atoms, a [.,] character class instead of \. to allow both separators, and \p{Pd} to match any dash:

/\d+(?:[.,]\d+)*(?:\s*k)?(?:\s*(?:\p{Pd}|to)\s*\d+(?:[.,]\d+)*(?:\s*k)?)?/i

See the Rubular demo
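As a rough check of the pattern above, here is a minimal sketch using String#scan. The constant name and the sample text are made up for illustration; only the regex comes from the answer.

```ruby
# NUMBER_RANGE is a hypothetical name; the pattern is the one suggested above.
NUMBER_RANGE = /\d+(?:[.,]\d+)*(?:\s*k)?(?:\s*(?:\p{Pd}|to)\s*\d+(?:[.,]\d+)*(?:\s*k)?)?/i

# Made-up sample text mixing ranges, thousands separators and decimals.
text = "Salary 500k to 600k, bonus 700,000 or 8.54 percent, range 500.55 - 600.55 k"

# The pattern has no capturing groups, so scan returns the whole matches.
matches = text.scan(NUMBER_RANGE)
p matches
# => ["500k to 600k", "700,000", "8.54", "500.55 - 600.55 k"]
```

Note that "700,000" and "8.54" now come back as single matches instead of being split at the separator.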

If you want to make it more precise, split (?:[.,]\d+)* into (?:,\d+)*(?:\.\d+)? so that comma-separated digit groups come first and at most one decimal part follows:

/\d+(?:,\d+)*(?:\.\d+)?(?:\s*k)?(?:\s*(?:\p{Pd}|to)\s*\d+(?:,\d+)*(?:\.\d+)?(?:\s*k)?)?/i
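To see what a stricter number part changes, the sketch below compares the permissive form (?:[.,]\d+)* with a form that allows comma groups followed by at most one decimal part, (?:,\d+)*(?:\.\d+)?. The constant names and inputs are illustrative assumptions.

```ruby
# LOOSE and STRICT are hypothetical names for the two variants.
LOOSE  = /\d+(?:[.,]\d+)*(?:\s*k)?(?:\s*(?:\p{Pd}|to)\s*\d+(?:[.,]\d+)*(?:\s*k)?)?/i
STRICT = /\d+(?:,\d+)*(?:\.\d+)?(?:\s*k)?(?:\s*(?:\p{Pd}|to)\s*\d+(?:,\d+)*(?:\.\d+)?(?:\s*k)?)?/i

# Both variants accept a well-formed number with thousands separators.
p "1,234,567.89".scan(LOOSE)   # => ["1,234,567.89"]
p "1,234,567.89".scan(STRICT)  # => ["1,234,567.89"]

# Only the loose variant treats a version-like string as one number.
p "1.2.3".scan(LOOSE)          # => ["1.2.3"]
p "1.2.3".scan(STRICT)         # => ["1.2", "3"]
```

Which variant fits better depends on the input: the strict one rejects repeated decimal points, but neither checks that comma groups contain exactly three digits.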

Details:

  • \d+ - 1 or more digits
  • (?:[.,]\d+)* - 0+ sequences of . or , with 1 or more digits after
  • (?:\s*k)? - an optional sequence of 0+ whitespace + k / K
  • (?:\s*(?:\p{Pd}|to)\s*\d+(?:[.,]\d+)*(?:\s*k)?)? - an optional sequence of:
    • \s*(?:\p{Pd}|to)\s* - any dash (\p{Pd}) or to, surrounded by 0+ whitespace characters
    • \d+(?:[.,]\d+)*(?:\s*k)? - see above.
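The breakdown above can also be written as a commented pattern in extended mode (the x flag), with the number part defined once and interpolated twice. This is only a sketch of an equivalent layout; the names NUMBER and RANGE are assumptions, not part of the original answer.

```ruby
# NUMBER is a hypothetical name for the part that matches a single number.
NUMBER = /
  \d+            # 1 or more digits
  (?:[.,]\d+)*   # 0+ groups of . or , followed by 1 or more digits
  (?:\s*k)?      # optional whitespace followed by k or K
/xi

# RANGE is a hypothetical name: a number, optionally followed by
# a dash or "to" and a second number.
RANGE = /
  #{NUMBER}
  (?:
    \s*(?:\p{Pd}|to)\s*   # any dash or "to", surrounded by optional whitespace
    #{NUMBER}
  )?
/xi

p "500.55 to 600.55 k".scan(RANGE)
p "500\u2013600K".scan(RANGE)   # en dash, matched via \p{Pd}
```

Interpolating a Regexp object embeds it together with its own flags, so the number part stays case-insensitive inside the larger pattern.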