user3449212 - 2 months ago
Python Question

Using regex to extract all digit and word numbers

I am trying to extract all numbers from a text, both those written as words and those written as digits.

text = 'one two three 10 number'
numbers = "(^a(?=\s)|one|two|three|four|five|six|seven|eight|nine|ten| \
eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen| \
eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty| \
ninety|hundred|thousand)"

print re.search(numbers, text).group(0)


This gives me only the first number word.

My expected result is ['one', 'two', 'three', '10'].

How can I modify it so that I get all number words as well as digit numbers in a list?

Answer

There are several issues here:

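  • The pattern should be used with the VERBOSE flag (add (?x) at the start); without it, the space before each line-continuation backslash stays in the pattern as a literal space at the start of the next alternative
  • nine will match the nine in ninety, so you should either put the longer values first or use word boundaries (\b)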
  • Declare the pattern with a raw string literal to avoid issues like parsing \b as a backspace and not a word boundary
  • To match digits, you may add a |\d+ branch to your number matching group
  • To match multiple non-overlapping occurrences of the substrings inside the input string, you need to use re.findall (or re.finditer), not re.search.
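
The first three points can be checked in isolation. The snippet below is only a minimal sketch; the variable names are illustrative and do not come from the question.

```python
import re

# A backslash line continuation inside a normal string keeps the space
# that precedes it, so the second alternative becomes " eleven".
continued = "ten| \
eleven"
with_stray_space = re.findall(continued, "ten eleven")

# Without word boundaries, the shorter alternative wins inside a longer word.
unanchored = re.findall(r"nine|ninety", "ninety")

# With \b on both sides, "nine" can no longer match inside "ninety".
anchored = re.findall(r"\b(?:nine|ninety)\b", "ninety")

# In a normal (non-raw) string literal, "\b" is the backspace character;
# in a raw string it stays a backslash followed by "b".
is_backspace = "\b" == "\x08"
raw_length = len(r"\b")

print(repr(continued))    # 'ten| eleven'
print(with_stray_space)   # ['ten', ' eleven']
print(unanchored)         # ['nine']
print(anchored)           # ['ninety']
print(is_backspace, raw_length)  # True 2
```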

Here is my suggestion:

import re
text = 'one two three 10 number eleven eighteen ninety  \n '
numbers = r"""(?x)
            (
              ^a(?=\s)
              |
              \b
              (?:
                  one|two|three|four|five|six|seven|eight|nine|ten| 
                  eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen| 
                  eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty| 
                  ninety|hundred|thousand|\d+
              )
              \b
            )"""

print(re.findall(numbers, text))

See Python demo

And here is a regex demo.
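
The other option from the list above, putting the longer values first, can be sketched with re.finditer for cases where match positions are needed as well. This is an illustration under stated assumptions, not a replacement for the suggestion: NUMBER_WORDS and find_numbers are hypothetical names, the ^a(?=\s) branch is left out, and no word boundaries are used.

```python
import re

# Hypothetical word list mirroring the alternatives in the pattern above.
NUMBER_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety", "hundred", "thousand",
]

# Longest alternatives first, so "ninety" is tried before "nine".
ordered = sorted(NUMBER_WORDS, key=len, reverse=True)
pattern = re.compile("|".join(map(re.escape, ordered)) + r"|\d+")


def find_numbers(text):
    """Return (start offset, matched text) pairs in order of appearance."""
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


print(find_numbers("one two three 10 number eleven eighteen ninety"))

# Without \b the pattern still matches inside unrelated words.
print(find_numbers("money"))  # [(1, 'one')]
```

Unlike re.findall, re.finditer yields match objects, so offsets are available through m.start() and m.end(). Ordering by length fixes the nine/ninety problem but, as the last call shows, only word boundaries prevent matches inside other words.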