A. Garnisz - 2 months ago
Python Question

Parsing numbers in a given range with pyparsing

How can I extract numbers in a given range using pyparsing?
I tried:

from pyparsing import Word, nums

# Number lower than 12:
number = Word(nums).addCondition(lambda tokens: int(tokens[0]) < 12)

test_data = "10 23 11 14 115"
print(number.searchString(test_data))


but it returns:

[['10'], ['3'], ['11'], ['4'], ['5']]


What I want is:

[['10'], ['11']]


A more specific example:
I want to extract all numbers that look like part of a date and ignore the others.
So, from this input:

"""
This is a date: 12 03 2008
This too: 03 12 2008
And this not, values are too large: 123 333 11
"""


I want to get:

[[12, 3, 2008], [3, 12, 2008]]

Answer

The main issue here is that searchString (and the underlying scanString) go through the input string character by character looking for matches. So in your input (with position header for reference):

          1
012345678901234 <- position
10 23 11 14 115

searchString goes through the following steps:

  • finds number "10" at position 0, this matches the "less than 12" condition, and so this is a match
  • advance to position 2
  • skipping whitespace, advance to position 3
  • finds number "23" at position 3, but this fails the condition
  • advance one place to position 4
  • finds number "3" at position 4, this matches the condition, so is accepted as a match
  • advance to position 5, then skip whitespace to position 6
  • finds number "11" at position 6, this is a match, advance to position 8
  • skips whitespace, advance to position 9
  • finds number "14", this fails the condition
  • advance one place to position 10
  • finds number "4", this passes the condition, so accepted as a match
  • advances and finds number "115", and fails
  • advances one place and finds the number "15", and fails
  • advances one place and finds the number "5", and accepts as a match

giving the results as you posted, [['10'], ['3'], ['11'], ['4'], ['5']].
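
The walk-through above can be checked by asking pyparsing where each match starts and ends. This is a minimal sketch that reuses the question's expression and calls scanString, which yields a (tokens, start, end) tuple for each accepted match; the variable name matches is only illustrative. Recent pyparsing releases also provide snake_case spellings such as scan_string.

```python
from pyparsing import Word, nums

# Same expression as in the question: digits, accepted only if the value is below 12.
number = Word(nums).addCondition(lambda tokens: int(tokens[0]) < 12)
test_data = "10 23 11 14 115"

# Collect the matched text together with its start and end positions.
matches = [(tokens[0], start, end)
           for tokens, start, end in number.scanString(test_data)]

for text, start, end in matches:
    print(text, start, end)
# 10 0 2
# 3 4 5
# 11 6 8
# 4 10 11
# 5 14 15
```

The start positions 4, 10 and 14 fall inside "23", "14" and "115", which is where the unwanted fragments come from.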

A quick solution is to change your definition of number to add asKeyword=True:

number = Word(nums, asKeyword=True).addCondition(lambda tokens: int(tokens[0]) < 12)

Matching as a keyword forces the expression to match only a complete word, not a run of digits that begins or ends inside a longer number. In your case, this prevents accidental parsing of the '3' in '23', the '4' in '14', and so on, and gives your desired result of [['10'], ['11']].
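
The same idea may carry over to the date-like example from the question. The sketch below is one possible approach, not the only one: ranged_int is a hypothetical helper, and the ranges (1 to 31 for the first two fields, 1900 to 2100 for the year) are assumptions chosen for illustration, since the question does not say which field is the day and which is the month.

```python
from pyparsing import Word, nums

# Keyword-style number with the question's "less than 12" condition attached.
number = Word(nums, asKeyword=True).addCondition(lambda tokens: int(tokens[0]) < 12)
simple_result = number.searchString("10 23 11 14 115").asList()
print(simple_result)  # [['10'], ['11']]


def ranged_int(low, high):
    """Hypothetical helper: a whole-word integer restricted to low..high."""
    expr = Word(nums, asKeyword=True)
    # Convert first, so the condition below sees an int ('03' -> 3).
    expr.setParseAction(lambda tokens: int(tokens[0]))
    expr.addCondition(lambda tokens: low <= tokens[0] <= high)
    return expr


# Assumed ranges, not taken from the question.
day_or_month = ranged_int(1, 31)
year = ranged_int(1900, 2100)
date = day_or_month + day_or_month + year

text = """
This is a date: 12 03 2008
This too: 03 12 2008
And this not, values are too large: 123 333 11
"""
date_result = date.searchString(text).asList()
print(date_result)  # [[12, 3, 2008], [3, 12, 2008]]
```

The three fields are left ungrouped on purpose: searchString already wraps each match in its own list, so adding Group would nest every date one level deeper.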