ahabos - 1 year ago
Python Question

Matching dates with regular expressions in Python?

I know that there are similar questions to mine that have been answered, but after reading through them I still don't have the solution I'm looking for.

Using Python 3.2.2, I need to match "Month, Day, Year", where the month is spelled out and the day is two digits no greater than 30 or 31 depending on the month, 28 for February, or 29 for February in a leap year (basically, a real and valid date).

This is what I have so far:

pattern = "(January|February|March|April|May|June|July|August|September|October|November|December)[,][ ](0[1-9]|[12][0-9]|3[01])[,][ ]((19|20)[0-9][0-9])"
expression = re.compile(pattern)
matches = expression.findall(sampleTextFile)

I'm still not very familiar with regex syntax, so there may be unnecessary characters in there (the [,][ ] for the comma and space feels like the wrong way to go about it). When I try to match "January, 26, 1991" in my sample text file and print the items in "matches", I get ('January', '26', '1991', '19').

Why does the extra '19' appear at the end?

Also, what could I add to or change in my regex so that it validates dates properly? My plan right now is to accept nearly all dates and weed out the invalid ones later in ordinary code, comparing the day group against the month and year groups to check that the day does not exceed 31, 30, 29, or 28 as appropriate.

Any help would be much appreciated including constructive criticism on how I am going about designing my regex.

Answer Source

Here's one way to make a regular expression that will match any date of your desired format (though you could obviously tweak whether commas are optional, add month abbreviations, and so on):

import re

years = r'((?:19|20)\d\d)'
pattern = r'(%%s) +(%%s), *%s' % years

thirties = pattern % (
     "September|April|June|November",
     r'0?[1-9]|[12]\d|30')

thirtyones = pattern % (
     "January|March|May|July|August|October|December",
     r'0?[1-9]|[12]\d|3[01]')

fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4))

feb = r'(February) +(?:%s|%s)' % (
     r'(?:(0?[1-9]|1\d|2[0-8]), *%s)' % years, # 1-28 any year
     r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours)  # 29 leap years only

result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(result)
print(result)

Then we have:

>>> r.match('January 30, 2001') is not None
True
>>> r.match('January 31, 2001') is not None
True
>>> r.match('January 32, 2001') is not None
False
>>> r.match('February 32, 2001') is not None
False
>>> r.match('February 29, 2001') is not None
False
>>> r.match('February 28, 2001') is not None
True
>>> r.match('February 29, 2000') is not None
True
>>> r.match('April 30, 1908') is not None
True
>>> r.match('April 31, 1908') is not None
False
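
Since the pattern is an alternation of three branches, each with its own capture groups, a successful match leaves the groups of the branches that did not match as None. Here is a minimal sketch of pulling out just the month, day, and year. Note that parse_date is a hypothetical helper name of mine, and the pattern is rebuilt inside the snippet only so that it runs on its own:

```python
import re

# Rebuild the same pattern so this snippet is self-contained.
years = r'((?:19|20)\d\d)'
pattern = r'(%%s) +(%%s), *%s' % years
thirties = pattern % ("September|April|June|November",
                      r'0?[1-9]|[12]\d|30')
thirtyones = pattern % ("January|March|May|July|August|October|December",
                        r'0?[1-9]|[12]\d|3[01]')
fours = '(?:%s)' % '|'.join('%02d' % x for x in range(4, 100, 4))
feb = r'(February) +(?:%s|%s)' % (
    r'(?:(0?[1-9]|1\d|2[0-8]), *%s)' % years,       # 1-28, any year
    r'(?:(29), *((?:(?:19|20)%s)|2000))' % fours)   # 29, leap years only
r = re.compile('|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb)))


def parse_date(text):
    """Hypothetical helper: return (month, day, year) as strings, or None."""
    m = r.match(text)
    if m is None:
        return None
    # Only the branch that matched fills its groups; the others stay None.
    return tuple(g for g in m.groups() if g is not None)
```

Keep in mind that r.match only anchors at the start of the string, so trailing text after the year is not rejected by this sketch.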

And what is this glorious regexp, you may ask?

>>> print(result)
(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:(February) +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))

(I initially intended to do a tongue-in-cheek enumeration of the possible dates, but I basically ended up hand-writing that whole gross thing except for the multiples of four, anyway.)
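
As for the extra '19' in the question: findall returns one item per capturing group, and (19|20) is a capturing group nested inside the year group. Writing it as (?:19|20) makes it non-capturing. If you would rather stick with your original plan (accept nearly all dates, then weed out the bad ones in ordinary code), a sketch along these lines might work. MONTHS, loose, and valid_dates are names I made up for illustration, and the check leans on datetime.date raising ValueError for impossible dates:

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# The question's "Month, DD, YYYY" format, with the inner year group made
# non-capturing so findall() yields exactly (month, day, year).
loose = re.compile(
    r"(%s), (0[1-9]|[12][0-9]|3[01]), ((?:19|20)[0-9]{2})" % "|".join(MONTHS))


def valid_dates(text):
    """Hypothetical helper: keep only the matches that are real dates."""
    found = []
    for month, day, year in loose.findall(text):
        try:
            # date() raises ValueError for things like February 30.
            date(int(year), MONTHS.index(month) + 1, int(day))
        except ValueError:
            continue
        found.append((month, day, year))
    return found
```

This keeps the regex short and leaves the month-length and leap-year rules to the standard library instead of encoding them in the pattern.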
