user3191569 - 1 month ago
Python Question

nltk custom grammar for chunking dates using RegexpParser

Following the information extraction approach from this blog post, I'm trying to define a grammar that adds dates as a new chunk type. This is the grammar I have so far:

grammar = r"""
NBAR:
{<NN.*|JJ>*<NN.*>} # Nouns and Adjectives, terminated with Nouns

NP:
{<NBAR>}
{<NBAR><IN><NBAR>} # Above, connected with in/of/etc...
DATE -> MONTH SEP DAY SEP YEAR
SEP -> "/"
MONTH -> DIGIT | DIGIT DIGIT
DAY -> DIGIT | DIGIT DIGIT
YEAR -> DIGIT DIGIT DIGIT DIGIT
DIGIT -> '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '0'


But this throws an "illegal chunk pattern" error when I call chunker = nltk.RegexpParser(grammar). Any ideas how I can include the dates? They are always represented either as 8 digits (DD/MM/YYYY) or in the long form, where the month is spelled out and the day is followed by the ordinal indicator (st, nd, or th), so that the result would be DDthMONTHYYYY.

Answer

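You are mixing apples and oranges. Only your first two expansions (NBAR and NP) are valid nltk RegexpParser rules; the remaining lines are written in context-free grammar notation, so you get an error on the third (the DATE line). Convert the rest to the same format: change the separator from -> to :, then write the expansions as RegexpParser expressions, i.e. tag patterns wrapped in curly braces. Note that you are working with a chunker, not a hierarchical parser: its rules match sequences of part-of-speech tags on whole tokens, not individual characters, so a character-level rule such as DIGIT has no direct equivalent. (See the RegexpParser documentation, and also all of Chapter 7 of the NLTK book.)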
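As a rough illustration of that conversion, here is a minimal sketch, not a definitive solution. It assumes one workaround that is not in the question: since chunk rules only see tags, date-like tokens are first given custom tags with plain regular expressions, and the grammar then chunks on those tags. The helper retag, the tag names NUMDATE, DAY, MONTH and YEAR, the regular expressions (including the extra "rd" suffix), and the hand-tagged sample sentence are all hypothetical and chosen for the example. In practice the input would come from nltk.pos_tag.

```python
import re
import nltk

# Hypothetical patterns for the two date formats described in the question.
NUMERIC_DATE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")   # e.g. 03/06/2014
ORDINAL_DAY = re.compile(r"^\d{1,2}(st|nd|rd|th)$")     # e.g. 12th
YEAR = re.compile(r"^\d{4}$")                           # e.g. 2014
MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}


def retag(tagged):
    """Swap the POS tag of date-like tokens for a custom tag (assumed scheme)."""
    result = []
    for word, tag in tagged:
        if NUMERIC_DATE.match(word):
            tag = "NUMDATE"
        elif ORDINAL_DAY.match(word):
            tag = "DAY"
        elif word in MONTHS:   # naive: a sentence-initial "May" is retagged too
            tag = "MONTH"
        elif YEAR.match(word):
            tag = "YEAR"
        result.append((word, tag))
    return result


# Every stage uses "LABEL:" followed by {tag pattern} rules.
grammar = r"""
    DATE:
        {<NUMDATE>}             # 03/06/2014
        {<DAY><MONTH><YEAR>}    # 12th May 2014
    NBAR:
        {<NN.*|JJ>*<NN.*>}      # Nouns and Adjectives, terminated with Nouns
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}      # Above, connected with in/of/etc...
"""

chunker = nltk.RegexpParser(grammar)

# Hand-tagged sample so the sketch runs without downloading a tagger model.
tagged = [("The", "DT"), ("meeting", "NN"), ("on", "IN"), ("12th", "CD"),
          ("May", "NNP"), ("2014", "CD"), ("moved", "VBD"), ("to", "TO"),
          ("03/06/2014", "CD")]

tree = chunker.parse(retag(tagged))

# Collect the text of every DATE chunk, in sentence order.
dates = [" ".join(word for word, _ in subtree.leaves())
         for subtree in tree.subtrees(lambda t: t.label() == "DATE")]
print(dates)
```

The retagging step is the design choice to weigh here: it keeps the grammar short, but it relies on the tokenizer keeping 03/06/2014 and 12th as single tokens. One detail worth knowing when writing such grammars is that a colon on a rule line, even inside a comment, is read as the start of a new stage, so the comments above avoid colons.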