Rahul Jain - 1 month ago
Python Question

Python re.findall() returning empty list

I am trying to match some words with a regex and have written some Python code for that. The weird thing is that re.findall() returns an empty list, even though the same patterns show matches against the same text file on regxr.com. Here is the code:

import re

pat1 = '(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'
pat2 = '(\S+?)_(?:RR|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat3 = '(\S+?)_(?:JJ)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat4 = '(\S+?)_(?:NN|NNS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat5 = '(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b'

def process_file(content):
    res = []
    for line in content:
        matches = re.findall(pat1,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat2,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat3,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat4,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
        matches = re.findall(pat5,line)
        for m in matches:
            m = (m[0],m[1])
            phrase = '%s %s' % m
            res.append(phrase)
    return res

def main(path):
    contents = []
    f = open(path)
    for line in f:
        contents.append(line)
    f.close()
    result = process_file(contents)
    print result


This is the text file I am using:


sydney_NN_B-NP lumet_NN_I-NP is_VBZ_B-VP the_DT_B-NP director_NN_I-NP whose_WP$_B-NP work_NN_I-NP happens_VBZ_B-VP to_TO_I-VP be_VB_I-VP of_IN_B-PP varied_VBN_B-NP quality_NN_I-NP ._._B-O
he_PRP_B-NP is_VBZ_B-VP praised_VBN_I-VP for_IN_B-PP some_DT_B-NP of_IN_B-PP the_DT_B-NP most_RBS_I-NP important_JJ_I-NP films_NNS_I-NP of_IN_B-PP the_DT_B-NP previous_JJ_I-NP decades_NNS_I-NP ,_,_B-O like_IN_B-PP twelve_CD_B-NP angry_JJ_I-NP men_NNS_I-NP ,_,_B-O serpico_NN_B-NP or_CC_B-O the_DT_B-NP verdict_NN_I-NP ._._B-O
but_CC_B-O ,_,_I-O in_IN_B-PP the_DT_B-NP same_JJ_I-NP time_NN_I-NP ,_,_B-O almost_RB_B-NP any_DT_I-NP of_IN_B-PP such_JJ_B-NP pearls_NNS_I-NP is_VBZ_B-VP followed_VBN_I-VP by_IN_B-PP stinkers_NNS_B-NP that_WDT_B-NP hamper_VBP_B-VP lumet's_JJ_B-NP reputation_NN_I-NP ._._B-O
a_DT_B-NP stranger_NN_I-NP among_IN_B-PP us_PRP_B-NP ,_,_B-O 1992_CD_B-NP rip-off_NN_I-NP of_IN_B-PP peter_NN_B-NP weir's_JJ_I-NP witness_NN_I-NP ,_,_B-O belongs_VBZ_B-VP to_TO_B-PP the_DT_B-NP latter_NN_I-NP category_NN_I-NP ._._B-O
the_DT_B-NP heroine_NN_I-NP of_IN_B-PP this_DT_B-NP movie_NN_I-NP is_VBZ_B-VP emily_JJ_B-NP eden_FW_I-NP (_(_B-O melanie_JJ_B-NP griffith_NN_I-NP )_)_B-O ,_,_I-O tough_JJ_B-NP lady_NN_I-NP cop_NN_I-NP who_WP_B-NP sometimes_RB_B-ADVP shows_VBZ_B-VP too_RB_B-NP much_JJ_I-NP enthusiasm_NN_I-NP in_IN_B-PP battling_VBG_B-VP bad_JJ_B-NP guys_NNS_I-NP on_IN_B-PP the_DT_B-NP streets_NNS_I-NP of_IN_B-PP new_JJ_B-NP york_NN_I-NP ._._B-O
during_IN_B-PP one_CD_B-NP of_IN_B-PP such_JJ_B-NP actions_NNS_I-NP ,_,_B-O her_PRP$_B-NP partner_NN_I-NP nick_NN_I-NP (_(_B-O jamey_JJ_B-NP sheridan_NNS_I-NP )_)_B-O got_VBD_B-VP hurt_VBN_I-VP and_CC_B-O as_IN_B-PP a_DT_B-NP result_NN_I-NP ,_,_B-O she_PRP_B-NP becomes_VBZ_B-VP depressed_JJ_B-ADJP ._._B-O
in_IN_B-PP order_NN_B-NP to_TO_B-VP help_VB_I-VP her_PRP_B-NP recover_VB_B-VP ,_,_B-O bosses_NNS_B-NP give_VBP_B-VP her_PRP_B-NP rather_RB_I-NP easy_JJ_I-NP task_NN_I-NP of_IN_B-PP locating_VBG_B-VP missing_VBG_B-NP jeweller_NNS_I-NP who_WP_B-NP belonged_VBD_B-VP to_TO_B-PP hassidic_JJ_B-NP jew_NN_I-NP community_NN_I-NP ._._B-O
emily_NN_B-NP starts_VBZ_B-VP investigation_NN_B-NP and_CC_B-O soon_RB_B-VP realises_VBZ_I-VP that_IN_B-SBAR the_DT_B-NP case_NN_I-NP involves_VBZ_B-VP murder_NN_B-NP ._._B-O
concluding_VBG_B-VP that_IN_B-SBAR the_DT_B-NP perpetrator_NN_I-NP belongs_VBZ_B-VP to_TO_B-PP community_NN_B-NP ,_,_B-O she_PRP_B-NP decides_VBZ_B-VP to_TO_I-VP go_VB_I-VP undercover_JJ_B-ADJP ._._B-O
that_DT_B-NP isn't_RB_B-O easy_JJ_B-ADJP ,_,_B-O because_IN_B-SBAR her_PRP$_B-NP modern_JJ_I-NP manners_NNS_I-NP are_VBP_B-VP colliding_VBG_I-VP with_IN_B-PP traditionalist_NN_B-NP ways_NNS_I-NP ._._B-O
things_NNS_B-NP get_VBP_B-VP even_RB_B-NP more_RBR_B-ADJP complicated_JJ_I-ADJP when_WRB_B-ADVP she_PRP_B-NP develops_VBZ_B-VP feelings_NNS_B-NP for_IN_B-PP young_JJ_B-NP cabalistic_JJ_I-NP scholar_NN_I-NP ariel_NN_I-NP (_(_B-O eric_JJ_B-NP thal_NN_I-NP )_)_B-O ._._I-O
using_VBG_B-VP peter_NN_B-NP weir's_JJ_I-NP formula_NN_I-NP isn't_:_B-O the_DT_B-NP greatest_JJS_I-NP flaw_NN_I-NP of_IN_B-PP this_DT_B-NP film_NN_I-NP ._._B-O
even_RB_B-NP the_DT_I-NP lame_JJ_I-NP and_CC_I-NP unispiring_JJ_I-NP crime_NN_I-NP mystery_NN_I-NP subplot_NN_I-NP works_VBZ_B-VP to_TO_B-PP the_DT_B-NP certain_JJ_I-NP extent_NN_I-NP ._._B-O
but_CC_B-O the_DT_B-NP worst_JJS_I-NP insult_NN_I-NP to_TO_B-PP viewer's_JJ_B-NP audience_NN_I-NP is_VBZ_B-VP terrible_JJ_B-NP miscasting_NN_I-NP of_IN_B-PP melanie_JJ_B-NP griffith_NN_I-NP ._._B-O
the_DT_B-NP author_NN_I-NP of_IN_B-PP this_DT_B-NP review_NN_I-NP never_RB_B-ADVP liked_VBD_B-VP this_DT_B-NP actress_NN_I-NP very_RB_B-ADVP much_RB_I-ADVP ,_,_B-O but_CC_I-O she_PRP_B-NP was_VBD_B-VP at_IN_B-ADVP least_JJS_I-ADVP tolerable_JJ_B-ADJP in_IN_B-PP some_DT_B-NP of_IN_B-PP her_PRP$_B-NP roles_NNS_I-NP ._._B-O
role_NN_B-NP of_IN_B-PP emily_JJ_B-NP eden_NNS_I-NP ,_,_B-O unfortunately_RB_B-ADVP ,_,_B-O isn't_VBZ_I-O one_CD_B-NP of_IN_B-PP them_PRP_B-NP ._._B-O
first_RB_B-ADVP of_IN_B-PP all_DT_B-NP ,_,_B-O she_PRP_B-NP can't_MD_B-VP pass_VB_I-VP for_IN_B-PP tough_JJ_B-NP nypd_JJ_I-NP street_NN_I-NP fighter_NN_I-NP ,_,_B-O and_CC_I-O her_PRP$_B-NP attempt_NN_I-NP to_TO_B-VP pass_VB_I-VP for_IN_B-PP orthodox_JJ_B-NP jewish_JJ_I-NP woman_NN_I-NP isn't_RB_B-O much_RB_B-ADJP better_JJR_I-ADJP ._._B-O
screenplay_NN_B-NP by_IN_B-PP robert_JJ_B-NP j_NN_I-NP ._._B-O avrech_NNS_B-NP makes_VBZ_B-VP things_NNS_B-NP even_RB_B-ADJP worse_JJR_I-ADJP with_IN_B-PP some_DT_B-NP formulaic_JJ_I-NP red_JJ_I-NP herring_NN_I-NP subplots_NNS_I-NP (_(_B-O scene_NN_B-NP involving_VBG_B-VP two_CD_B-NP italian_JJ_I-NP gangsters_NNS_I-NP was_VBD_B-VP almost_RB_B-ADJP too_RB_I-ADJP painful_JJ_I-ADJP to_TO_B-VP watch_VB_I-VP )_)_B-O ._._I-O
but_CC_B-O ,_,_I-O on_IN_B-PP the_DT_B-NP other_JJ_I-NP hand_NN_I-NP ,_,_B-O other_JJ_B-NP actors_NNS_I-NP are_VBP_B-VP more_RBR_B-ADJP convincing_JJ_I-ADJP (_(_B-O lee_NN_B-NP richardson_NN_I-NP as_IN_B-PP an_DT_B-NP old_JJ_I-NP rabbi_NN_I-NP ,_,_B-O thal_JJ_B-ADJP as_IN_B-PP ariel_NN_B-NP and_CC_B-O charming_JJ_B-NP mia_NN_I-NP sara_NN_I-NP as_IN_B-PP his_PRP$_B-NP intended_VBN_I-NP bride_NN_I-NP )_)_B-O ,_,_I-O and_CC_I-O the_DT_B-NP photography_NN_I-NP by_IN_B-PP andrzej_JJ_B-NP bartkowiak_NN_I-NP very_RB_B-ADVP effectively_RB_I-ADVP creates_VBZ_B-VP atmosphere_NN_B-NP of_IN_B-PP warmth_NN_B-NP when_WRB_B-ADVP the_DT_B-NP scenes_NNS_I-NP take_VBP_B-VP place_NN_B-NP in_IN_B-PP hassidic_JJ_B-NP community_NN_I-NP ._._B-O
also_RB_B-ADVP ,_,_B-O the_DT_B-NP film_NN_I-NP might_MD_B-VP educate_VB_I-VP viewers_NNS_B-NP about_IN_B-PP hassidic_JJ_B-NP culture_NN_I-NP ._._B-O
that_DT_B-NP is_VBZ_B-VP the_DT_B-NP only_JJ_I-NP thing_NN_I-NP that_WDT_B-NP prevents_VBZ_B-VP it_PRP_B-NP from_IN_B-PP turning_VBG_B-VP into_IN_B-PP total_JJ_B-NP waste_NN_I-NP of_IN_B-PP time_NN_B-NP ._._B-O

Answer

You are being bitten by backslashes! The backslash is used as an escape character in Python strings (as in many other languages). For example, \n means "newline", \r means "carriage return", and \b means "backspace", i.e. \x08.

And you have \b in all of your expressions.

So when you write:

>>> pat1 = '...\b...'

You get:

>>> pat1
'...\x08...'
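
To see the effect in isolation, here is a minimal sketch using a made-up one-line sample in the same format as your file (the names sample, broken and fixed are only for illustration). The non-raw pattern asks the regex engine for a literal backspace character, which the text never contains, so the result is empty:

```python
import re

# Hypothetical one-line sample in the word_TAG_chunk format from the question.
sample = "important_JJ_I-NP films_NNS_I-NP"

broken = '\bimportant'   # '\b' is a single backspace character here
fixed = r'\bimportant'   # raw string: backslash + 'b', a regex word boundary

print(len('\b'))                    # 1
print(len(r'\b'))                   # 2
print(re.findall(broken, sample))   # []
print(re.findall(fixed, sample))    # ['important']
```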

There are two ways to fix this. You can escape each backslash with another backslash, like this:

>>> pat1 = '...\\b...'
>>> pat1
'...\\b...'

Note that you see \\ there because that is Python's repr of the string; if we print pat1 instead, we get:

>>> print pat1
...\b...

The easier way to fix that is to mark your regular expression strings as "raw strings":

The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for backslash escape sequences.

In other words:

pat1 = r'(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'
pat2 = r'(\S+?)_(?:RR|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat3 = r'(\S+?)_(?:JJ)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat4 = r'(\S+?)_(?:NN|NNS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)'
pat5 = r'(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b'
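
As a quick sanity check that the two fixes are interchangeable, this sketch compares the raw form of pat1 with the doubled-backslash form and runs it on a short excerpt of your sample data (the variable names raw, escaped and line are illustrative, not from the question):

```python
import re

# pat1 written both ways: raw string vs. every backslash doubled.
raw = r'(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b'
escaped = '(\\S+)_(?:JJ)_\\S+\\b(?:\\s+)(\\S+)_(?:NN|NNS)_\\S+\\b'

print(raw == escaped)   # True

# Short excerpt from the second line of the sample data.
line = "the_DT_B-NP most_RBS_I-NP important_JJ_I-NP films_NNS_I-NP of_IN_B-PP"
print(re.findall(raw, line))   # [('important', 'films')]
```

The other sequences in your patterns, such as \S and \s, were never affected, because Python does not recognise them as string escapes and leaves the backslash in place; only \b has a meaning at the string level. Recent Python 3 versions warn about such unrecognised escapes, which is one more reason to prefer raw strings for regular expressions.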

With that change in place, I get matches using your sample data:

>>> re.findall(pat1, data)
[('important', 'films'), ('previous', 'decades'), ('angry', 'men'), ('same', 'time'), ('such', 'pearls'), ("lumet's", 'reputation'), ("weir's", 'witness'), ('melanie', 'griffith'), ('tough', 'lady'), ('much', 'enthusiasm'), ('bad', 'guys'), ('new', 'york'), ('such', 'actions'), ('jamey', 'sheridan'), ('easy', 'task'), ('hassidic', 'jew'), ('modern', 'manners'), ('cabalistic', 'scholar'), ('eric', 'thal'), ("weir's", 'formula'), ('unispiring', 'crime'), ('certain', 'extent'), ("viewer's", 'audience'), ('terrible', 'miscasting'), ('melanie', 'griffith'), ('emily', 'eden')]
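
If you want to tidy up process_file while you are at it, one possible shape is to keep the raw patterns in a list and loop over them, so the five copy-pasted blocks become one. This is only a sketch: PATTERNS and extract_phrases are names I made up, the patterns are copied unchanged from above, and the usage line exercises just one excerpt from your sample data:

```python
import re

# The five patterns from the question as raw strings, compiled once.
PATTERNS = [re.compile(p) for p in (
    r'(\S+)_(?:JJ)_\S+\b(?:\s+)(\S+)_(?:NN|NNS)_\S+\b',
    r'(\S+?)_(?:RR|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)',
    r'(\S+?)_(?:JJ)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)',
    r'(\S+?)_(?:NN|NNS)_\S+\b(?:\s+)(\S+?)_(?:JJ)_\S+\b(?:\s+)(?!\S*?_(?:NN|NNS)_\S+\b)',
    r'(\S+?)_(?:RB|RBR|RBS)_\S+\b(?:\s+)(\S+?)_(?:VB|VBD|VBN|VBG)_\S+\b(?:\s+)\S*?_\S+?_\S+\b',
)]


def extract_phrases(lines):
    """Return 'word1 word2' for every match of every pattern, line by line."""
    phrases = []
    for line in lines:
        for pattern in PATTERNS:
            # findall returns one (word1, word2) tuple per match,
            # because each pattern has exactly two capturing groups.
            for match in pattern.findall(line):
                phrases.append('%s %s' % (match[0], match[1]))
    return phrases


# Short excerpt from the second line of the sample data.
sample = ["the_DT_B-NP most_RBS_I-NP important_JJ_I-NP films_NNS_I-NP of_IN_B-PP"]
print(extract_phrases(sample))   # ['important films']
```

The results come out in the same order as in your original function (all pat1 matches for a line, then pat2, and so on), so it should be a drop-in replacement for the loop body.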