PIMg021 - 10 months ago
Python Question

REGEX extracting specific part non greedy

I'm new to Python 2.7. I'm trying to use regular expressions to extract just the email addresses from the lines of a text file. I'm using a non-greedy quantifier because each address appears twice on the same line. Here is my code:

import re
f_hand = open('mail.txt')
for line in f_hand:
    if re.findall('\S+@\S+?',line): print re.findall('\S+@\S+?',line)

However, this is what I'm getting instead of just the email address:


What shall I use in
to get just the email out?

Answer Source

If you are parsing a simple file in which the email addresses appear in anchor tags that always follow the same syntax (for example, attributes always enclosed in double quotes), you can use:

for line in f_hand: 
    print re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>', line)

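(Because the pattern contains exactly one capture group, re.findall returns only the text captured by that group rather than the whole match. \1 is a backreference: it matches the same text that the first capture group matched.)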
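
To make the difference concrete, here is a minimal sketch that runs both patterns on made-up input. The sample strings are assumptions, since the contents of mail.txt are not shown. It first shows that a lazy quantifier at the very end of a pattern stops after a single character, then that the anchored pattern returns only the captured address. print(...) with a single argument is used so the snippet runs on both Python 2.7 and Python 3.

```python
import re

# Made-up sample line (an assumption; the asker's mail.txt is not shown).
line = ('<a href="mailto:alice@example.com">alice@example.com</a> and '
        '<a href="mailto:bob@example.org">bob@example.org</a>')

# The question's pattern: the trailing lazy \S+? is satisfied by one character,
# so the match ends right after the first character following the @.
lazy = re.findall(r'\S+@\S+?', 'from alice@example.com today')

# The anchored pattern: one capture group, so findall returns only that group.
# \1 requires the link text to repeat the captured address.
emails = re.findall(r'href="mailto:([^"@]+@[^"]+)">\1</a>', line)

print(lazy)    # ['alice@e']
print(emails)  # ['alice@example.com', 'bob@example.org']
```

A lazy quantifier only helps when something after it in the pattern forces it to extend; here the literal `">` and the backreference do that job instead.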

If the file is a more complicated HTML file, use a parser, extract the links, and filter them.
Alternatively, use XPath, something like:
substring-after(//a/@href[starts-with(., "mailto:")], "mailto:")
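
The parser route mentioned above could look roughly like the following sketch, which uses only the standard library's HTMLParser. The class name MailtoCollector and the sample markup are hypothetical and chosen for illustration; the import fallback is there so the same code runs on Python 2.7 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class MailtoCollector(HTMLParser):
    """Hypothetical helper: collects addresses from <a href="mailto:..."> links."""

    def __init__(self):
        # Explicit base call, because HTMLParser is an old-style class on 2.7.
        HTMLParser.__init__(self)
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and value.startswith('mailto:'):
                # Drop the scheme and any ?subject=... query part.
                address = value[len('mailto:'):].split('?', 1)[0]
                self.emails.append(address)


# Made-up sample markup (an assumption; the asker's file is not shown).
sample = ('<p><a class="c" href="mailto:alice@example.com">Alice</a> and '
          "<a href='mailto:bob@example.org?subject=Hi'>Bob</a></p>")

collector = MailtoCollector()
collector.feed(sample)
collector.close()
print(collector.emails)  # ['alice@example.com', 'bob@example.org']
```

Unlike the regular expression, this does not depend on attribute order or quote style, which is the main reason to prefer a parser for less regular HTML.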