user1893148 - 1 month ago
Python Question

Extract email sub-strings from large document

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:

...<name@domain.com>...


What is the best way to have Python cycle through the entire .txt file looking for all instances of a certain @domain string, grab the entire address within the <...>, and add it to a list? The trouble I have is with the variable length of the different addresses.

Answer

This code extracts the first email address found in a string. Use it while reading the file line by line:

>>> import re
>>> line = "why people don't know what regex are? let me know 321dsasdsa@dasdsa.com.lol"
>>> match = re.search(r'[\w\.-]+@[\w\.-]+', line)
>>> match.group(0)
'321dsasdsa@dasdsa.com.lol'

If a line contains several email addresses, use findall:

>>> line = "why people don't know what regex are? let me know 321dsasdsa@dasdsa.com.lol   dssdadsa dadaads@dsdds.com"
>>> match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
>>> match
['321dsasdsa@dasdsa.com.lol', 'dadaads@dsdds.com']
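
To tie this back to the question, here is a minimal sketch that scans text line by line and keeps only the addresses wrapped in <...> whose domain matches. The function name extract_addresses, the domain "domain.com", and the file name "emails.txt" in the usage note are illustrative assumptions, not part of the original answer.

```python
import re

def extract_addresses(lines, domain):
    """Collect addresses of the form <name@domain> from an iterable of lines.

    `domain` is the (hypothetical) target domain, e.g. "domain.com".
    """
    # Capture the text between '<' and '>' that ends in @domain.
    # [^<>@\s]+ copes with the variable length of the local part.
    pattern = re.compile(r'<([^<>@\s]+@' + re.escape(domain) + r')>')
    found = []
    for line in lines:
        # With one capture group, findall returns just the captured address.
        found.extend(pattern.findall(line))
    return found

sample = [
    "foo <alice@domain.com> bar <bob@other.org>",
    "baz <carol.smith@domain.com>",
]
result = extract_addresses(sample, "domain.com")
print(result)
```

For a real file, pass the open file object so lines are read lazily instead of loading everything into memory, for example: with open("emails.txt") as f: addresses = extract_addresses(f, "domain.com").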

The regex above probably matches most common, real-world email addresses. If you want to comply fully with RFC 5322, consult the specification itself, which defines exactly which patterns are allowed in an email address. Check it to avoid bugs when matching email addresses.
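
One concrete way the simple pattern can over-match: because '.' is in the character class, punctuation that directly follows an address is included in the match. A small sketch, assuming that stripping trailing dots and hyphens is acceptable for your data (the helper name find_emails is hypothetical):

```python
import re

EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')

def find_emails(text):
    # Strip a trailing '.' or '-' that the character class swallows
    # when an address ends a sentence.
    return [m.rstrip('.-') for m in EMAIL_RE.findall(text)]

raw = EMAIL_RE.findall("Write to bob@example.com. Thanks")
cleaned = find_emails("Write to bob@example.com. Thanks")
print(raw)      # the sentence-ending dot is part of the match
print(cleaned)  # the dot has been stripped
```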