Aaron Aaron - 6 months ago 28
Python Question

Find phone numbers in python script

the following python script allows me to scrape email addresses from a given file using regular expressions.

How could I add to this so that I can also get phone numbers? Say, if it was either the 7 digit or 10 digit (with area code), and also account for parenthesis?

My current script can be found below:

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'

# read the file
if os.path.exists(filename):
data = open(filename,'r')
bulkemails = data.read()
else:
print "File not found."
raise SystemExit

# regex = something@whatever.xxx
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
emails += str(x)+"\n"

# function to write file
def writefile():
f = open(newfilename, 'w')
f.write(emails)
f.close()
print "File written."


Regex for phone numbers:

(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})


Another regex for phone numbers:

(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?

Answer

If you are interested in learning Regex, you could take a stab at writing it yourself. It's not quite as hard as it's made out to be. Sites like RegexPal allow you to enter some test data, then write and test a Regular Expression against that data. Using RegexPal, try adding some phone numbers in the various formats you expect to find them (with brackets, area codes, etc), grab a Regex cheatsheet and see how far you can get. If nothing else, it will help in reading other peoples Expressions.

Edit: Here is a modified version of your Regex, which should also match 7 and 10-digit phone numbers that lack any hyphens, spaces or dots. I added question marks after the character classes (the []s), which makes anything within them optional. I tested it in RegexPal, but as I'm still learning Regex, I'm not sure that it's perfect. Give it a try.

(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})

It matched the following values in RegexPal:

000-000-0000
000 000 0000
000.000.0000

(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000

000-0000
000 0000
000.0000

0000000
0000000000
(000)0000000