Giacomo Bonvini - 4 months ago
Python Question

I have a list of strings (HTML source) and I want to extract all the emails from each string in the list

I have a list of strings:

urls = ["url1","url2","url3"]


in order to generate another list of strings:

for i in range (0,2):
    htmlist = [urllib.urlopen(url[i]).read() for i in range(0,2) ]


When I try to extract the emails from the texts htmlist[i] with this code:

for i in range (0,2) :
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', htmlist[i])
print emails


the code only prints the emails in htmlist[2].


Could you help me?
Thanks

Answer

That's because emails is overwritten on every pass, so after the loop it holds only the value from the last iteration (at htmlist[2]). Move the print statement into the for loop to see emails at each iteration:

for i in range (0, 3) :
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', htmlist[i])
    print emails
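
To see the difference without downloading anything, here is a minimal sketch in Python 3 syntax. The three HTML strings and the names per_page and last_only are made up for illustration; only the regex comes from the question.

```python
import re

# Hypothetical page contents standing in for the downloaded HTML.
htmlist = [
    "<p>Contact: alice@example.com</p>",
    "<p>Sales: bob@example.org, carol@example.org</p>",
    "<p>Support: dave@example.net</p>",
]

pattern = r'[\w\.-]+@[\w\.-]+'  # same pattern as in the question

per_page = []
for i in range(0, 3):
    emails = re.findall(pattern, htmlist[i])
    per_page.append(emails)  # record what this pass found
    print(emails)            # inside the loop: runs once per page

# After the loop, emails still refers to the last pass only.
last_only = emails
```

With the print inside the loop you get one list per page; last_only shows what a single print after the loop would have displayed.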

Moreover, the first snippet does not need the outer for loop, since the list comprehension already iterates over the indices. You only need to change the stop index to 3, so that you get htmlist[0], htmlist[1] and htmlist[2]:

htmlist = [urllib.urlopen(urls[i]).read() for i in range(0,3)]
#                                                          ^

Wrapping the comprehension in a for loop only rebuilds the same list on every pass, and htmlist ends up holding just the list built on the last pass. So the list comprehension alone is sufficient.
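
A rough way to check this, sketched in Python 3 syntax: fake_download and the calls list are hypothetical stand-ins for urllib.urlopen(...).read(), used here only to count how often each page would be fetched.

```python
calls = []

def fake_download(url):
    # Hypothetical stand-in for the real download; records each call.
    calls.append(url)
    return "<html>" + url + "</html>"

urls = ["url1", "url2", "url3"]

# Comprehension wrapped in a for loop: the whole list is rebuilt on every pass.
for i in range(0, 3):
    htmlist_looped = [fake_download(urls[i]) for i in range(0, 3)]
calls_with_loop = len(calls)

calls.clear()

# Comprehension alone: each URL is fetched exactly once.
htmlist = [fake_download(urls[i]) for i in range(0, 3)]
calls_without_loop = len(calls)

print(calls_with_loop, calls_without_loop)
```

Both versions end up with the same list, but the wrapped one does three times the work.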


You can also use a list comprehension to keep all the emails from each url in a list:

htmlist = [urllib.urlopen(urls[i]).read() for i in range(0,3)]

emails = [re.findall(r'[\w\.-]+@[\w\.-]+', htmlist[i]) for i in range(0,3)]
print emails
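
If you are on Python 3, note that urlopen lives in urllib.request and read() returns bytes, so the page has to be decoded before the regex is applied. Below is one possible sketch that also iterates over the URLs directly instead of over indices. The helper names fetch and emails_by_url, the UTF-8 decoding, and the fake_pages data are assumptions for illustration, not part of the question.

```python
import re
from urllib.request import urlopen  # Python 3 home of urlopen

EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')  # same pattern as above

def fetch(url):
    # Download one page and decode it; UTF-8 is an assumption here.
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def emails_by_url(urls, fetch_page=fetch):
    # Map each URL to the list of email-like strings found in its page.
    return {url: EMAIL_RE.findall(fetch_page(url)) for url in urls}

# Offline usage: a dict lookup stands in for the real download.
fake_pages = {
    "url1": "<a href='mailto:alice@example.com'>Alice</a>",
    "url2": "<p>no addresses here</p>",
}
result = emails_by_url(["url1", "url2"], fetch_page=fake_pages.get)
print(result)
```

Keying the results by URL keeps track of which page each address came from; calling emails_by_url(urls) without the second argument would use the real download.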