Bernardo - 7 months ago
Python Question

Python Regex over list of strings

I'm trying to extract a URL from each string in a list. Sample list:

import re
p = ['<img class="alignnone size-full wp-image-2087" src="http://www.sample.com/test.jpg" alt="0wCR41v" width="540" height="720" srcset="http://www.sample.com/test-225x300.jpg 225w, http://www.sample.com/test.jpg 540w" sizes="(max-width: 540px) 100vw, 540px" />', '<img class="alignnone size-large wp-image-2133" src="http://www.sample.com/test2.jpg" alt="NtAboHF" width="583" height="1024" srcset="http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF-768x1349.jpg 768w, http://www.sample.com/test2.jpg 583w, http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF.jpg 828w" sizes="(max-width: 583px) 100vw, 583px" />']


I'd like to extract the http://www.sample.com/test.jpg part that comes right after the src=" part.

I can use findall if p is just one string like so:

t = re.findall('src="(.+)" alt', p)
print t


But how can I iterate over the list and return a list of all the URLs in p?

AKS
Answer

This is a solution using BeautifulSoup:

>>> p = ['<img class="alignnone size-full wp-image-2087" src="http://www.sample.com/test.jpg" alt="0wCR41v" width="540" height="720" srcset="http://www.sample.com/test-225x300.jpg 225w, http://www.sample.com/test.jpg 540w" sizes="(max-width: 540px) 100vw, 540px" />', '<img class="alignnone size-large wp-image-2133" src="http://www.sample.com/test2.jpg" alt="NtAboHF" width="583" height="1024" srcset="http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF-768x1349.jpg 768w, http://www.sample.com/test2.jpg 583w, http://www.happyfridaygents.com/wp-content/uploads/2016/04/NtAboHF.jpg 828w" sizes="(max-width: 583px) 100vw, 583px" />']

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(''.join(p), 'html.parser')
>>> src_links = [img['src'] for img in soup.find_all('img')]

>>> src_links
[u'http://www.sample.com/test.jpg', u'http://www.sample.com/test2.jpg']
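
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a different technique from the one above, shown only as an illustration; it assumes Python 3, and the names ImgSrcCollector and collect_src are made up for this example:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />, because the
        # default handle_startendtag delegates to handle_starttag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def collect_src(fragments):
    """Return the src values of all <img> tags found in a list of HTML strings."""
    parser = ImgSrcCollector()
    for fragment in fragments:
        parser.feed(fragment)
    parser.close()
    return parser.sources
```

Calling collect_src(p) on the list from the question should give the two src URLs, and the srcset attributes are ignored because only the attribute named src is collected.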

If you do want to use regex:

>>> import re
>>> regex = re.compile(r'src="(.+)" alt')
>>> [regex.search(img).group(1) for img in p]
['http://www.sample.com/test.jpg', 'http://www.sample.com/test2.jpg']
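
Note that the pattern above relies on alt coming directly after src, and regex.search(img) returns None for a string with no match, so .group(1) would raise an AttributeError. A slightly more defensive variant might look like the sketch below. The helper name extract_src and the shortened sample list are assumptions made for this illustration, not part of the original question:

```python
import re

# [^"]+ stops at the first closing quote, so the match does not depend on
# which attribute follows src. The leading \s requires whitespace before
# "src", and srcset="..." is not matched because src must be followed by =".
SRC_RE = re.compile(r'\ssrc="([^"]+)"')


def extract_src(strings):
    """Return the first src URL from each string, skipping strings without one."""
    urls = []
    for s in strings:
        match = SRC_RE.search(s)
        if match:  # search() returns None when nothing matches
            urls.append(match.group(1))
    return urls


# Hypothetical sample, shortened from the list in the question.
tags = [
    '<img class="a" src="http://www.sample.com/test.jpg" alt="x" '
    'srcset="http://www.sample.com/test-225x300.jpg 225w" />',
    '<img class="b" src="http://www.sample.com/test2.jpg" alt="y" />',
    '<p>no image here</p>',
]

print(extract_src(tags))
# ['http://www.sample.com/test.jpg', 'http://www.sample.com/test2.jpg']
```

For arbitrary HTML a real parser is still the safer choice; a regex like this is only reasonable when the input format is known and regular, as in the sample list.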