Tahmid Khan Nafee - 7 months ago
Python Question

Using a Python regular expression to find an image path

I have a variable like the one below:

var = '<img src="path_1"><p>Words</p><img src="path_2>'


It's a string, but it obviously contains HTML elements. How do I get only the first path (i.e. path_1) using a regex?

I am trying something like this:

match = re.match(r'src=\"[\w-]+\"', var)
print match.group(0)


I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'


Any help is appreciated.

Answer

You should use an HTML parser like BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> var = '<img src="path_1"><p>Words</p><img src="path_2>'
>>> soup = BeautifulSoup(var, "html.parser")
>>> soup.img["src"]
'path_1'
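
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class FirstImgSrc and the function first_img_src are hypothetical names made up for illustration, and the sample string has the closing quote on path_2 restored.

```python
from html.parser import HTMLParser


class FirstImgSrc(HTMLParser):
    """Hypothetical helper: remembers the first src attribute found on an <img> tag."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the first src seen
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_src(html):
    """Return the src of the first <img> that has one, or None."""
    parser = FirstImgSrc()
    parser.feed(html)
    parser.close()
    return parser.src


# Sample string from the question, with the closing quote on path_2 restored
var = '<img src="path_1"><p>Words</p><img src="path_2">'
print(first_img_src(var))
```

Unlike soup.img["src"], this sketch returns None instead of raising when the markup contains no image.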

As for the regex approach, the following changes are needed to make it work:

  • switch to re.search(); re.match() only matches at the beginning of the string, and the string starts with <img, not src
  • add a capturing group to capture the src value
  • there is no need to escape double quotes inside a single-quoted string

Fixed version:

>>> import re
>>> re.search(r'src="([\w-]+)"', var).group(1)
'path_1'
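
The difference between the two calls, and one way to avoid the AttributeError when nothing matches, can be sketched as follows. The helper first_src and the broader pattern [^"]+ are assumptions added for illustration, not part of the answer above. Note that [\w-]+ only accepts letters, digits, underscores and hyphens, so it would not match a realistic path containing "/" or ".".

```python
import re

var = '<img src="path_1"><p>Words</p><img src="path_2>'

# re.match() only tries position 0, and the string starts with "<img", not "src"
print(re.match(r'src="([\w-]+)"', var))             # None

# re.search() scans the string and returns the first match it finds
print(re.search(r'src="([\w-]+)"', var).group(1))   # path_1


def first_src(html):
    """Hypothetical helper: return the first src value, or None if there is none."""
    # [^"]+ is an assumed, broader pattern that also accepts "/" and "."
    found = re.search(r'src="([^"]+)"', html)
    return found.group(1) if found else None


print(first_src('<img src="images/logo.png">'))     # images/logo.png
print(first_src('<p>no image here</p>'))            # None
```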