Furkanicus Furkanicus -4 years ago 83
Python Question

Beautifulsoup finding a specific value in meta tags

I'm trying to find all the meta tags that contain "author". It works if I pass a specific attribute name together with a regex for the value, but it doesn't work when both the name and the value are regexes. Is it possible to extract all the meta tags on the page that contain the keyword "author"?
This is the code I wrote.

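import re
import requests
from bs4 import BeautifulSoup

page = requests.get(url)  # url is the address of the page being scraped
contents = page.content
soup = BeautifulSoup(contents, 'lxml')
# Attempt: a regex for both the attribute name and its value (this does not work)
preys = soup.find_all("meta", attrs={re.compile('.*'): re.compile('author')})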


Edit:
For clarification, the specific problem I am trying to solve is finding the value "author" no matter which key it is mapped to. That key could be "itemprop", "name", or even "property", as I have seen in various examples. Basically, I want to pull all the meta tags that have "author" as a value, regardless of which key holds that value.
A few examples of such tags:

<meta content="Jami Miscik" name="citation_author"/>
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/>
<meta content="Alison Griswold" property="author"/>

Answer Source

If you're looking for citation_author or author, you can get by with a combination of soup.select() and a regular expression applied to each tag's string form:

from bs4 import BeautifulSoup
import re

# some test string
html = '''
<meta name="author" content="Anna Lyse">
<meta name="date" content="2010-05-15T08:49:37+02:00">
<meta itemprop="author" content="2010-05-15T08:49:37+02:00">
<meta rel="author" content="2010-05-15T08:49:37+02:00">
<meta content="Jami Miscik" name="citation_author"/>
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/>
<meta content="Alison Griswold" property="author"/>
'''

soup = BeautifulSoup(html, 'html5lib')

# match "author" or "citation_author" as a quoted attribute value (right after an "=")
rx = re.compile(r'(?<==)"(?:citation_)?author"')

authors = [author 
            for author in soup.select("meta")
            if rx.search(str(author))]

print(authors)
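
Matching against str(author) looks at the whole serialized tag, so it depends on how the tag is printed. As an alternative, here is a minimal sketch that inspects the attribute values directly by passing a function to find_all(). The helper name has_author_attr, the decision to skip the content attribute, and the use of the built-in 'html.parser' are assumptions made for illustration, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

html = '''
<meta name="author" content="Anna Lyse">
<meta name="date" content="2010-05-15T08:49:37+02:00">
<meta content="Jami Miscik" name="citation_author"/>
<meta content="Will Ripley, Joshua Berlinger and Allison Brennan, CNN" itemprop="author"/>
<meta content="Alison Griswold" property="author"/>
<meta name="description" content="An article about an author">
'''

AUTHOR_RX = re.compile(r'author', re.IGNORECASE)


def has_author_attr(tag):
    """Hypothetical helper: True for <meta> tags where any attribute
    other than content has a value mentioning 'author'."""
    if tag.name != "meta":
        return False
    for key, value in tag.attrs.items():
        if key == "content":
            # content holds the author's name, not the label, so skip it
            continue
        # multi-valued attributes come back as lists, so flatten them
        text = " ".join(value) if isinstance(value, list) else str(value)
        if AUTHOR_RX.search(text):
            return True
    return False


soup = BeautifulSoup(html, "html.parser")
authors = soup.find_all(has_author_attr)
author_names = [tag.get("content") for tag in authors]
print(author_names)
```

In this sketch the description tag is skipped even though its content mentions an author, because only the labelling attributes (name, itemprop, property, and so on) are checked.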