Scrape Junkie - 1 year ago
Python Question

beautiful soup find_all, encompassing multiple class names

I am trying to use BeautifulSoup's find_all function inside a for loop to return either one of two td elements with different classes. The td elements sit inside an HTML div element. The for loop iterates over multiple divs, and each one holds one of the two td elements.

My goal is to grab the text from within the td elements but I am having trouble finding a way to make it so both td classes are acceptable for the find_all function.

I want to use one find_all to grab either of these td elements, whichever one is present within the current div element.

Sample HTML looks like this:

<td class='class1'>
text to scrape
</td>

<td class='class2'>
text to scrape
</td>

My code looks something like this:

for propbox in soup.find_all('div'):
    tester = propbox.find_all('td', {"class" : lambda A: A.contains("class1") or A.contains("class2")})

I am getting an error: AttributeError: 'NoneType' object has no attribute 'contains'

So I am assuming from this that when one td class is not present, Python is still trying to call .contains() on None, which it doesn't like.

Does anyone know of a way I can achieve this? Any help/examples are much appreciated. Thanks in advance

Answer Source

The function is called with each individual class value (a str) and then, if none of those calls matched, with the whole class attribute string. If the element has no class attribute, None is passed as the argument instead.
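
To see what the filter function actually receives, here is a small sketch that records every argument. The HTML sample and the helper name spy are made up for illustration, and the exact order and number of calls may vary between bs4 versions, so no output is shown:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one td with two classes, one with no class.
html = """
<table><tr>
  <td class="class1 extra">with classes</td>
  <td>no class attribute</td>
</tr></table>
"""

seen = []

def spy(value):
    # Record every argument the filter receives and never match anything,
    # so that bs4 goes on to try the whole class string as well.
    seen.append(value)
    return False

soup = BeautifulSoup(html, "html.parser")
soup.find_all("td", {"class": spy})
print(seen)
```

The recorded list should contain the individual class values, the joined string "class1 extra", and a None for the td without a class attribute.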

So you need to check for None.
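
A minimal sketch of such a None check, assuming a made-up HTML sample and a hypothetical helper named wanted. It keeps the substring-style test from the question but returns False when there is no class attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: each div holds a td with class1, class2, or no class.
html = """
<div><table><tr><td class="class1">first</td></tr></table></div>
<div><table><tr><td class="class2">second</td></tr></table></div>
<div><table><tr><td>no class</td></tr></table></div>
"""

def wanted(class_):
    # None is passed when the td has no class attribute,
    # so guard against it before doing the substring test.
    return class_ is not None and ("class1" in class_ or "class2" in class_)

soup = BeautifulSoup(html, "html.parser")
texts = []
for propbox in soup.find_all("div"):
    for td in propbox.find_all("td", {"class": wanted}):
        texts.append(td.get_text(strip=True))

print(texts)  # ['first', 'second']
```

Note that a substring test would also accept a class such as "class10"; use an equality or membership test if you need exact class names.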

Or, for your case, a simple in test should be enough:

for propbox in soup.find_all('div'):
    # None (no class attribute) is simply not in the tuple, so no error is raised
    tester = propbox.find_all('td', {
        "class": lambda class_: class_ in ("class1", "class2")
    })
    # print(tester)
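
For comparison, here is a sketch of two ways to accept either class without a lambda: passing a list to class_, and using a CSS selector with select. Neither is part of the code above, and the HTML sample is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: each div holds a td with class1, class2, or no class.
html = """
<div><table><tr><td class="class1">first</td></tr></table></div>
<div><table><tr><td class="class2">second</td></tr></table></div>
<div><table><tr><td>no class</td></tr></table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# A list value matches a td whose class is any one of the listed names.
by_list = [td.get_text(strip=True)
           for td in soup.find_all("td", class_=["class1", "class2"])]

# A CSS selector group expresses the same "either class" condition.
by_css = [td.get_text(strip=True)
          for td in soup.select("td.class1, td.class2")]

print(by_list)  # ['first', 'second']
print(by_css)   # ['first', 'second']
```

Both forms skip the td that has no class attribute, so no None check is needed.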

BTW, str has no contains method, only a __contains__ method (which the in membership test operator uses):

>>> 'haystack'.contains('needle')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'contains'
>>> 'haystack'.__contains__('needle')
False
>>> 'needle' in 'haystack'
False