Mth Clv - 3 months ago
CSS Question

Python: BeautifulSoup: CSS rule to select elements only if they have two classes and share the same first one

I have these elements in the HTML I want to parse:

<td class="line"> GARBAGE </td>
<td class="line text"> I WANT THAT </td>
<td class="line heading"> I WANT THAT </td>
<td class="line"> GARBAGE </td>


How can I write a CSS selector that selects elements with class line plus some other class (heading, text, or anything else), BUT not elements whose only class is line?

I have tried:

td[class=line.*]
td.line.*
td[class^=line.]


EDIT

I am using Python and BeautifulSoup:

url = 'http://www.somewebsite'
res = requests.get(url)
res.raise_for_status()
DicoSoup = bs4.BeautifulSoup(res.text, "lxml")
elems = DicoSoup.select('body div#someid tr td.line')


I am looking to modify the last piece, namely td.line, to something like td.line.whateverotherclass (but not td.line alone, otherwise my current selector would already suffice).

Thank you for your tips,

Answer

What @BoltClock suggested is generally the correct way to approach the problem with CSS selectors. The only problem is that BeautifulSoup supports a limited set of CSS selectors. For instance, the :not() selector is not supported at the moment.

You can work around it with a "starts-with" attribute selector that checks whether the class attribute starts with line followed by a space (it is quite fragile, since it depends on line being listed first, but it works on your sample data):

for td in soup.select("td[class^='line ']"):
    print(td.get_text(strip=True))
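To see why I call the starts-with approach fragile, here is a minimal sketch that models both checks on plain class strings. It does not use BeautifulSoup at all; prefix_match and token_match are hypothetical helper names introduced only for this illustration.

```python
def prefix_match(class_attr):
    """Model of td[class^='line ']: the class value must begin with 'line '."""
    return class_attr.startswith("line ")


def token_match(class_attr):
    """Order-independent check: 'line' is present plus at least one other class."""
    classes = class_attr.split()
    return "line" in classes and len(classes) > 1


# "text line" carries the same two classes as "line text", only reordered.
for value in ["line", "line text", "text line"]:
    print(repr(value), prefix_match(value), token_match(value))
```

The prefix check misses "text line" because the match depends on the order in which the classes were written, while the token-based check does not.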

Or, you can solve it using find_all() with a search function that checks that the class attribute contains line plus at least one other class:

from bs4 import BeautifulSoup

data = """
<table>
    <tr>
        <td class="line"> GARBAGE </td>
        <td class="line text"> I WANT THAT </td>
        <td class="line heading"> I WANT THAT </td>
        <td class="line"> GARBAGE </td>
    </tr>
</table>"""
soup = BeautifulSoup(data, 'html.parser')

for td in soup.find_all(lambda tag: tag and tag.name == "td" and
                                    "class" in tag.attrs and "line" in tag["class"] and
                                    len(tag["class"]) > 1):
    print(td.get_text(strip=True))

Prints:

I WANT THAT
I WANT THAT
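
A note for readers on newer releases: as far as I know, BeautifulSoup 4.7.0 and later hand select() over to the soupsieve package, which does understand :not(). Assuming such a version is installed, a selector along these lines should work. The selector and the extra "text line" cell are my own illustration, not necessarily what @BoltClock proposed.

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
  <td class="line"> GARBAGE </td>
  <td class="line text"> I WANT THAT </td>
  <td class="text line"> I WANT THAT TOO </td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

# Keep td elements that carry the "line" class but whose class attribute
# is not exactly "line" (assumes select() is backed by soupsieve).
cells = soup.select('td.line:not([class="line"])')
texts = [td.get_text(strip=True) for td in cells]
print(texts)
```

Unlike the starts-with workaround, this does not depend on line being the first class in the attribute.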