Veronica Wenqian Cheng - 8 months ago
HTML Question

Is there a way to find the most common span style with BeautifulSoup in Python?

I need to process many PDFs with different styles, and I am assuming that the main content will be under the most common span style.

Is there a way to find the most frequently occurring span style with BeautifulSoup?

This is the call I use to find spans with one specific style, 'font-family: CBCDEE+ArialMT; font-size:12px':
spans = soup.find_all('span',attrs={'style': 'font-family: CBCDEE+ArialMT; font-size:12px'})

Is there any way to find the most common one? Or, more generally, is there a way to get the list of span styles and count how often each one appears?

Many thanks.

Answer Source

You could use a Python Counter (from the collections module) to count all of the different styles and then use most_common() to get the most frequent one, as follows:

from bs4 import BeautifulSoup
from collections import Counter

html = """
    <span style="font-family: CBCDEE+ArialMT; font-size:12px">1</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:14px">2</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:14px">3</span>
    <span style="font-family: CBCDEE+Arial; font-size:12px">4</span>
    <span style="font-family: CBCDEE+ArialMT; font-size:12px">5</span>"""

soup = BeautifulSoup(html, "html.parser")    
style_counts = Counter()

for span in soup.find_all('span', style=True):
    style_counts[span['style']] += 1

print(style_counts.most_common(1)[0][0])

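For this example it would display the following. Note that two styles tie at two occurrences each; on Python 3.7 and later, most_common() returns tied entries in the order they were first encountered, so the 12px ArialMT style is the one reported: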

font-family: CBCDEE+ArialMT; font-size:12px
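
To get the full list of styles with their counts, as asked in the question, you can call most_common() with no argument. One thing to watch for is that the counting above compares the raw attribute strings, so two styles that differ only in spacing, declaration order, or a trailing semicolon are counted separately. The following is a minimal sketch of one way to handle that, assuming Python 3. The helper names normalize_style and count_styles are made up for this illustration and are not part of BeautifulSoup, and the sample strings stand in for the values you would collect from span['style']. It splits on ';' naively, so it would not cope with values that themselves contain a semicolon.

```python
from collections import Counter


def normalize_style(style):
    """Return a canonical form of an inline CSS style string.

    Hypothetical helper: lowercases property names, trims whitespace,
    skips empty pieces, and sorts declarations by property name.
    """
    declarations = []
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed pieces such as a trailing ";"
        prop, value = declaration.split(":", 1)
        declarations.append((prop.strip().lower(), value.strip()))
    return "; ".join("{}: {}".format(p, v) for p, v in sorted(declarations))


def count_styles(styles):
    """Return (normalized_style, count) pairs, most frequent first."""
    return Counter(normalize_style(s) for s in styles).most_common()


# Stand-ins for values taken from span['style']. The last two differ from
# the first only in spacing, declaration order, and a trailing semicolon.
styles = [
    "font-family: CBCDEE+ArialMT; font-size:12px",
    "font-family: CBCDEE+ArialMT; font-size:14px",
    "font-family:CBCDEE+ArialMT;font-size:12px;",
    "font-size: 12px; font-family: CBCDEE+ArialMT",
]

for style, count in count_styles(styles):
    print(count, style)
```

With a real document you would pass something like (span['style'] for span in soup.find_all('span', style=True)) to count_styles instead of the hard-coded list.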