SergeySD - 4 months ago
Python Question

Regular expression for class with whitespace using BeautifulSoup

I found that BeautifulSoup.find() splits the class attribute on whitespace.
Because of that, I can't use a regular expression as shown in the code below.
Could somebody help me find the right way to get all the 'tree children' elements?

import re
from bs4 import BeautifulSoup

r_html = "<div class='root'>" \
"<div class='tree children1'>text children 1 </div>" \
"<div class='tree children2'>text children 2 </div>" \
"<div class='tree children3'>text children 3 </div>" \
"</div>"

bs_tab = BeautifulSoup(r_html, "html.parser")
workspace_box_visible = bs_tab.findAll('div', {'class':'tree children1'})
print(workspace_box_visible) # result: [<div class="tree children1">text children 1 </div>]
workspace_box_visible = bs_tab.findAll('div', {'class':re.compile(r'^tree children\d')})
print(workspace_box_visible) # result: [] >>>> empty list because
# the class name was split on the whitespace character <<<<

# >>>>>> print all element classes <<<<<<<
def print_class(class_):
    print(class_)
    return False

workspace_box_visible = bs_tab.find('div', {'class': print_class})

# expected:
# root
# tree children1
# tree children2
# tree children3

# actual:
# root
# tree
# children1
# tree
# children2
# tree
# children3


Thanks in advance,

==== comments ==========

Stack Overflow doesn't allow comments longer than 500 characters,
so I've added my comments here:

The code above was an example to show how BeautifulSoup looks for the required classes.

But if I have a DOM structure like:

r_html = "<div class='root'>" \
"<div class='tree children'>zero</div>" \
"<div class='tree children first'>first</div>" \
"<div class='tree children second'>second</div>" \
"<div class='tree children third'>third</div>" \
"</div>"


and I need to select the controls whose class attributes are 'tree children' and 'tree children first',
none of the methods described in your (Padraic Cunningham's) post work.

I found a solution using a regex:

controls = bs_tab.findAll('div')
for control in controls:
    # Join the class list back into one string; the $ anchor keeps
    # 'tree children second' and 'tree children third' from matching.
    if re.search(r"^tree children( first)?$", " ".join(control.get('class', []))):
        print(control)


I know it's not a good solution, and I hope that the BeautifulSoup module has an appropriate method for this.

Answer

There are a few different ways, depending on the structure of the HTML. They are CSS classes, so you could just use class_=... or a CSS selector with .select:

In [3]: bs_tab.find_all('div', class_="tree")
Out[3]: 
[<div class="tree children1">text children 1 </div>,
 <div class="tree children2">text children 2 </div>,
 <div class="tree children3">text children 3 </div>]

In [4]: bs_tab.select("div.tree")
Out[4]: 
[<div class="tree children1">text children 1 </div>,
 <div class="tree children2">text children 2 </div>,
 <div class="tree children3">text children 3 </div>]

But if there were other elements with the tree class elsewhere, these would find them as well.

You could use a selector to find divs that contain children in the class:

In [5]: bs_tab.select("div[class*=children]") 
Out[5]: 
[<div class="tree children1">text children 1 </div>,
 <div class="tree children2">text children 2 </div>,
 <div class="tree children3">text children 3 </div>]

But again, if there were other tags whose class contained children, they would also be picked up.

You could be a bit more specific with a regex and look for children followed by one or more digits:

In [6]: bs_tab.find_all('div', class_=re.compile(r"children\d+"))
Out[6]: 
[<div class="tree children1">text children 1 </div>,
 <div class="tree children2">text children 2 </div>,
 <div class="tree children3">text children 3 </div>]
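
The reason this pattern finds the divs while '^tree children\d' returns nothing appears to be that the filter is tested against each class name separately, as the print_class output in the question suggests. A minimal sketch of that behaviour using only the re module (the variable names are just for illustration, and split() stands in for what BeautifulSoup does with the class attribute):

```python
import re

# The pattern from the question and the pattern from this answer.
pattern_whole = re.compile(r"^tree children\d")
pattern_token = re.compile(r"children\d+")

class_attr = "tree children1"
# Assumption: the filter sees each whitespace-separated name on its own.
tokens = class_attr.split()  # ['tree', 'children1']

whole_hits = [t for t in tokens if pattern_whole.search(t)]
token_hits = [t for t in tokens if pattern_token.search(t)]

print(whole_hits)  # []
print(token_hits)  # ['children1']
```

Neither 'tree' nor 'children1' alone contains the text "tree children", so a pattern that spans the space has nothing to match against a single name.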

Or find all the div.tree elements and check whether the last name in tag["class"] starts with children:

In [7]: [t for t in bs_tab.select("div.tree") if t["class"][-1].startswith("children")]
Out[7]: 
[<div class="tree children1">text children 1 </div>,
 <div class="tree children2">text children 2 </div>,
 <div class="tree children3">text children 3 </div>]

Or look for children and check whether the first CSS class name is equal to tree:

In [8]: [t for t in bs_tab.select("div[class*=children]") if t["class"][0] == "tree"]
Out[8]: 
[<div class="tree children1">text children 1 </div>,
 <div class="tree children2">text children 2 </div>,
 <div class="tree children3">text children 3 </div>]
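
For the markup in the question's follow-up, where only 'tree children' and 'tree children first' are wanted, one option is to pass a function to find_all and compare the whole class list. This is only a sketch: WANTED and has_wanted_classes are names made up here, and it assumes the classes always appear in the same order as in that markup:

```python
from bs4 import BeautifulSoup

# Markup copied from the follow-up in the question.
html = (
    "<div class='root'>"
    "<div class='tree children'>zero</div>"
    "<div class='tree children first'>first</div>"
    "<div class='tree children second'>second</div>"
    "<div class='tree children third'>third</div>"
    "</div>"
)

# Hypothetical name: the exact class lists we want to accept.
WANTED = [["tree", "children"], ["tree", "children", "first"]]


def has_wanted_classes(tag):
    # tag.get("class") is a list of class names, or None if there is no class.
    return tag.name == "div" and tag.get("class") in WANTED


soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(has_wanted_classes)
texts = [tag.get_text() for tag in matches]
print(texts)  # ['zero', 'first']
```

Because the comparison is on lists, a div written as class='children tree' would not match; comparing sets of class names instead would make the check independent of order.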