mobcity zkore - 4 months ago
Python Question

How to use the find_all method from BS4 to scrape certain strings

<li class="sre" data-tn-component="asdf-search-result" id="85e08291696a3726" itemscope="" itemtype="http://schema.org/puppies">
<div class="sre-entry">
<div class="sre-side-bar">
</div>
<div class="sre-content">
<div class="clickable_asdf_card" onclick="window.open('/r/85e08291696a3726?sp=0', '_blank')" style="cursor: pointer;" target="_blank">


I need to grab the string '/r/85e08291696a3726?sp=0' which occurs throughout a page. I'm not sure how to use the soup.find_all method to do this. The strings that I need always occur next to '

This is what I was thinking (below) but obviously I am getting the parameters wrong. How would I format the find_all method to return the '/r/85e08291696a3726?sp=0' strings throughout the page?

for divsec in soup.find_all('div', class_='clickable_asdf_card'):
    print('got links')
    x=x+1


I read the documentation for bs4 and I was thinking about using find_all('clickable_asdf_card') to find all occurrences of the string I need but then what? Is there a way to adjust the parameters to return the string I need?

Answer

Use BeautifulSoup's built-in regular expression search to find and extract the desired substring from an onclick attribute value:

import re

pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")
for item in soup.find_all(onclick=pattern):
    print(pattern.search(item["onclick"]).group(1))
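For anyone who wants to try this end to end, here is a minimal self-contained sketch. It assumes the page looks like the fragment in the question; the sample markup is trimmed down, the second card's id is made up for illustration, and the `links` list is just a name chosen here.

```python
import re

from bs4 import BeautifulSoup

# Sample markup modeled on the question's snippet.
# The second card (id of all zeros) is invented for illustration.
html = """
<li class="sre">
<div class="clickable_asdf_card" onclick="window.open('/r/85e08291696a3726?sp=0', '_blank')"></div>
</li>
<li class="sre">
<div class="clickable_asdf_card" onclick="window.open('/r/0000000000000000?sp=0', '_blank')"></div>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

# The capture group holds the first argument passed to window.open().
pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")

# find_all(onclick=pattern) keeps only tags whose onclick value matches the regex;
# the same pattern is then reused to pull the captured path out of each value.
links = [
    pattern.search(item["onclick"]).group(1)
    for item in soup.find_all(onclick=pattern)
]
print(links)
```

Collecting the paths in a list rather than printing them one by one makes it easy to count them with len(links) or pass them on to whatever fetches the pages.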

If there is just a single element you want to find, use find() instead of find_all().
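One thing to watch for with find(): it returns None when nothing matches, so it is worth checking the result before indexing into it. A rough sketch, where `first_link` is a hypothetical helper name and the sample markup is again modeled on the question:

```python
import re

from bs4 import BeautifulSoup

pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")


def first_link(soup):
    """Return the first window.open() path found in an onclick attribute, or None."""
    # find() returns the first matching tag, or None if there is no match.
    item = soup.find(onclick=pattern)
    if item is None:
        return None
    return pattern.search(item["onclick"]).group(1)


# Sample markup modeled on the question's snippet.
html = """<div class="clickable_asdf_card" onclick="window.open('/r/85e08291696a3726?sp=0', '_blank')"></div>"""
soup = BeautifulSoup(html, "html.parser")

print(first_link(soup))
print(first_link(BeautifulSoup("<p>no cards</p>", "html.parser")))
```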
