Austin Piel - 1 year ago
HTML Question

Beautiful Soup: extracting picture url from webpage

I'm having trouble extracting a picture URL from a web page using Beautiful Soup. I'm fairly new to Beautiful Soup and would appreciate any feedback. Here is a snippet of the HTML I'm trying to extract the picture link from (specifically, the data-srcset URL in the <source> tag inside the <picture> element):

<div class="container-fluid" itemscope="" itemtype="http://schema.org/Product">

<div class="row">
<div id="js_carousel" class="col-xs-12 col-md-8">
<div id="psp-carousel" class="carousel_outer">
<div id="product-carousel" class="pdp-carousel carousel pdp-initial" style="display:block;">
<!-- Wrapper for slides -->
<div class="carousel-inner" id="carousel-inner" role="listbox">
<img class="product-image-placeholder" itemprop="image" alt="..." src="data:image/svg+xml;charset=utf-8,%3Csvg xmlns%3D'http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg' viewBox%3D'0 0 355 462'%3E %3Crect fill%3D'%23eee' width%3D'100%25' height%3D'100%25'%2F%3E%3C%2Fsvg%3E" width="355" height="462">
<picture class="item active" data-image="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of" role="option" aria-selected="true" tabindex="0">
<source media="(max-width: 767px)" data-srcset="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$" srcset="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$">


Any time I try to use the line
my_imgs = page_soup.findAll('picture',{'class':'item active'})

I get an empty list. I apologize if this is a dumb question, but any help would be appreciated.

Answer Source

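Have you tried the .select() method on your soup object? It takes a CSS selector, which is usually the most direct way to match elements by class. In this case, use page_soup.select('picture[class="item active"]') instead of .findAll(). That selector matches the class attribute as an exact string; page_soup.select('picture.item.active') is more forgiving, because it matches any picture element that carries both classes, in any order.

A note on naming: .findAll() is the Beautiful Soup 3 spelling. In Beautiful Soup 4 the method is .find_all(), and .findAll() is kept as an alias. .find() is current in both.

The documentation usually writes the call with an explicit attrs keyword, my_imgs = page_soup.find_all('picture', attrs={'class': 'item active'}). The attrs dictionary exists for attributes whose names can't be used as keyword arguments, such as data-srcset. The second positional argument of .find_all() is already attrs, though, so this is equivalent to your original my_imgs = page_soup.findAll('picture', {'class': 'item active'}). If both forms return an empty list, check that the HTML you actually downloaded contains the picture element, since it may be added by JavaScript after the page loads.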
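Here is a rough sketch of how that could look, assuming bs4 is installed. The markup is a trimmed copy of the snippet from the question; the closing tags and the "https:" scheme prefix are my own additions for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup. Closing tags were added here
# (an assumption) so the snippet stands on its own.
html = """
<div class="carousel-inner" id="carousel-inner" role="listbox">
  <picture class="item active" data-image="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of" role="option">
    <source media="(max-width: 767px)"
            data-srcset="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$"
            srcset="//s7d2.scene7.com/is/image/aeo/1162_8725_499_of?$pdp-main_small$">
  </picture>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# CSS selector: a <source> with a data-srcset attribute, nested inside a
# <picture> that carries both the "item" and "active" classes.
sources = page_soup.select("picture.item.active source[data-srcset]")

# The same <picture> can also be found with find_all; the dictionary is
# passed as the attrs argument whether or not the keyword is spelled out.
pictures = page_soup.find_all("picture", {"class": "item active"})

# The URLs are protocol-relative (they start with //), so a scheme is
# prepended here. "https:" is an assumption about how the site is served.
urls = ["https:" + source["data-srcset"] for source in sources]
print(urls)
```

If this works on the snippet but not on the live page, the difference is most likely in the HTML that was fetched rather than in the search call.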
