Mth Clv - 3 months ago
Python Question

Python, bs4: Tags in inspection are nowhere to be found when parsing

I ran into an unexpected problem. I am using Python 3.5 and BeautifulSoup.
I want to parse the page at the following link:

import requests, bs4

url = 'https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s'
res = requests.get(url)
res.raise_for_status()
DicoSoup = bs4.BeautifulSoup(res.text, "lxml")


I am interested in retrieving the links to the pictures in the offer.
When I inspect the HTML of the page, I find them as img tags inside span tags with class 'item_imagePic', which in turn sit under the div tag with class 'thumbnails'.

However, when I select the div tag, the span tags are nowhere to be found:

div = DicoSoup.select("div.thumbnails")

div
Out[54]:
[<div class="thumbnails" data-alt="Talons aiguilles Stéphane Kélian - 37.5">
<ul>
<li class="thumb selected trackable" data-info='{"event_name" : "ad_view::photos", "event_type" : "click", "click_type" : "N", "event_s2" : "2"}' id="thumb_0"></li>
<li class="thumb trackable" data-info='{"event_name" : "ad_view::photos", "event_type" : "click", "click_type" : "N", "event_s2" : "2"}' id="thumb_1"> </li>
<li class="thumb trackable" data-info='{"event_name" : "ad_view::photos", "event_type" : "click", "click_type" : "N", "event_s2" : "2"}' id="thumb_2"></li>
</ul>
</div>]


When I inspect the HTML in the browser, here is what I see:

<div class="thumbnails" data-alt="Talons aiguilles Stéphane Kélian - 37.5" style="width: 596px;">
<ul style="">

<li id="thumb_0" class="thumb selected trackable" data-info="{&quot;event_name&quot; : &quot;ad_view::photos&quot;, &quot;event_type&quot; : &quot;click&quot;, &quot;click_type&quot; : &quot;N&quot;, &quot;event_s2&quot; : &quot;2&quot;}"><span class="item_imagePic"><img src="//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg" alt="Talons aiguilles Stéphane Kélian - 37.5"></span></li>

<li id="thumb_1" class="thumb trackable" data-info="{&quot;event_name&quot; : &quot;ad_view::photos&quot;, &quot;event_type&quot; : &quot;click&quot;, &quot;click_type&quot; : &quot;N&quot;, &quot;event_s2&quot; : &quot;2&quot;}"><span class="item_imagePic"><img src="//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg" alt="Talons aiguilles Stéphane Kélian - 37.5"></span></li>

<li id="thumb_2" class="thumb trackable" data-info="{&quot;event_name&quot; : &quot;ad_view::photos&quot;, &quot;event_type&quot; : &quot;click&quot;, &quot;click_type&quot; : &quot;N&quot;, &quot;event_s2&quot; : &quot;2&quot;}"><span class="item_imagePic"><img src="//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg" alt="Talons aiguilles Stéphane Kélian - 37.5"></span></li>

</ul>
</div>


How is it possible?
What do I need to do to select them?

I have tried:

div = DicoSoup.select_one("div.thumbnails span.item_imagePic")
div = DicoSoup.select_one("div.thumbnails ul li span.item_imagePic")
div = DicoSoup.select("div.thumbnails ul li span.item_imagePic")
span = DicoSoup.find('span', {'class': 'item_imagePic'})
span = DicoSoup.find('span',id="thumb_0")
div = DicoSoup.select("div.thumbnails img")
div = DicoSoup.select("div.thumbnails span img")
div = DicoSoup.select("div.thumbnails ul li span.item_imagePic img")


None of them finds the tags: find and select_one return None, and select returns an empty list.

Thanks,

Answer

As I commented, the thumbnails are generated dynamically with JavaScript, so they are not in the HTML that requests downloads. You can still get the script that follows the div and parse the paths out of it:

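import requests
from bs4 import BeautifulSoup

# Name the parser explicitly; "lxml" is the one used in the question.
soup = BeautifulSoup(requests.get("https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s").content, "lxml")

# The first <script> after the thumbnails div holds the image paths.
script = soup.select_one("div.thumbnails").find_next("script")
print(script.text.strip())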

That gives you:

var images = new Array(), images_thumbs = new Array();
                        images_thumbs[0] = "//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg"; 
              images[0] = "//img0.leboncoin.fr/images/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg";

                        images_thumbs[1] = "//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg"; 
              images[1] = "//img1.leboncoin.fr/images/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg";

                        images_thumbs[2] = "//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg"; 
              images[2] = "//img2.leboncoin.fr/images/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg";
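A quick way to confirm that tags seen in the browser inspector were added by JavaScript is to search the raw response text for them. The sketch below is only an illustration: it runs on a shortened, hand-made sample of the static HTML rather than a live request, the sample layout is an assumption, and missing_from_static_html is a hypothetical helper name, not part of requests or bs4.

```python
# Sketch: find which markers from the browser inspector are absent
# from the HTML that is actually downloaded.
# `missing_from_static_html` is a hypothetical helper, not a library function.

def missing_from_static_html(html, markers):
    """Return the markers that do not occur in the raw HTML text."""
    return [m for m in markers if m not in html]


# Shortened sample of the static HTML (assumed layout): the <li> tags are
# empty and the paths only appear inside the script.
static_html = '''
<div class="thumbnails" data-alt="Talons aiguilles">
<ul>
<li class="thumb selected trackable" id="thumb_0"></li>
<li class="thumb trackable" id="thumb_1"> </li>
</ul>
</div>
<script>
var images = new Array(), images_thumbs = new Array();
images_thumbs[0] = "//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg";
</script>
'''

missing = missing_from_static_html(
    static_html, ["item_imagePic", "<img", "images_thumbs"]
)
print(missing)  # ['item_imagePic', '<img']
```

With a live page you would pass res.text instead of the sample string. Anything reported as missing was inserted after the page loaded, so no selector can find it in the parsed response.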

To get the image links:

import re

soup = BeautifulSoup(requests.get("https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s").content, "lxml")

script = soup.select_one("div.thumbnails").find_next("script").text

# Raw string so that \[ and \d are regex escapes, not string escapes.
print(re.findall(r'images_thumbs\[\d+\]\s+=\s+"(.*?)";', script))

Or just split the script into lines and strip each one:

[s.split("=", 1)[1].strip('"; ') for s in script.splitlines() if s.strip().startswith("images_thumbs")]

Both give you:

[u'//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg', u'//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg', u'//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg']
[u'//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg', u'//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg', u'//img2.leboncoin.fr/thumbs/288/28865002bb34bad516574bd1e9b42d2a2bb928f2.jpg']
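The script defines two arrays, images_thumbs for the thumbnails and images for the full-size pictures. If both are wanted, one pass with a slightly wider regex can collect them. This is a minimal sketch on a shortened copy of the script text shown earlier; ASSIGNMENT and collect_paths are hypothetical names introduced here for illustration.

```python
import re

# Shortened copy of the script text (two entries kept for brevity).
script = '''
var images = new Array(), images_thumbs = new Array();
    images_thumbs[0] = "//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg";
    images[0] = "//img0.leboncoin.fr/images/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg";

    images_thumbs[1] = "//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg";
    images[1] = "//img1.leboncoin.fr/images/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg";
'''

# Capture the array name, the index, and the quoted path of each assignment.
ASSIGNMENT = re.compile(r'\b(images_thumbs|images)\[(\d+)\]\s*=\s*"(.*?)";')


def collect_paths(script_text):
    """Group paths by array name, ordered by their index in the script."""
    found = {"images_thumbs": {}, "images": {}}
    for name, index, path in ASSIGNMENT.findall(script_text):
        found[name][int(index)] = path
    return {name: [by_index[i] for i in sorted(by_index)]
            for name, by_index in found.items()}


paths = collect_paths(script)
print(paths["images"])
```

On a live page, script would be the text of the script tag found with find_next("script") rather than the sample string.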

Finally, all you need to do is prepend the scheme, which is https. The paths already start with "//", so only "https:" has to be added:

["https:" + path for path in re.findall(r'images_thumbs\[\d+\]\s+=\s+"(.*?)";', script)]
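Because the extracted paths begin with "//", an alternative to string concatenation is to let the standard library resolve them against the page URL. The sketch below assumes Python 3 (urllib.parse) and uses two of the paths shown earlier as sample input; page_url and full_urls are names chosen here for illustration.

```python
from urllib.parse import urljoin

page_url = "https://www.leboncoin.fr/chaussures/627533472.htm?ca=16_s"

# Sample paths as extracted from the script (scheme-relative, start with //).
paths = [
    "//img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg",
    "//img1.leboncoin.fr/thumbs/7d9/7d9b62d9efd2187472dc16ca2794be1bbaeb1370.jpg",
]

# urljoin keeps the host and path of a "//host/path" reference and
# borrows only the scheme from the page URL.
full_urls = [urljoin(page_url, path) for path in paths]
print(full_urls[0])
# https://img0.leboncoin.fr/thumbs/d89/d89c778e852e4a175d5d1ba96b2ec9c220445732.jpg
```

This also keeps working if the page is ever served over a different scheme, since nothing is hard-coded.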