satch_boogie - 6 months ago
HTML Question

xpath to extract link or hrefs

I am trying to use XPath to extract the links to similar apps from this Google Play Store page:

https://play.google.com/store/apps/details?id=com.mojang.minecraftpe


The links I want to extract are marked in green in this screenshot:
[image: screenshot of the links, marked in green]

HTML sample

<div class="details">
<a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a>
<a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run
<span class="paragraph-end"/>
</a>
<div>....</div>
<div>....</div>
</div>


I used the XPath below in the Chrome console to locate a single link, but it does not return the href attribute of the tag. It does work for other attributes (for example, "title").

The XPath below does not work (extracting "href"):

//*[@id="body-content"]/div/div/div[2]/div[1]//*/a[2]/@href


The XPath below works (extracting "title"):

//*[@id="body-content"]/div/div/div[2]/div[1]//*/a[2]/@title



Python code

Answer

The HTML of each tile on the right-hand side of the linked page has the following form *:

<div class="details"> 
  <a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a>  
  <a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run 
    <span class="paragraph-end"/> 
  </a>  
  <div>....</div>  
  <div>....</div> 
</div>

It turns out that class="title" uniquely identifies the target <a> elements on that page, so the XPath can be as simple as:

//a[@class="title"]/@href
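To see why the class filter is enough, here is a minimal offline sketch that runs the same idea against the stripped-down sample above. It assumes Python 3 and uses the standard library's xml.etree.ElementTree instead of a full XPath engine. ElementTree supports only a subset of XPath and cannot select attribute nodes with /@href, so the attribute is read with .get("href") after the elements are matched. It also works here only because the sample happens to be well-formed XML; a real page would need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Stripped-down tile markup copied from the sample above.
SAMPLE = """
<div class="details">
  <a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a>
  <a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run
    <span class="paragraph-end"/>
  </a>
  <div>....</div>
  <div>....</div>
</div>
"""

root = ET.fromstring(SAMPLE.strip())

# Every <a> in the tile: both the click target and the title link.
all_links = root.findall(".//a")

# Only the <a> elements whose class attribute is exactly "title".
title_links = root.findall(".//a[@class='title']")

# ElementTree cannot evaluate ".../@href", so read the attribute directly.
hrefs = [a.get("href") for a in title_links]

print(len(all_links), "anchors in the tile,", len(title_links), "with class='title'")
print(hrefs)
```

The tile contains two anchors pointing at the same app, and the class predicate keeps only one of them, so each app is reported once.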

That said, the problem you noticed seems to be specific to Chrome's XPath evaluator **. Since you mentioned Python, a short Python session shows that the XPath works fine:

>>> from urllib2 import urlopen
>>> from lxml import html
>>> req = urlopen('https://play.google.com/store/apps/details?id=com.mojang.minecraftpe')
>>> raw = req.read()
>>> root = html.fromstring(raw)
>>> [h for h in root.xpath("//a[@class='title']/@href")]
['/store/apps/details?id=com.imangi.templerun', '/store/apps/details?id=com.lego.superheroes.dccomicsteamup', '/store/apps/details?id=com.turner.freefurall', '/store/apps/details?id=com.mtvn.Nickelodeon.GameOn', '/store/apps/details?id=com.disney.disneycrossyroad_goo', '/store/apps/details?id=com.rovio.angrybirdsstarwars.ads.iap', '/store/apps/details?id=com.rovio.angrybirdstransformers', '/store/apps/details?id=com.disney.dinostampede_goo', '/store/apps/details?id=com.turner.atskisafari', '/store/apps/details?id=com.moose.shopville', '/store/apps/details?id=com.DisneyDigitalBooks.SevenDMineTrain', '/store/apps/details?id=com.turner.copatoon', '/store/apps/details?id=com.turner.wbb2016', '/store/apps/details?id=com.tov.google.ben10Xenodrome', '/store/apps/details?id=com.turner.ggl.gumballrainbowruckus', '/store/apps/details?id=com.lego.starwars.theyodachronicles', '/store/apps/details?id=com.mojang.scrolls']
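The hrefs in the list above are relative to the site root. If absolute URLs are needed, one possible follow-up step is to resolve them against the page URL. The sketch below assumes Python 3 and uses only urllib.parse.urljoin from the standard library; the names PAGE_URL, relative_hrefs and absolute_urls are illustrative, and the two hrefs are copied from the output above.

```python
from urllib.parse import urljoin

# URL of the page the hrefs were extracted from (taken from the question).
PAGE_URL = "https://play.google.com/store/apps/details?id=com.mojang.minecraftpe"

# Two of the relative hrefs from the session output above.
relative_hrefs = [
    "/store/apps/details?id=com.imangi.templerun",
    "/store/apps/details?id=com.mojang.scrolls",
]

# urljoin keeps the scheme and host of the base URL and replaces
# its path and query with those of the relative reference.
absolute_urls = [urljoin(PAGE_URL, href) for href in relative_hrefs]

for url in absolute_urls:
    print(url)
```

Note that the session above is written for Python 2. On Python 3 the urllib2 module no longer exists, and the equivalent import is from urllib.request import urlopen.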

*) Stripped-down version. You can take this as an example of how to provide a minimal HTML sample.

**) I can reproduce this problem: the @href values are printed as empty strings in my Chrome console. Others have run into the same problem: Chrome element inspector Xpath with @href won't show link text