MastaBot - 10 months ago
Python Question

Select multiple values using Python and XPath

I can select a single value with XPath in Python without any problem, but how do I combine several single-value XPath queries into one result?

Here is a sample fragment of the HTML source (r.content):

<div class="members">
<h2>Members</h2>
<div class="member">
<span title="Last Online:&nbsp;2017-02-20 22:37:42" data-time="2017-02-20T22:37:42Z">
<span class="profile-link">
<a href="/account/view-profile/KonterBolet">
<img class="achievement" src="36.png" alt="Completed 36" title="Completed 36">KonterA</a>
</span>
<span class="memberType">Leader</span>
</span>
</div>
<div class="member">
<span title="Last Online:&nbsp;2017-02-19 11:28:20" data-time="2017-02-19T11:28:20Z">
<span class="profile-link hasTwitch twitchOffline" data-twitch-user="mardok_tv">
<a href="/account/view-profile/mardok">
<img class="achievement" src="35.png" alt="Completed 35" title="Completed 35">mardok</a>
<a class="twitch" href="//www.twitch.tv/mardok_tv" target="_blank" title="Offline"></a>
</span>
<span class="memberType">Officer</span>
</span>
</div>
</div>


I use Python requests to fetch the content and lxml to parse it:

import requests
from lxml import html
ses = requests.session()
r = ses.get(SITE_URL)
webContent = html.fromstring(r.content)


first xpath:

acc = webContent.xpath("//span/a[contains(@href,'account/view-profile')]/text()")


and result:

['KonterA', 'mardok']


second xpath:

twitch = webContent.xpath('//span/@data-twitch-user')


and result:

['mardok_tv']


third xpath:

lastOnline = webContent.xpath('//span/@data-time')


and result:

['2017-02-20T22:37:42Z','2017-02-19T11:28:20Z']


How can I join these three together to get a result like this:

[['KonterA','','2017-02-20T22:37:42Z'],['mardok','mardok_tv','2017-02-19T11:28:20Z']]

Answer Source

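Consider parsing all the items together under the same parent by iterating over a top-level XPath expression and evaluating relative expressions on each match. Wrapping each relative expression in XPath's concat() with an empty string as the second argument returns a zero-length string '' when the attribute or element does not exist, so every list ends up the same length. The code below also uses XPath's normalize-space() to strip leading and trailing whitespace (including line breaks and carriage returns) from the values.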
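
As a small illustration of why concat() helps, here is a sketch on a made-up two-span fragment (not the page from the question; the variable names are only for illustration). A plain attribute query returns an empty list when the attribute is missing, which is what makes the three original result lists differ in length, while concat() always returns a string:

```python
from lxml import html

# Hypothetical fragment: the first <span> has the attribute, the second does not.
first, second = html.fromstring(
    '<div><span data-twitch-user="mardok_tv"> mardok\n</span>'
    '<span>KonterA</span></div>'
)

missing_as_list = second.xpath("@data-twitch-user")             # node-set -> list
missing_as_str = second.xpath("concat(@data-twitch-user, '')")  # always a string
present = first.xpath("concat(@data-twitch-user, '')")
trimmed = first.xpath("normalize-space(.)")                     # whitespace stripped

print(missing_as_list, repr(missing_as_str), present, trimmed)
# [] '' mardok_tv mardok
```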

from lxml import html

# PARSING POSTED SNIPPET AS STRING
# (htmlstr IS ASSUMED TO HOLD THE HTML FRAGMENT FROM THE QUESTION)
webContent = html.fromstring(htmlstr)

# INITIALIZING LISTS
acc = []; twitch = []; lastOnline = []

# ITERATING THROUGH THE FIRST NESTED <SPAN> (profile-link) OF EACH MEMBER
for i in webContent.xpath("//span/span[1]"):
    acc.append(i.xpath("concat(normalize-space(a[contains(@href,'account/view-profile')]),'')"))
    twitch.append(i.xpath("concat(@data-twitch-user, '')"))
    lastOnline.append(i.xpath("concat(../@data-time, '')"))

# ZIP EQUAL LENGTH LISTS
xpath_list = list(zip(acc, twitch, lastOnline))

print(xpath_list)
# [('KonterA', '', '2017-02-20T22:37:42Z'), ('mardok', 'mardok_tv', '2017-02-19T11:28:20Z')]
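
If a list of lists is preferred, as in the question's desired output, one possible variation is to iterate over each member div and build a row per member directly, without the intermediate lists. This is only a sketch: SAMPLE is a trimmed copy of the fragment from the question, member_rows is a hypothetical helper name, and XPath's string() is used here in place of concat(..., '') to the same effect.

```python
from lxml import html

# Trimmed copy of the HTML fragment from the question (assumed input).
SAMPLE = """
<div class="members">
  <div class="member">
    <span title="Last Online: 2017-02-20 22:37:42" data-time="2017-02-20T22:37:42Z">
      <span class="profile-link">
        <a href="/account/view-profile/KonterBolet">
          <img class="achievement" src="36.png" alt="Completed 36">KonterA</a>
      </span>
      <span class="memberType">Leader</span>
    </span>
  </div>
  <div class="member">
    <span title="Last Online: 2017-02-19 11:28:20" data-time="2017-02-19T11:28:20Z">
      <span class="profile-link hasTwitch twitchOffline" data-twitch-user="mardok_tv">
        <a href="/account/view-profile/mardok">
          <img class="achievement" src="35.png" alt="Completed 35">mardok</a>
        <a class="twitch" href="//www.twitch.tv/mardok_tv" target="_blank" title="Offline"></a>
      </span>
      <span class="memberType">Officer</span>
    </span>
  </div>
</div>
"""


def member_rows(tree):
    """Return one [account, twitch_user, last_online] list per member div."""
    rows = []
    for member in tree.xpath("//div[@class='member']"):
        rows.append([
            # Link text of the profile link, with surrounding whitespace removed.
            member.xpath("normalize-space(.//a[contains(@href, 'account/view-profile')])"),
            # string() yields '' when no span carries the attribute.
            member.xpath("string(.//span/@data-twitch-user)"),
            member.xpath("string(span/@data-time)"),
        ])
    return rows


rows = member_rows(html.fromstring(SAMPLE))
print(rows)
# [['KonterA', '', '2017-02-20T22:37:42Z'], ['mardok', 'mardok_tv', '2017-02-19T11:28:20Z']]
```

Keeping every query relative to the same member element ties each value to its member, so a missing Twitch name cannot shift the values of later members.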