Ke Tian - 6 months ago
Python Question

How to use XPath to extract the director's name from HTML with Python 3

I want to extract the directors' names (such as tom) from the following HTML using XPath in Python 3. This is only a sample; the full page is at http://movie.walkerplus.com/list/2015/12/.
How can I do this?
Thanks in advance!

<title> ufffff</title>
<div class="hiragana">2015<br>Dec 1st</br></div>
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">007</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">bruce</a>
</dd>
</dl>
</div>
</div>
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">wind love</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">tom</a>
</dd>
</dl>
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">river war</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">July</a>
</dd>
</dl>
</div>
</div>
<div class="mwb">
<div class="hiraganaLocalNavi">
<ul class="page_12">
<li class="text">o</li>
<li><a class="m01" href="/list/2015/01/">1月</a></li>
<li><a class="m02" href="/list/2015/02/">2月</a></li>
<li><a class="m03" href="/list/2015/03/">3月</a></li>
<li><a class="m04" href="/list/2015/04/">4月</a></li>
<li><a class="m05" href="/list/2015/05/">5月</a></li>
<li><a class="m06" href="/list/2015/06/">6月</a></li>
<li><a class="m07" href="/list/2015/07/">7月</a></li>
<li><a class="m08" href="/list/2015/08/">8月</a></li>
<li><a class="m09" href="/list/2015/09/">9月</a></li>
<li><a class="m10" href="/list/2015/10/">10月</a></li>
<li><a class="m11" href="/list/2015/11/">11月</a></li>
<li><a class="m12" href="/list/2015/12/">12月</a></li>
</ul>
</div>
</div>
..................

Answer

Definitely use lxml for this rather than a regular expression. Like this:

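from io import StringIO
from lxml import etree

# HTMLParser tolerates unclosed tags; the default XML parser would reject them
tree = etree.parse(StringIO(your_html_text), etree.HTMLParser())
# Text of every <a> under <dd> inside a <dl> whose class list contains directorList
what_you_are_looking_for = tree.xpath(
    "//dl[contains(concat(' ', @class, ' '), ' directorList ')]/dd/a/text()")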

This is a robust way of getting the data you want, and it tolerates messy real-world HTML (missing tags, data moving around, etc.) much better than a regular expression.
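To make the point about messy markup concrete, here is a minimal sketch run against a trimmed copy of the HTML from the question, including its unclosed div tags. The names SAMPLE_HTML and extract_directors are made up for this illustration, and the sketch assumes every movie block has the h3 / dl.directorList layout shown in the sample; the real page may differ.

```python
from lxml import etree

# Trimmed copy of the question's markup. The second "movies" block is
# deliberately left unclosed, exactly as in the question.
SAMPLE_HTML = """
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">007</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">bruce</a>
</dd>
</dl>
</div>
</div>
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">wind love</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">tom</a>
</dd>
</dl>
<div class="movies">
<div class="movie">
<h3><a href="/mv57512/">river war</a></h3>
<dl class="directorList">
<dt>director</dt>
<dd>
<a href="/person/152394/" title="">July</a>
</dd>
</dl>
</div>
</div>
"""


def extract_directors(html_text):
    """Return a list of (movie title, [director names]) tuples.

    Hypothetical helper for illustration; it assumes the layout of the
    sample above.
    """
    # etree.HTML uses the lenient HTML parser and returns the root element.
    root = etree.HTML(html_text)
    results = []
    for movie in root.xpath('//div[@class="movie"]'):
        # Relative paths with "./" look only at direct children, so a movie
        # nested inside an unclosed block does not leak into its parent.
        title = movie.xpath('string(./h3/a)').strip()
        names = [
            name.strip()
            for name in movie.xpath('./dl[@class="directorList"]/dd/a/text()')
        ]
        results.append((title, names))
    return results


if __name__ == "__main__":
    for title, names in extract_directors(SAMPLE_HTML):
        print(title, names)
```

Iterating per movie and then using relative XPath expressions keeps each title paired with its own directors, which a single flat //dd/a/text() query would not do.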

You can read more about it here. Cheers!