sheshkovsky - 1 year ago 85

# How to get an element having its relative XPath?

I have an XML file. After parsing it with `lxml` as an `etree`, I can get all of its tags as follows:

root = tree.getroot()
for e in root.iter():
    print(e.tag)


and the output is something like this:

'{http://www.w3.org/1999/xhtml}html'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}script'
'{http://www.w3.org/1999/xhtml}body'
'{http://www.w3.org/1999/xhtml}section'
'{http://www.w3.org/1999/xhtml}h1'
'{http://www.w3.org/1999/xhtml}p'
'{http://www.w3.org/1999/xhtml}em'
'{http://www.w3.org/1999/xhtml}section'
'{http://www.w3.org/1999/xhtml}h1'
'{http://www.w3.org/1999/xhtml}p'
'{http://www.w3.org/1999/xhtml}a'
'{http://www.w3.org/1999/xhtml}p'
'{http://www.w3.org/1999/xhtml}p'


I want to get some elements by relative path using Python with lxml or bs4. For example, I want the first `p` element in the second `section`, and I have the following relative path: `/section[2]/p[1]`.

But I cannot even get all the sections with the following code, which returns `None`:

xhtml = "{http://www.w3.org/1999/xhtml}"
section = xhtml + "section"
root.find(section)
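
A likely reason for the `None`: a bare tag name passed to `find` is matched against the direct children of the element only, and in this file the `section` elements sit under `body`, not directly under `html`. The sketch below illustrates this on a hypothetical miniature of the document (the `SAMPLE` string is made up for illustration, and the stdlib fallback import is only there so the snippet runs without lxml):

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Stdlib fallback; it offers the same find/findall calls used below.
    import xml.etree.ElementTree as etree

# Hypothetical miniature of the file, for illustration only.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body>'
    '<section><h1>Abstract</h1><p>First.</p></section>'
    '<section><h1>Introduction</h1><p>Second.</p><p>Third.</p></section>'
    '</body>'
    '</html>'
)

root = etree.fromstring(SAMPLE)
xhtml = "{http://www.w3.org/1999/xhtml}"

# A bare tag name only looks at the direct children of <html>.
direct = root.find(xhtml + "section")

# The ".//" prefix widens the search to all descendants.
sections = root.findall(".//" + xhtml + "section")

print(direct)         # None
print(len(sections))  # 2
```

Prefixing the path with `.//` is what makes the search descend below `body`.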


EDIT: Here's part of the original file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="grammar/rash.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml" prefix="schema: http://schema.org/ prism: http://prismstandard.org/namespaces/basic/2.0/">
<meta charset="UTF-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<script src="js/jquery.min.js"><![CDATA[ ]]></script>
<script src="js/bootstrap.min.js"><![CDATA[ ]]></script>
<script src="js/rash.js"><![CDATA[ ]]></script>
<title>It ROCS! -- The RASH Online Conversion Service</title>
<meta about="#affiliation-1" property="schema:name" content="Department of Computer Science and Engineering, University of Bologna, Italy"/>
<meta about="#affiliation-2" property="schema:name" content="Oxford e-Research Centre, University of Oxford, UK"/>
<meta about="#affiliation-3" property="schema:name" content="Knowledge Media Institute, Open University, UK"/>
<meta property="prism:keyword" content="HTML-based format"/>
<meta property="prism:keyword" content="Scholarly HTML"/>
<meta property="prism:keyword" content="RASH"/>
<body>
<section role="doc-abstract">
<h1>Abstract</h1>
<p>In this poster paper we introduce the <em>RASH Online Conversion Service</em>, i.e., a Web application that allows the conversion of ODT documents into RASH, a HTML-based markup language for writing scholarly articles, and from RASH into LaTeX. This tool allows authors with no experience in HTML to easily produce HTML-based papers and supports the publishing process by generating also a LaTeX version according to the Springer LNCS and ACM ICPS layouts.</p>
</section>
<section>
<h1>Introduction</h1>
<p>The use of HTML as format for writing scholarly papers and submitting them to scholarly venues is a very popular, discussed and trendy topic within the scholarly domain. This is demonstrated by the existence of several posts within technical mailing lists of the Web community<a href="#ftn0"> </a>, by the birth of W3C community groups on such topic<a href="#ftn3"> </a>, by the development of HTML-based formats for scholarly articles<a href="#ftn4"> </a>, and by the increasing number of events that are experimenting with HTML-based formats for submissions, such as the SAVE-SD<a href="#ftn5"> </a> and LDOW<a href="#ftn6"> </a> workshops at WWW 2016, and the Extended Semantic Web Conference<a href="#ftn7"> </a>.</p>
<p>In order to foster a wider adoption of these formats, frameworks for HTML-based papers should support the needs of all the actors involved in the production, delivery and fruition of scholarly articles, with particular regards to authors and publishers. Hence, this solution calls for a number of requirements that go well beyond those used on the Web. </p>
<p>First of all, it is vital to support authors with a variety of tools to provide for an easy transition to the new format. To this end, authors should be allowed to keep using well-known current word processors rather than adopting HTML and/or pure text editors. We thus need to support the conversion from the main word processor formats (e.g., ODT and OOXML) to HTML formats, in particular when authors use only basic features, such as standard styles for paragraphs and tables. In addition, authors should be given the option to focus on the content and let appropriate tools handle the presentation layer after the conversion into the HTML-based format.</p>


In this example I want to get the `<p>` element that starts with this sentence: "The use of HTML as format for writing scholarly..."
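
One possible way to select that element is sketched below. It rests on two assumptions: the path has to start with `.//` rather than `/`, since `section` is not at the top of the tree, and every step needs a namespace prefix, because the file declares a default XHTML namespace. The prefix `x` and the `SAMPLE` string are arbitrary choices made for this illustration, not part of the original file.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Stdlib fallback; find() with a namespaces map behaves the same here.
    import xml.etree.ElementTree as etree

# Hypothetical miniature of the file, for illustration only.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body>'
    '<section role="doc-abstract"><h1>Abstract</h1>'
    '<p>In this poster paper we introduce a service.</p></section>'
    '<section><h1>Introduction</h1>'
    '<p>The use of HTML as format for writing scholarly papers.</p>'
    '<p>In order to foster a wider adoption.</p></section>'
    '</body>'
    '</html>'
)

root = etree.fromstring(SAMPLE)

# Map an arbitrary prefix to the XHTML namespace declared on <html>.
ns = {"x": "http://www.w3.org/1999/xhtml"}

# Second <section> among its siblings, then its first <p> child.
p = root.find(".//x:section[2]/x:p[1]", namespaces=ns)

print(p.text)
```

With lxml, `root.xpath("//x:section[2]/x:p[1]", namespaces=ns)` should accept the same prefix map; it returns a list of matches instead of a single element. Note that `p.text` holds only the text before the first child element, so on the real file (where the paragraph contains `<a>` tags) it would stop at the first footnote link.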

from lxml import etree