Z Shiliao - 6 months ago
HTML Question

How to identify a node with its XML value in XPath?

I use R to scrape a website, and when parsing the HTML I get the markup below:

<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>


Now I would like to extract some values from this markup.


  • How can I identify the span whose XML value is "Number" and get that node, in order to extract "number extra"?
    I know how to use xpathApply to identify nodes in order to get the xmlValue or some attributes (like href with xmlGetAttr), but I don't know how to identify a node from its xmlValue.

    xpathApply(page, '//span[@class="property"]',xmlValue)

  • If I want to get the "value" 72 for the "property" span "Surface", what is the most efficient way?



Here is what I started to do.
First, I extract all "property" spans:

xpathApply(page, '//span[@class="property"]',xmlValue)


Then I extract all "value" spans:

xpathApply(page, '//span[@class="value"]',xmlValue)


Then I build a list or a matrix so that I can look up the value of "Surface", which is 72. The problem is that sometimes a span with class="property" is not followed by a span with class="value" in the same h2, so the two lists do not line up and I cannot build a proper list.

Could this be the most efficient way: identify the span with class="property", then identify the h2 that contains this span, then identify the span with class="value"?

Answer

For your HTML, made well-formed by adding a single root element,

<?xml version="1.0" encoding="UTF-8"?>
<r> 
  <div class="line"> 
    <h2 class="clearfix"> 
      <span class="property">Number
        <div>number extra</div>
      </span>  
      <span class="value">3</span> 
    </h2> 
  </div>  
  <div class="line"> 
    <h2 class="clearfix"> 
      <span class="property">Surface</span>  
      <span class="value">72</span> 
    </h2> 
  </div> 
</r>

(A) This XPath expression,

//span[@class='property' and starts-with(., 'Number')]/div/text()

will return

number extra

as requested.
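
Note that starts-with is used here rather than an equality test because the string value of the span includes the text of its nested div, so it is not exactly "Number". In R, expression (A) can be handed to the XML package much like the xpathApply calls in the question. The following is only a minimal sketch, assuming the well-formed sample is held in a string; the names doc_text and extra are illustrative and not from the question.

```r
library(XML)

# The well-formed sample from above, held in a string for illustration.
doc_text <- '<r>
  <div class="line">
    <h2 class="clearfix">
      <span class="property">Number<div>number extra</div></span>
      <span class="value">3</span>
    </h2>
  </div>
  <div class="line">
    <h2 class="clearfix">
      <span class="property">Surface</span>
      <span class="value">72</span>
    </h2>
  </div>
</r>'

page <- xmlParse(doc_text, asText = TRUE)

# Select the <div> nested in the "Number" property span and read its text.
extra <- xpathSApply(
  page,
  "//span[@class='property' and starts-with(., 'Number')]/div",
  xmlValue
)
print(extra)
```

Selecting the div node and applying xmlValue gives the same text as ending the path in /text(), while staying close to the style of the calls already used in the question.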


(B) This XPath expression,

//h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()

will return

72

as requested.
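
Expression (B) follows the route suggested at the end of the question: find the h2 that contains the wanted property span, then take the value span inside that same h2. A sketch of how this might look in R, and of one way to build the property/value list when some h2 has no value span, is given below. It assumes the XML package; the "Rooms" entry without a value and the helper first_or_na are hypothetical, added only to illustrate the missing-value case.

```r
library(XML)

# Sample with one complete pair and one hypothetical property lacking a value.
doc_text <- '<r>
  <div class="line"><h2 class="clearfix">
    <span class="property">Surface</span>
    <span class="value">72</span>
  </h2></div>
  <div class="line"><h2 class="clearfix">
    <span class="property">Rooms</span>
  </h2></div>
</r>'

page <- xmlParse(doc_text, asText = TRUE)

# Expression (B): the value span that shares an <h2> with "Surface".
surface <- xpathSApply(
  page,
  "//h2[span[@class='property' and . = 'Surface']]/span[@class='value']",
  xmlValue
)

# Hypothetical helper: text of the first match under `node`, or NA if none.
first_or_na <- function(node, path) {
  hits <- getNodeSet(node, path)
  if (length(hits) == 0) NA_character_ else xmlValue(hits[[1]])
}

# Work one <h2> at a time so each property stays paired with its own value.
h2_nodes <- getNodeSet(page, "//h2[span[@class='property']]")
values <- sapply(h2_nodes, first_or_na, path = "./span[@class='value']")
names(values) <- sapply(h2_nodes, first_or_na, path = "./span[@class='property']")

print(surface)
print(values)
```

Because both lookups are made relative to the same h2, a property without a value yields NA instead of shifting the values of later properties, which is the misalignment described in the question.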