Alsqer Alsqer - 1 year ago 72
R Question

Searching for multiple paths using same code for xpathSApply

I'm trying to extract a table which contains an Arabic poem. You can check the poem here.

I tried to parse the table...

library(XML)

URL <- ""
Data <- htmlTreeParse(URL, useInternalNodes = TRUE, encoding = "Windows-1256")
Poem <- xpathSApply(Data, "//p[@class='poem']", xmlValue)
Poem1 <- xpathSApply(Data, "//font[@class='poem']", xmlValue)
Encoding(Poem) <- "UTF-8"
Encoding(Poem1) <- "UTF-8"

But that's not good, because it changes the order in which the poem was written: all the p lines end up in one vector and all the font lines in another.

So, is there a way to get this table with a single call, in the order it appears at the URL? Something like:

Poem <- xpathSApply(Data,"//p[@class='poem']&//font[@class='poem']",xmlValue)

Answer

The question is actually about appropriate selectors to grab multiple tags with a class of "poem". There are a few options. A simple option is to use a wildcard character * for the tag name in the XPath selector:

Poem <- xpathSApply(Data,"//*[@class='poem']",xmlValue)
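
To see what the wildcard picks up, here is a minimal sketch on a small inline page. The HTML string and the names html, doc and wild are made up for illustration and only stand in for the real URL.

```r
library(XML)

# Hypothetical stand-in for the real page: three different tags share class "poem"
html <- '<html><body>
<p class="poem">first line</p>
<font class="poem">second line</font>
<div class="poem">a note, not a verse</div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# The wildcard matches any element whose class attribute is exactly "poem",
# regardless of tag name, so the <div> is returned as well
wild <- xpathSApply(doc, "//*[@class='poem']", xmlValue)
print(wild)
```

Note that the div comes back too, which is why the wildcard is only appropriate when every element of that class is wanted.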

If you only want p and font tags of class "poem", but not, say, a div tag of the same class, you can use the | (union) operator to combine several paths in one expression. The & in your attempt is not an XPath operator. Translated to rvest, which I find a little easier to read (though the same selector works fine in xpathSApply as well):


library(rvest)

Poem <- URL %>% read_html() %>%
    html_nodes(xpath = '//p[@class="poem"] | //font[@class="poem"]') %>%
    html_text(trim = TRUE)
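
Since the same selector also works in xpathSApply, here is a rough sketch of how the union differs from two separate queries when it comes to ordering. The inline HTML and the names separate and combined are invented for the example. It assumes the XML package (built on libxml2) returns the union in document order, which is what the output further down suggests.

```r
library(XML)

# Hypothetical page where <p> and <font> lines alternate
html <- '<html><body>
<p class="poem">title</p>
<font class="poem">verse 1</font>
<p class="poem">separator</p>
<font class="poem">verse 2</font>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Two separate queries, concatenated: every <p> first, then every <font>
separate <- c(
  xpathSApply(doc, "//p[@class='poem']", xmlValue),
  xpathSApply(doc, "//font[@class='poem']", xmlValue)
)

# One query using the union operator: nodes come back as they appear in the page
combined <- xpathSApply(doc, "//p[@class='poem'] | //font[@class='poem']", xmlValue)

print(separate)
print(combined)
```

The first vector groups lines by tag, which is the problem described in the question; the second keeps them interleaved as written.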

Another option if using rvest is to use CSS selectors instead of XPath ones. In CSS, class is specified by ., so all you need for a wildcard version is ".poem"; to limit to only p or font tags, use "p.poem, font.poem". Here's a fun tutorial on CSS selectors, if you like.

Poem <- URL %>% read_html() %>% 
    html_nodes(css = '.poem') %>% 
    html_text(trim = TRUE)

head(Poem, 15)    # I don't speak Arabic, so check that the results make sense
##  [1] "أقداح و أحلام"             "أنا لا أزال و في يدي قدحي" "ياليل أين تفرق الشرب"     
##  [4] "ما زلت أشربها و أشربها"    "حتى ترنح أفقك الرحب"       "الشرق عُفر بالضباب فما"    
##  [7] "يبدو فأين سناك يا غرب؟"    "ما للنجوم غرقن ، من سأم"   "في ضوئهن و كادت الشهب ؟"  
## [10] "أنا لا أزال و في يدي قدحي" "ياليل أين تفرق الشرب ؟"    "******"                   
## [13] "الحان بالشهوات مصطخب"      "حتى يكاد بهن ينهار"        "و كأن مصاحبيه من ضرج" 
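
For comparison, a small sketch of the two CSS selectors mentioned above, run on a made-up inline page rather than the real URL (html, page, any_tag and p_or_font are illustrative names):

```r
library(rvest)

# Hypothetical stand-in for the real page
html <- '<html><body>
<p class="poem">title</p>
<font class="poem"> verse 1 </font>
<div class="poem">footnote</div>
</body></html>'

page <- read_html(html)

# ".poem" matches any tag carrying the class
any_tag <- page %>% html_nodes(css = ".poem") %>% html_text(trim = TRUE)

# "p.poem, font.poem" restricts the match to those two tags
p_or_font <- page %>% html_nodes(css = "p.poem, font.poem") %>% html_text(trim = TRUE)

print(any_tag)
print(p_or_font)
```

One difference worth knowing: the CSS selector .poem also matches elements with several classes, such as class="poem big", whereas the XPath test @class='poem' only matches when the attribute is exactly "poem".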