mrquad - 3 months ago
R Question

rvest - Scrape list and store items separately

I am trying to scrape information from a webpage:

rm(list = ls())

library(rvest)
library(XML)
library(dplyr)

utils::setInternet2(TRUE)
options(download.file.method = "internal")

url <- "http://www.home24.at/smood/premium-komfortmatratze-smood-180-x-200cm"

pgsession <- html_session(url) ## create session
pgform <- html_form(pgsession)[[1]] ## pull form from session

pflege <- pgsession %>%
  jump_to(url) %>%
  read_html() %>%
  html_nodes(xpath = "//*[@id='product-details']/div/div[2]/div[2]/div[2]/div[5]/ul") %>%
  html_text()


This returns a single string with the list items run together:

"Doppeltuchbezug bis 95°C waschbarWebstoffbezug kann in die Reinigung gegeben werden"


However, I would like each list item returned as a separate element:

"Doppeltuchbezug bis 95°C waschbar", "Webstoffbezug kann in die Reinigung gegeben werden"


Any suggestions on how to separate the strings and scrape each list item individually?

Answer

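You just need an XPath or CSS selector that matches each list item (the li elements) rather than the enclosing ul. html_text() returns one string per matched node, so selecting the ul yields a single concatenated string, while selecting the li nodes yields one string per item. To find an appropriate selector, inspect the HTML in a web browser; automatically generated selectors are rarely optimal.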

# pull page once and store in case you want to parse multiple elements
page <- pgsession %>% jump_to(url) %>% read_html()

page %>% html_nodes(xpath = '//*[@data-reactid="350"]/li') %>% html_text()

## [1] "Doppeltuchbezug bis 95°C waschbar"                 
## [2] "Webstoffbezug kann in die Reinigung gegeben werden"
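
To see why the selector matters without depending on the live page (whose markup and data-reactid values may change), here is a minimal sketch on an inline HTML snippet. The snippet is a hypothetical stand-in for the product page, not its actual markup, and the names html, whole, items_xpath and items_css are made up for illustration. In rvest 1.0 and later, html_elements() is the preferred name for html_nodes(), though the older name should still work.

```r
library(rvest)

# Hypothetical stand-in for the product page: one ul holding two li items.
# "\u00b0" is the degree sign, written as an escape to avoid encoding issues.
html <- '<div id="product-details">
  <ul>
    <li>Doppeltuchbezug bis 95\u00b0C waschbar</li>
    <li>Webstoffbezug kann in die Reinigung gegeben werden</li>
  </ul>
</div>'
page <- read_html(html)

# Selecting the ul matches a single node, so html_text() returns one string
# containing the text of every list item.
whole <- page %>% html_nodes(xpath = "//ul") %>% html_text()
length(whole)

# Selecting the li children matches one node per item,
# so html_text() returns one string per item.
items_xpath <- page %>%
  html_nodes(xpath = "//div[@id='product-details']//ul/li") %>%
  html_text()

# The same selection written as a CSS selector.
items_css <- page %>%
  html_nodes(css = "#product-details ul li") %>%
  html_text()

length(items_xpath)
identical(items_xpath, items_css)
```

Anchoring the selector on a stable attribute such as an id and then descending to li is usually less brittle than a long positional path or a generated attribute value, but which attributes are stable on the real page is something to check in the browser.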