Vasile Vasile - 1 month ago 6
R Question

Unequal number of elements when webscraping

I want to scrape some car data from autotrader.co.uk. Each page of search results on this site lists 12 cars. I scrape the price and the model separately with rvest, which gives me two vectors of 12 elements each. However, I cannot scrape miles, age, etc. separately, because they sit in a single line with other variables, and their position for each car can change depending on how many variables the seller included.
In the included image, the CSS selector that returns the registration year for the Toyota returns CAT C for the Ford KA instead, because for that car the year is in the second position. So I have to use the CSS selector for the entire line to capture the information.

enter image description here

I decided to scrape the entire line (I named the resulting vector info). However, this approach gives me a vector of 80+ elements (one for each variable, such as year, miles, etc.). The problem is that I would like to join the model, price and the rest of the info in a data frame, and I cannot do this because info has more elements than the first two vectors.

The code I used:

URL <- "http://www.autotrader.co.uk/car-search?sort=price-asc&radius=1500&postcode=np198jj&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&page="
link <- read_html(URL)
price <- html_nodes(link, ".search-result__price") %>%
  html_text()
info <- html_nodes(link, ".search-result__attributes li") %>%
  html_text()


Using xpath for info gives the same 80+ elements.
I also tried to concatenate the elements for each car in info, but was not successful:

str_replace_all(info, collapse = "---")


So my question is: how can I scrape the information on year, miles, etc. so that it ends up as one element per car? Alternatively, maybe there is a way to target the year, miles and the rest of the variables separately.

Answer

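I fixed the URL (removed the empty trailing &page= parameter) and dropped the li from the attributes selector, so each car yields a single element: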

library(rvest)
URL <- "http://www.autotrader.co.uk/car-search?sort=price-asc&radius=1500&postcode=np198jj&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New"
link <- read_html(URL)
price <- html_nodes(link, ".search-result__price") %>%
  html_text()
info <- html_nodes(link, ".search-result__attributes") %>%
  html_text()
identical(length(price), length(info))
# [1] TRUE
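
If the year and mileage are also wanted as separate columns, one possible approach is to walk over the per-car nodes, collapse each car's li items into one string, and then pull fields out by pattern rather than by position. The following is only a rough sketch: the HTML snippet is made-up toy markup modelled on the selectors above (not the site's real page), the helper first_match is a hypothetical name, and the regular expressions assume text such as "2004 (04 reg)" and "90,000 miles".

```r
library(rvest)

# Toy HTML standing in for a results page. The markup and the text
# formats are assumptions for illustration, not the site's real structure.
html <- '
<ul class="search-result__attributes">
  <li>2004 (04 reg)</li><li>Hatchback</li><li>90,000 miles</li>
</ul>
<ul class="search-result__attributes">
  <li>CAT C</li><li>2002 (52 reg)</li><li>75,000 miles</li>
</ul>'
page <- read_html(html)

# One node per car, then collapse that car's <li> items into one string.
cars <- html_nodes(page, ".search-result__attributes")
info <- vapply(cars, function(car) {
  paste(html_text(html_nodes(car, "li")), collapse = " | ")
}, character(1))

# Hypothetical helper: first regex match per element, NA when absent.
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Extract fields by pattern, so their position in the line does not matter.
year  <- as.integer(first_match(info, "\\b(19|20)\\d{2}\\b"))
miles <- as.numeric(gsub(",", "", first_match(info, "[0-9,]+(?= miles)")))

cars_df <- data.frame(info, year, miles, stringsAsFactors = FALSE)
```

Because info, year and miles each have one entry per car, they can be combined with the price and model vectors in the same data.frame call. Cars whose listing lacks a field get NA in that column instead of shifting the other values.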