Shikhar Parashar - 2 months ago
R Question

Web Scraping using rvest in R

I have been trying to scrape information from a URL in R using the rvest package:

url <-'https://eprocure.gov.in/cppp/tendersfullview/id%3DNDE4MTY4MA%3D%3D/ZmVhYzk5NWViMWM1NTdmZGMxYWYzN2JkYTU1YmQ5NzU%3D/MTUwMjk3MTg4NQ%3D%3D'


but I am not able to correctly identify the XPath, even after using a selector plugin.

The code I am using to fetch the first table is as follows:

detail_data <- read_html(url)
detail_data_raw <- html_nodes(detail_data, xpath='//*[@id="edit-t-fullview"]/table[2]/tbody/tr[2]/td/table')
detail_data_fine <- html_table(detail_data_raw)


When I try the above code, detail_data_raw is {xml_nodeset (0)} and consequently detail_data_fine is an empty list().

The information I am interested in scraping is under the headers:

Organisation Details

Tender Details

Critical Dates

Work Details

Tender Inviting Authority Details

Any help or ideas on what is going wrong and how to fix it are welcome.

Answer

Your example URL isn't working for anyone, but if you're looking to get the data for a particular tender, then:

library(rvest)
library(stringi)
library(tidyverse)

pg <- read_html("https://eprocure.gov.in/mmp/tendersfullview/id%3D2262207")

# Column 1 holds the field labels: strip the trailing " :" and convert to snake_case
html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[1]") %>%
  html_text(trim=TRUE) %>%
  stri_replace_last_regex(" +:$", "") %>%
  stri_replace_all_fixed(" ", "_") %>%
  stri_trans_tolower() -> tenders_cols

# Column 2 holds the values: name them with the labels and build a one-row data frame
html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[2]") %>%
  html_text(trim=TRUE) %>%
  as.list() %>%
  set_names(tenders_cols) %>%
  flatten_df() %>%
  glimpse()
## Observations: 1
## Variables: 15
## $ organisation_name            <chr> "Delhi Jal Board"
## $ organisation_type            <chr> "State Govt. and UT"
## $ tender_reference_number      <chr> "Short NIT. No.20 (Item no.1) EE ...
## $ tender_title                 <chr> "Short NIT. No.20 (Item no.1)"
## $ product_category             <chr> "Civil Works"
## $ tender_fee                   <chr> "Rs.500"
## $ tender_type                  <chr> "Open/Advertised"
## $ epublished_date              <chr> "18-Aug-2017 05:15 PM"
## $ document_download_start_date <chr> "18-Aug-2017 05:15 PM"
## $ bid_submission_start_date    <chr> "18-Aug-2017 05:15 PM"
## $ work_description             <chr> "Replacement of settled deep sewe...
## $ pre_qualification            <chr> "Please refer Tender documents."
## $ tender_document              <chr> "https://govtprocurement.delhi.go...
## $ name                         <chr> "EXECUTIVE ENGINEER (NORTH)-II"
## $ address                      <chr> "EXECUTIVE ENGINEER (NORTH)-II\r\...

This seems to work just fine without installing Python or using Selenium.
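One common cause of an empty node set with a browser-copied XPath is the `tbody` step: browsers insert `<tbody>` into the DOM they display, while the parser behind `read_html()` keeps the source as written. Whether that is what happened on the page in the question cannot be confirmed, since that URL no longer loads. The sketch below uses a small, made-up HTML fragment (not the real tender page) to show the difference, and then pairs labels with values using base R only:

```r
library(rvest)

# Hypothetical fragment standing in for the tender page; the real markup may differ
html <- '<table class="viewtablebg">
  <tr><td>Organisation Name :</td><td>Delhi Jal Board</td></tr>
  <tr><td>Product Category :</td><td>Civil Works</td></tr>
</table>'

pg <- read_html(html)

# XPath as a browser inspector would report it, including the inserted <tbody>
with_tbody <- html_nodes(pg, xpath = "//table[@class='viewtablebg']/tbody/tr")

# XPath matching the raw source, where <tr> is a direct child of <table>
without_tbody <- html_nodes(pg, xpath = "//table[@class='viewtablebg']/tr")

# Compare how many rows each XPath matches
print(length(with_tbody))
print(length(without_tbody))

# First cell of each row is the label, second cell is the value
labels <- html_text(
  html_nodes(pg, xpath = "//table[@class='viewtablebg']/tr/td[1]"),
  trim = TRUE
)
values <- html_text(
  html_nodes(pg, xpath = "//table[@class='viewtablebg']/tr/td[2]"),
  trim = TRUE
)

# Drop the trailing " :" and convert the labels to snake_case column names
labels <- tolower(gsub(" ", "_", sub(" *:$", "", labels)))

# One-row data frame with one column per label
tender <- as.data.frame(as.list(setNames(values, labels)),
                        stringsAsFactors = FALSE)
print(tender)
```

If an XPath copied from the browser returns nothing, removing the `tbody` steps (or replacing a long absolute path with a short class-based one, as in the answer above) is a cheap first thing to try before reaching for a browser-driving tool.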