rfairy - 3 months ago
R Question

Web scraping an HTML table using R

I have already had some help from Stack Overflow users with this problem, but I have run into new trouble:

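library("rvest")    # read_html(), html_nodes(), html_text() and the %>% pipe
library("stringr")  # str_replace_all()

URL <- "http://karakterstatistik.stads.ku.dk/Histogram/ASOB05038E/Summer-2015"
pg <- read_html(URL)

# Return the trimmed text of the cells that follow, in each row,
# the first cell containing `label`
get_val <- function(x, label) {
  xpath <- sprintf(".//table/tr/td[contains(., '%s')][1]/following-sibling::td", label)
  html_nodes(x, xpath = xpath) %>%
    html_text() %>%
    trimws()
}

# Strip any remaining newlines, tabs and carriage returns
trimmed <- get_val(pg, "Karakter") %>%
  str_replace_all(pattern = "\\n|\\t|\\r", replacement = "")
trimmed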


I want the exam results for both the exam and the retake, but because the two tables have the same headline, my code only returns the values from the retake.
To be specific, I would like the "Antal" column right next to the grades (12, 10, 7, 4, 02, 00, -3) in both tables under the headline "Resultater".

Any help would be appreciated a lot! :)

Answer
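# Assumes library(rvest) is loaded and `pg` is the parsed page from the question.
# Select each <table> that is a child of a half-width <td> containing an <h3> with "Resultater".
results <- html_nodes(pg, xpath=".//td[@style='width: 50%;' and 
                                       descendant::h3[contains(text(), 'Resultater')]]/table")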

html_table(results[[1]])
##   Karakter Antal    Antal
## 1       12    11  (9,6 %)
## 2       10    48 (41,7 %)
## 3        7    41 (35,7 %)
## 4        4     4  (3,5 %)
## 5       02     1  (0,9 %)
## 6       00     1  (0,9 %)
## 7       -3     4  (3,5 %)
## 8  Ej mødt     5  (4,3 %)

html_table(results[[2]])
##   Karakter Antal    Antal
## 1       12     0  (0,0 %)
## 2       10     0  (0,0 %)
## 3        7     1  (9,1 %)
## 4        4     1  (9,1 %)
## 5       02     1  (9,1 %)
## 6       00     1  (9,1 %)
## 7       -3     0  (0,0 %)
## 8  Ej mødt     7 (63,6 %)
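
Both data frames above still have two columns named Antal, and the percentages are text such as "(9,6 %)". As a follow-up, here is a minimal base-R sketch of one way to combine the two tables and keep only the grade rows asked about in the question. The data frames `first` and `second` are stand-ins holding a few rows copied from the output above so the example runs offline. The helper `tidy_counts()` and the labels "first" and "second" are hypothetical names, not part of rvest.

```r
# Stand-ins for html_table(results[[1]]) and html_table(results[[2]]).
# Only three rows of each are copied from the output above to keep this short.
first <- data.frame(
  Karakter = c("12", "-3", "Ej m\u00f8dt"),
  Antal    = c(11L, 4L, 5L),
  Antal    = c("(9,6 %)", "(3,5 %)", "(4,3 %)"),
  check.names = FALSE, stringsAsFactors = FALSE
)
second <- data.frame(
  Karakter = c("12", "-3", "Ej m\u00f8dt"),
  Antal    = c(0L, 0L, 7L),
  Antal    = c("(0,0 %)", "(0,0 %)", "(63,6 %)"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: turn one scraped table into count and percentage columns.
tidy_counts <- function(tbl, label) {
  # Columns are selected by position because two of them share the name "Antal".
  pct_text <- gsub("[^0-9,]", "", tbl[[3]])  # "(9,6 %)" becomes "9,6"
  data.frame(
    table    = label,
    karakter = tbl[[1]],
    antal    = as.integer(tbl[[2]]),
    percent  = as.numeric(sub(",", ".", pct_text, fixed = TRUE)),
    stringsAsFactors = FALSE
  )
}

# The grades listed in the question; this drops the "Ej mødt" row.
grades <- c("12", "10", "7", "4", "02", "00", "-3")

combined <- rbind(tidy_counts(first, "first"), tidy_counts(second, "second"))
combined <- combined[combined$karakter %in% grades, ]
print(combined)
```

With the live page, you would pass `html_table(results[[1]])` and `html_table(results[[2]])` to `tidy_counts()` instead of the stand-ins. This assumes both tables keep the column order shown above.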