TomNash - 1 month ago
HTML Question

How to read an HTML table and account for line breaks within cells

I have an HTML table output from a program that separates values within a cell with <br>. I've tried using XML::readHTMLTable and htmltab, but they glom together the values without any separators. I need them to be comma-separated, but I don't see any arguments to those functions to account for this. I've posted a pseudo-example of the file below. Currently it reads into two vectors, c("ABC","DEF","GHI") and c("JKLMNO","PQR","STU"), but I need the "JKLMNO" element to instead be "JKL,MNO".

<table>
<tr>
<td>
ABC<br/>
</td>
<td>
DEF<br/>
</td>
<td>
GHI<br/>
</td>
</tr>
<tr>
<td>
JKL<br/>
MNO<br/>
</td>
<td>
PQR<br/>
</td>
<td>
STU<br/>
</td>
</tr>
</table>

Answer
library(rvest)
library(dplyr)

doc <- read_html("<table>
  <tr>
    <td>
      ABC<br/>
    </td>
    <td>
      DEF<br/>
    </td>
    <td>
      GHI<br/>
    </td>
  </tr>
  <tr>
    <td>
      JKL<br/>
      MNO<br/>
    </td>
    <td>
      PQR<br/>
    </td>
    <td>
      STU<br/>
    </td>
  </tr>
</table>")

tab <- html_table(doc)[[1]] 

mutate(tab, X1=gsub("[\r\n][[:space:]]+", ",", X1))
##        X1  X2  X3
## 1     ABC DEF GHI
## 2 JKL,MNO PQR STU
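
The mutate() call above only rewrites X1. A minimal base R sketch of applying the same substitution to every column follows. The data frame is a hand-built stand-in for what html_table() returned above (an assumption, since the exact whitespace left in each cell can depend on the rvest version), and br_to_comma is just an illustrative name:

```r
# Hypothetical stand-in for the html_table() result: the multi-value cell
# still carries the newline and indentation that separated its values.
tab <- data.frame(
  X1 = c("ABC", "JKL\n      MNO"),
  X2 = c("DEF", "PQR"),
  X3 = c("GHI", "STU"),
  stringsAsFactors = FALSE
)

# Same pattern as the answer: a line break followed by indentation
br_to_comma <- function(x) gsub("[\r\n][[:space:]]+", ",", x)

# tab[] keeps the data frame shape while replacing every column
tab[] <- lapply(tab, br_to_comma)
tab
```

Assigning into tab[] rather than tab keeps the result a data frame instead of a plain list.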

UPDATE

For folks who have HTML in a different format and may not feel up to the strain of posting, if you had, say:

doc <- read_html("<table>
  <tr>
    <td>ABC<br/></td>
    <td>DEF<br/></td>
    <td>GHI<br/></td>
  </tr>
  <tr>
    <td>JKL<br/>MNO<br/></td>
    <td>PQR<br/></td>
    <td>STU<br/></td>
  </tr>
</table>")

the aforementioned solution won't work because it's not the same data the OP had. I know…it's shocking.
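
The reason is easier to see on bare strings. This is only a sketch: the two inputs are assumed approximations of the cell text each layout produces (the compact one matching the "JKLMNO" the question reports), not output captured from html_table():

```r
pattern <- "[\r\n][[:space:]]+"

indented <- "JKL\n      MNO"  # assumed: values on separate, indented lines
compact  <- "JKLMNO"          # assumed: values separated only by <br/>

gsub(pattern, ",", indented)  # the line break is there to be replaced
grepl(pattern, compact)       # no line break, so nothing matches
```

With no newline left in the text, there is nothing for the regular expression to turn into a comma, so the separator has to be recovered from the HTML nodes instead.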

If that is the case, copying and pasting a solution is definitely easier than typing a new question and you can use the following:

library(rvest)
library(dplyr)
library(purrr)

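# For each column, collect that column's cells; within each cell, join the
# text nodes (the pieces between the <br/> tags) with commas.
map(1:3, function(col) {
  html_nodes(doc, xpath=sprintf(".//tr/td[%d]", col)) %>% 
    map_chr(~paste0(html_text(html_nodes(., xpath=".//text()")), collapse=","))
}) %>% 
  set_names(sprintf("X%d", 1:3)) %>% 
  as_tibble()  # as_data_frame() is deprecated in current tibble/dplyr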

But, amazingly enough, if you had different tags and data in the TD tags or had to work with a more complex table structure, this solution would likely require adaptation as well. The mind boggles.
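
As one example of such an adaptation, here is a hedged sketch that walks the table row by row, so the column count does not have to be hard-coded as 1:3. It is not the approach above, just a variation on it: cell_text is an illustrative helper name, and the code assumes every row has the same number of td cells and no th, rowspan or colspan to deal with. Because it trims each text node and drops the empty ones, it should cope with both the indented and the compact layout:

```r
library(rvest)

doc <- read_html("<table>
  <tr><td>ABC<br/></td><td>DEF<br/></td><td>GHI<br/></td></tr>
  <tr><td>JKL<br/>MNO<br/></td><td>PQR<br/></td><td>STU<br/></td></tr>
</table>")

# Join the non-empty text nodes of a single cell with commas
cell_text <- function(cell) {
  pieces <- trimws(html_text(html_nodes(cell, xpath = ".//text()")))
  paste(pieces[nzchar(pieces)], collapse = ",")
}

# One character vector per row, however many cells the row has
rows <- lapply(html_nodes(doc, xpath = ".//tr"), function(row) {
  vapply(html_nodes(row, xpath = "./td"), cell_text, character(1))
})

# Assumes equal-length rows; ragged rows would need padding first
tab <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
tab
```

The columns come back with the default names V1, V2, V3; rename them with names(tab) <- ... if the X1-style names matter.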
