count count - 4 months ago 23
R Question

Apache POI in R

I am trying to read xlsx-files into R and extract Excel formulas. It appears that Apache POI is the right tool for the Job, however I cannot get it to work. I found this Page which lists the POI components and their dependecies. I tried the following code:


inputStream <- .jnew("java/io/FileInputStream", path.expand(file.path))

xfile <- .jnew("org/apache/poi/xssf/eventusermodel/XSSFWorkbook",
wext <- .jnew("org/apache/poi/xssf/extractor/XSSFExcelExtractor", xfile)

text <- .jcall(wext, "Ljava/lang/String;", "getText")

which results in an
error. Can someone point me in the right direction?







doc <- read_xlsx(system.file("extdata/wb.xlsx", package="xlsxtractr"))

extract_formulas(doc, 1)
## # A tibble: 3 × 3
##   sheet  cell          f
##   <dbl> <chr>      <chr>
## 1     1    A4 SUM(A1:A3)
## 2     1    B4 SUM(B1:B3)
## 3     1    D4 SUM(A4:B4)

Extract all formula from all sheets with formulas:

purrr::map_df(seq_along(doc), ~extract_formulas(doc, .))

It only extracts formlua, but now it's a base for other functionality that may be missing from other (wait for it…) excellent packages.

This could easily be adapted into a small package or function to take in a path to an xlsx file and extract the formulas from it:


# need to write code to do the unzipping and also to work with all the
# sheets from the xlsx file.

sheet <- read_xml("~/dir/wb/xl/worksheets/sheet1.xml")
ns <- xml_ns_rename(xml_ns(sheet), d1 = "x")
xml_find_all(sheet, ".//x:row", ns) %>% 
  map_df(function(row) {
    xml_find_all(row, ".//x:c", ns) %>% 
      map_df(function(col) {
        xml_find_all(col, ".//x:f", ns) %>% 
          xml_text() -> f
        if (length(f) > 0) {
          data_frame(cell=xml_attr(col, "r"), f=f)
        } else {
## # A tibble: 2 × 2
##    cell                     f
##   <chr>                 <chr>
## 1    B2            SUM(A1:A3)
## 2    C2 SUM(A1:A3)*SUM(A1:A3)

If you have xls files this won't work, though.