count count - 2 months ago 17
R Question

Apache POI in R

I am trying to read xlsx-files into R and extract Excel formulas. It appears that Apache POI is the right tool for the Job, however I cannot get it to work. I found this Page which lists the POI components and their dependecies. I tried the following code:

require(rJava)
.jinit()
.jaddClassPath("poi-3.11-20141221.jar")
.jaddClassPath("poi-ooxml-3.11-20141221.jar")
.jaddClassPath("poi-ooxml-schemas-3.11-20141221.jar")
.jaddClassPath("xmlbeans-2.6.0.jar")

inputStream <- .jnew("java/io/FileInputStream", path.expand(file.path))

xfile <- .jnew("org/apache/poi/xssf/eventusermodel/XSSFWorkbook",
.jcast(inputStream,"java/io/InputStream"))
wext <- .jnew("org/apache/poi/xssf/extractor/XSSFExcelExtractor", xfile)

text <- .jcall(wext, "Ljava/lang/String;", "getText")


which results in an
java.lang.ClassNotFoundException
error. Can someone point me in the right direction?

Answer

UPDATE

devtools::install_git("https://gitlab.com/hrbrmstr/xlsxtractr.git")

or

devtools::install_github("hrbrmstr/xlsxtractr")

Then:

doc <- read_xlsx(system.file("extdata/wb.xlsx", package="xlsxtractr"))

extract_formulas(doc, 1)
## # A tibble: 3 × 3
##   sheet  cell          f
##   <dbl> <chr>      <chr>
## 1     1    A4 SUM(A1:A3)
## 2     1    B4 SUM(B1:B3)
## 3     1    D4 SUM(A4:B4)

Extract all formula from all sheets with formulas:

purrr::map_df(seq_along(doc), ~extract_formulas(doc, .))

It only extracts formlua, but now it's a base for other functionality that may be missing from other (wait for it…) excellent packages.


This could easily be adapted into a small package or function to take in a path to an xlsx file and extract the formulas from it:

library(xml2)
library(purrr)

# need to write code to do the unzipping and also to work with all the
# sheets from the xlsx file.

sheet <- read_xml("~/dir/wb/xl/worksheets/sheet1.xml")
ns <- xml_ns_rename(xml_ns(sheet), d1 = "x")
xml_find_all(sheet, ".//x:row", ns) %>% 
  map_df(function(row) {
    xml_find_all(row, ".//x:c", ns) %>% 
      map_df(function(col) {
        xml_find_all(col, ".//x:f", ns) %>% 
          xml_text() -> f
        if (length(f) > 0) {
          data_frame(cell=xml_attr(col, "r"), f=f)
        } else {
          NULL
        }
      })
  })
## # A tibble: 2 × 2
##    cell                     f
##   <chr>                 <chr>
## 1    B2            SUM(A1:A3)
## 2    C2 SUM(A1:A3)*SUM(A1:A3)

If you have xls files this won't work, though.

Comments