Rappster - 2 months ago
R Question

Memory leak when using package XML on Windows

Having read "Memory leaks parsing XML in r" (including linked posts) and this post on R Help, and given that some time has passed since then, I still think this is an unresolved issue that deserves attention, as the package is widely used throughout the R universe.

Thus please consider this a follow-up post and/or reference, with a hopefully informative yet concise illustration of the problem.

Issue



Parsing XML/HTML documents so that they can be searched with XPath afterwards requires the internal use of C pointers (AFAIU). It seems that, at least on MS Windows (I'm running Windows 8.1, 64-bit), these references are not properly recognized by the garbage collector. Consumed memory is therefore not properly released, which at some point freezes the R process.
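For readers unfamiliar with the mechanism: R's garbage collector only tracks R objects. Memory allocated in C is normally released by a finalizer attached to an R-level handle, which runs once that handle becomes unreachable. The following is a minimal base-R sketch of that mechanism only. The names on_collect and make_handle are made up for illustration, and it is an assumption here that the XML package relies on finalizers of this kind.

```r
# State flag flipped by the finalizer so we can observe that it ran.
finalizer_ran <- FALSE

# Hypothetical finalizer: in a C-backed package this is where the
# C-level memory behind the handle would be released.
on_collect <- function(e) {
  finalizer_ran <<- TRUE
}

# Hypothetical constructor: registers the finalizer on its own call
# environment, which becomes unreachable once the function returns.
make_handle <- function() {
  reg.finalizer(environment(), on_collect)
  invisible(NULL)
}

make_handle()
invisible(gc())   # a full collection, after which pending finalizers run
finalizer_ran
```

If a finalizer never fires, or fires but does not release everything allocated in C, R's own accounting looks fine while the OS-level footprint keeps growing.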

Central findings so far



To me it seems that XML::free and/or gc do not recognize all memory involved when parsing XML/HTML docs via xmlParse or htmlParse and subsequently processing them with xpathApply or the like:

The reported memory usage of the OS task (Rterm.exe) adds up significantly fast, while the reported memory of the R process as "seen from within R" (function memory.size) increases only moderately in comparison. See list elements mem_r, mem_os and ratio before and after a substantial parsing cycle below.

All in all, even after throwing in everything that has been recommended (free, rm and gc), memory usage still always increases when xmlParse and the like are called. It's just a question of how much. So IMHO there must still be something that's not working correctly.




Illustration



I borrowed the profiling code from Duncan's Omegahat git repository.

Some preparations:

Sys.setenv("LANGUAGE"="en")
require("compiler")
require("XML")

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] compiler stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] XML_3.98-1.1


Functions we need:

getTaskMemoryByPid <- cmpfun(function(
  pid=Sys.getpid()
) {
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
  mem
}, options=list(suppressAll=TRUE))

memoryLeak <- cmpfun(function(
  x=system.file("exampleData", "mtcars.xml", package="XML"),
  n=10000,
  use_text=FALSE,
  xpath=FALSE,
  free_doc=FALSE,
  clean_up=FALSE,
  detailed=FALSE
) {
  if(use_text) {
    x <- readLines(x)
  }
  ## Before //
  mem_os <- getTaskMemoryByPid()
  mem_r <- memory.size()
  prof_1 <- memory.profile()
  mem_before <- list(mem_r=mem_r,
    mem_os=mem_os, ratio=mem_os/mem_r)

  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- xmlParse(x, asText=use_text)
    if (xpath) {
      res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile=memory.profile(),
        size=memory.size()
      )
    }
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  }

  ## Garbage collect //
  mem_gc <- NULL
  if(clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
  }

  ## After //
  mem_os <- getTaskMemoryByPid()
  mem_r <- memory.size()
  prof_2 <- memory.profile()
  mem_after <- list(mem_r=mem_r,
    mem_os=mem_os, ratio=mem_os/mem_r)
  list(
    before=mem_before,
    perrun=mem_perrun,
    gc=mem_gc,
    after=mem_after,
    comparison_r=data.frame(
      before=prof_1,
      after=prof_2,
      increase=round((prof_2/prof_1)-1, 4)
    ),
    increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
    increase_os=(mem_after$mem_os/mem_before$mem_os)-1
  )
}, options=list(suppressAll=TRUE))
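One caveat about getTaskMemoryByPid as written: the gsub pattern strips dots, whitespace and the trailing K, which fits the German number format in use on this machine (e.g. "51.072 K"). Where tasklist prints a comma as thousands separator, the conversion would yield NA. A locale-tolerant variant could look like the sketch below. parse_tasklist_mem is a hypothetical helper, and the sample strings are made up to resemble tasklist output rather than captured from a real run.

```r
# Conversion as used in getTaskMemoryByPid above.
parse_original <- function(x) {
  as.numeric(gsub("\\.|\\s|K", "", x)) / 1000
}

# Hypothetical variant: drop everything that is not a digit, so both
# "." and "," thousands separators are handled.
parse_tasklist_mem <- function(x) {
  as.numeric(gsub("[^0-9]", "", x)) / 1000
}

german  <- "51.072 K"   # made-up sample, German number format
english <- "51,072 K"   # made-up sample, English number format

parse_original(german)
suppressWarnings(parse_original(english))  # comma survives, so as.numeric() gives NA
parse_tasklist_mem(german)
parse_tasklist_mem(english)
```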





Results



Scenario 1



Quick facts: garbage collection enabled, XML doc is parsed n times but not searched via xpathApply.

Notice the ratios of OS memory vs. R memory:

Before: 1.364832
After: 1.322702


res <- memoryLeak(clean_up=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-1.rdata"))

> res
$before
$before$mem_r
[1] 37.42

$before$mem_os
[1] 51.072

$before$ratio
[1] 1.364832


$perrun
NULL

$gc
$gc$gc_mb
[1] 45


$after
$after$mem_r
[1] 63.21

$after$mem_os
[1] 83.608

$after$ratio
[1] 1.322702


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7387   7392   0.0007
pairlist    190383 390633   1.0518
closure       5077  55085   9.8499
environment   1032  51032  48.4496
promise       5226 105226  19.1351
language     54675  54791   0.0021
special         44     44   0.0000
builtin        648    648   0.0000
char          8746   8763   0.0019
logical       9081   9084   0.0003
integer      22804  22807   0.0001
double        2773   2783   0.0036
complex          1      1   0.0000
character    44522  94569   1.1241
...              0      0      NaN
any              0      0      NaN
list         19946  19951   0.0003
expression       1      1   0.0000
bytecode     16049  16050   0.0001
externalptr   1487   1487   0.0000
weakref        391    391   0.0000
raw            392    392   0.0000
S4            1392   1392   0.0000

$increase_r
[1] 0.6892036

$increase_os
[1] 0.6370614
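The comparison_r table is easier to read as growth per parsing iteration. The sketch below copies the five rows that changed most in the Scenario 1 output and divides the difference by n = 50000. It only restates the reported numbers and makes no claim about what is holding on to these objects.

```r
n <- 50000  # number of parsing iterations in Scenario 1

# Object counts copied from $comparison_r of Scenario 1.
before <- c(pairlist = 190383, closure = 5077, environment = 1032,
            promise = 5226, character = 44522)
after  <- c(pairlist = 390633, closure = 55085, environment = 51032,
            promise = 105226, character = 94569)

# Additional objects still alive after gc(), per parsed document.
per_iteration <- (after - before) / n
round(per_iteration, 3)
```

The counts grow by almost exactly one environment, one closure, two promises, four pairlists and one character vector per parsed document, even though gc() was called before the second profile was taken.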


Scenario 2



Quick facts: garbage collection enabled, free is explicitly called, XML doc is parsed n times but not searched via xpathApply.

Notice the ratios of OS memory vs. R memory:

Before: 1.315249
After: 1.222143


res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))
> res

$before
$before$mem_r
[1] 63.48

$before$mem_os
[1] 83.492

$before$ratio
[1] 1.315249


$perrun
NULL

$gc
$gc$gc_mb
[1] 69.3


$after
$after$mem_r
[1] 95.92

$after$mem_os
[1] 117.228

$after$ratio
[1] 1.222143


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7454   7454   0.0000
pairlist    392455 592466   0.5096
closure      55104 105104   0.9074
environment  51032 101032   0.9798
promise     105226 205226   0.9503
language     55592  55592   0.0000
special         44     44   0.0000
builtin        648    648   0.0000
char          8847   8848   0.0001
logical       9141   9141   0.0000
integer      23109  23111   0.0001
double        2802   2807   0.0018
complex          1      1   0.0000
character    94775 144781   0.5276
...              0      0      NaN
any              0      0      NaN
list         20174  20177   0.0001
expression       1      1   0.0000
bytecode     16265  16265   0.0000
externalptr   1488   1487  -0.0007
weakref        392    391  -0.0026
raw            393    392  -0.0025
S4            1392   1392   0.0000

$increase_r
[1] 0.5110271

$increase_os
[1] 0.4040627


Scenario 3



Quick facts: garbage collection enabled, free is explicitly called, XML doc is parsed n times and searched via xpathApply each time.

Notice the ratios of OS memory vs. R memory:

Before: 1.220429
After: 13.15629 (!)

res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, xpath=TRUE, n=50000)
save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))
> res
$before
$before$mem_r
[1] 95.94

$before$mem_os
[1] 117.088

$before$ratio
[1] 1.220429


$perrun
NULL

$gc
$gc$gc_mb
[1] 93.4


$after
$after$mem_r
[1] 124.64

$after$mem_os
[1] 1639.8

$after$ratio
[1] 13.15629


$comparison_r
            before  after increase
NULL             1      1   0.0000
symbol        7454   7460   0.0008
pairlist    592458 793042   0.3386
closure     105104 155110   0.4758
environment 101032 151032   0.4949
promise     205226 305226   0.4873
language     55592  55882   0.0052
special         44     44   0.0000
builtin        648    648   0.0000
char          8847   8867   0.0023
logical       9142   9162   0.0022
integer      23109  23112   0.0001
double        2802   2832   0.0107
complex          1      1   0.0000
character   144775 194819   0.3457
...              0      0      NaN
any              0      0      NaN
list         20174  20177   0.0001
expression       1      1   0.0000
bytecode     16265  16265   0.0000
externalptr   1488   1487  -0.0007
weakref        392    391  -0.0026
raw            393    392  -0.0025
S4            1392   1392   0.0000

$increase_r
[1] 0.2991453

$increase_os
[1] 13.00485
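To put the three scenarios side by side, the reported before/after values can be converted into growth per parsing iteration. This is plain arithmetic on the numbers printed above, assuming n = 50000 in each run and that mem_os is in thousands of the K units reported by tasklist, as computed in getTaskMemoryByPid.

```r
n <- 50000  # parsing iterations per scenario

# Values copied from $before and $after of the three result listings.
scenarios <- data.frame(
  scenario  = 1:3,
  r_before  = c(37.42, 63.48, 95.94),
  r_after   = c(63.21, 95.92, 124.64),
  os_before = c(51.072, 83.492, 117.088),
  os_after  = c(83.608, 117.228, 1639.8)
)

# Total growth as seen by R (memory.size) and by the OS (tasklist).
scenarios$r_growth  <- scenarios$r_after - scenarios$r_before
scenarios$os_growth <- scenarios$os_after - scenarios$os_before

# OS-level growth per parsed document, back in tasklist's K units.
scenarios$os_k_per_run <- scenarios$os_growth * 1000 / n

# Same ratio as reported in $after$ratio.
scenarios$ratio_after <- scenarios$os_after / scenarios$r_after

scenarios[, c("scenario", "r_growth", "os_growth", "os_k_per_run", "ratio_after")]
```

With xpathApply in the loop (Scenario 3), OS-level growth works out to roughly 30 K per iteration, compared with well under 1 K in the other two scenarios, while growth as seen by R stays in the same range throughout.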





I also tried different versions. Well, I tried to try ;-)

From source, from omegahat.org



FYI: the latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr from source worked just fine).

> install.packages("XML", repos="http://www.omegahat.org/R", type="source")
trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
opened URL
downloaded 1.5 Mb

* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'

The downloaded source packages are in
'C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\downloaded_packages'
Warning messages:
1: running command '"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" CMD INSTALL -l "R:\home\apps\lsqmapps\apps\r\R-3.1.0\library" C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/downloaded_packages/XML_3.98-1.tar.gz' had status 1
2: In install.packages("XML", repos = "http://www.omegahat.org/R", :
installation of package 'XML' had non-zero exit status


Github



I did not follow the recommendations in the README on the GitHub repo, as it points to this directory that only contains a tar.gz of version 3.94-0 (while we're at 3.98-1.1 on CRAN).

Even though it is stated that the GitHub repo is not in a standard R package structure, I tried it anyway with install_github - and failed ;-)

require("devtools")
> install_github(repo="XML", username="omegahat")
Installing github repo XML/master from omegahat
Downloading master.zip from https://github.com/omegahat/XML/archive/master.zip
Installing package from C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/master.zip
Installing XML
"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" --vanilla CMD INSTALL \
"C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\devtools15c82d7c2b4c\XML-master" \
--library="R:/home/apps/lsqmapps/apps/r/R-3.1.0/library" --with-keep.source \
--install-tests

* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
Error: Command failed (1)

Answer

Hadley Wickham has written a library for XML parsing, xml2, which can be found on GitHub at https://github.com/hadley/xml2. It is still in its infancy (only a couple of months old!), has a few quirks, and is restricted to reading rather than writing XML. I've been experimenting with it for parsing, though, and it looks like it will do the job without the memory leaks of the XML package! It provides functions including:

  • read_xml() to read an XML file
  • xml_children() to get the child nodes of a node
  • xml_text() to get the text within a tag
  • xml_attrs() to get a character vector of the attributes and values of a node, that can be cast to a named list with as.list()

Note that you still need to rm() the XML node objects once you're done with them and force a garbage collection with gc(), but the memory then actually does get released to the OS (disclaimer: only tested on Windows 7, but this seems to be the most 'memory leaky' platform anyway).
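A minimal sketch of how these functions fit together, assuming xml2 is installed. The inline XML snippet is made up for illustration and only loosely modelled on the mtcars example used in the question.

```r
library(xml2)

# Made-up sample document; read_xml() also accepts a file path or URL.
doc <- read_xml(
  "<cars>
     <car name='Mazda RX4' cyl='6'>21.0</car>
     <car name='Datsun 710' cyl='4'>22.8</car>
   </cars>"
)

cars  <- xml_children(doc)                # child nodes of the root
mpg   <- xml_text(cars)                   # text inside each <car> tag
attrs <- as.list(xml_attrs(cars[[1]]))    # attributes of the first node as a named list

mpg
attrs

# Drop the node objects and force a collection, as recommended above.
rm(doc, cars)
invisible(gc())
```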

Hope this helps someone!
