Daniel Falbel - 1 month ago
R Question

Read and parse an XML file in chunks in R

I'm trying to read and process a ~5.8GB .xml file from Wikipedia Dumps using R. I don't have much RAM, so I would like to process it in chunks. (Currently, calling xml2::read_xml on it blocks my computer completely.)

The file contains one XML element for each Wikipedia page, like this:

<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
<id>631144794</id>
<parentid>381202555</parentid>
<timestamp>2014-10-26T04:50:23Z</timestamp>
<contributor>
<username>Paine Ellsworth</username>
<id>9092818</id>
</contributor>
<comment>add [[WP:RCAT|rcat]]s</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]

{{Redr|move|from CamelCase|up}}</text>
<sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
</revision>
</page>


A sample of the file can be found here

I think it should be possible to read it in chunks, something like page by page, and save each processed page element as a line in a .csv file.

I would like to have a data.frame with the following columns: id, title and text.

How can I read this .xml file in chunks?

Answer

It can be improved, but the main idea is here. You still need to decide how many lines to read in each iteration of readLines(), and how to parse each chunk, but this is a way to get the chunks:

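# Read only the first 2000 lines; adjust n to your memory budget
xml <- readLines("ptwiki-20161101-pages-articles.xml", n = 2000)

# Line numbers where each page element opens and closes
inicio <- grep(pattern = "<page>", x = xml, fixed = TRUE)
fim <- grep(pattern = "</page>", x = xml, fixed = TRUE)
if (length(inicio) > length(fim)) { # more beginnings than ends: the last page is cut off
  inicio <- inicio[-length(inicio)] # drop the incomplete one
}

# Collect the lines of each complete page
chunks <- vector("list", length(inicio))

for (i in seq_along(chunks)) {
  chunks[[i]] <- xml[inicio[i]:fim[i]]
}

# Collapse each page into a single string
chunks <- vapply(chunks, paste, character(1), collapse = " ")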
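
The snippet above only looks at the first 2000 lines. To walk through the whole file, one option is to keep a connection open, call readLines() on it repeatedly, and carry the lines of any unfinished page over to the next read. The following is a minimal sketch of that idea, not a tuned implementation: for_each_page and handle_page are hypothetical names, and it assumes that <page> and </page> each sit on their own line, as in the sample from the question.

```r
# Hypothetical helper: calls handle_page() once per complete <page> element.
# Assumes <page> and </page> appear on separate lines and never nest.
for_each_page <- function(path, handle_page, n = 2000) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  carry <- character(0) # lines of a page that has started but not ended yet
  repeat {
    new_lines <- readLines(con, n = n, warn = FALSE)
    if (length(new_lines) == 0) break # end of file
    lines <- c(carry, new_lines)
    starts <- grep("<page>", lines, fixed = TRUE)
    ends <- grep("</page>", lines, fixed = TRUE)
    n_done <- length(ends) # every closing tag completes one page
    for (i in seq_len(n_done)) {
      handle_page(paste(lines[starts[i]:ends[i]], collapse = "\n"))
    }
    # Keep an opened-but-unclosed page for the next iteration
    carry <- if (length(starts) > n_done) {
      lines[starts[n_done + 1]:length(lines)]
    } else {
      character(0)
    }
  }
  invisible(NULL)
}

# Small demonstration on a temporary file (made-up content)
demo_path <- tempfile(fileext = ".xml")
writeLines(c("<mediawiki>",
             "<page>", "<title>A</title>", "<id>1</id>", "</page>",
             "<page>", "<title>B</title>", "<id>2</id>", "</page>",
             "</mediawiki>"), demo_path)
pages <- character(0)
for_each_page(demo_path, function(p) pages <<- c(pages, p), n = 3)
```

With n = 3 the pages in the demo file straddle several reads, so the carry-over logic is exercised. Memory use is bounded by n plus the length of the longest single page.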

I tried read_xml(chunks[1]) %>% xml_nodes("text") %>% xml_text() and it worked. (xml_nodes() comes from rvest; with xml2 alone, xml_find_all(".//text") should be the equivalent.)
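
To get the id, title and text columns the question asks for, each page string can be parsed on its own and appended to a .csv file. The sketch below assumes the xml2 package is installed; page_to_row and append_row are hypothetical helper names, and the XPath expressions assume the page layout shown in the question.

```r
library(xml2)

# Hypothetical helper: turn one <page>...</page> string into a one-row data.frame
page_to_row <- function(page_xml) {
  doc <- read_xml(page_xml)
  data.frame(
    id = xml_text(xml_find_first(doc, "/page/id")),
    title = xml_text(xml_find_first(doc, "/page/title")),
    text = xml_text(xml_find_first(doc, "/page/revision/text")),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: append a row to a CSV, writing the header only once
append_row <- function(row, csv_path) {
  has_file <- file.exists(csv_path)
  write.table(row, csv_path, sep = ",", qmethod = "double",
              row.names = FALSE, col.names = !has_file, append = has_file)
}

# Demonstration with a shortened version of the page from the question
page_xml <- paste(
  "<page>",
  "<title>AccessibleComputing</title>",
  "<ns>0</ns>",
  "<id>10</id>",
  "<revision>",
  "<id>631144794</id>",
  "<text xml:space=\"preserve\">#REDIRECT [[Computer accessibility]]</text>",
  "</revision>",
  "</page>",
  sep = "\n"
)
row <- page_to_row(page_xml)
csv_path <- tempfile(fileext = ".csv")
append_row(row, csv_path)
```

Because the rows are written one at a time, the full data.frame never has to sit in memory. Note that article text usually contains line breaks, so a page will not necessarily occupy a single physical line in the .csv file; the field is quoted, which readers such as read.csv are expected to handle.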