Daniel Falbel - 9 months ago
R Question

Read and parse an XML file in chunks in R

I'm trying to read and process a ~5.8GB XML file from Wikipedia Dumps using R. I don't have much RAM, so I would like to process it in chunks. (Currently, trying to read the whole file at once blocks my computer completely.)

The file contains one <page> element for each Wikipedia page, like this:

<redirect title="Computer accessibility" />
<username>Paine Ellsworth</username>
<comment>add [[WP:RCAT|rcat]]s</comment>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]

{{Redr|move|from CamelCase|up}}</text>

A sample of the file can be found here.

I think it's possible to read it in chunks, something like one page at a time, and save each processed <page> element as a row in a data.frame.

I would like to have a data.frame with the following columns: id, title, and text.
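For illustration, the desired result would look something like this (the values here are invented placeholders, not real dump contents):

```r
# Hypothetical sketch of the target shape: one row per <page> element,
# with the three columns named in the question.
wiki <- data.frame(
  id    = c(10L, 11L),
  title = c("Computer accessibility", "Some other page"),
  text  = c("#REDIRECT [[Computer accessibility]]", "..."),
  stringsAsFactors = FALSE
)
str(wiki)
```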

How can I read this file in chunks?

Answer

This can be improved, but the main idea is here. You still need to decide how many lines to read in each iteration of readLines() and how to parse each chunk, but a solution for getting the chunks is here:

xml <- readLines("ptwiki-20161101-pages-articles.xml", n = 2000)

inicio <- grep(pattern = "<page>", x = xml)  # lines where a page starts
fim <- grep(pattern = "</page>", x = xml)    # lines where a page ends

if (length(inicio) > length(fim)) { # if you get more beginnings than ends
  inicio <- inicio[-length(inicio)] # drop the last (incomplete) one
}

chunks <- vector("list", length(inicio))

for (i in seq_along(chunks)) {
  chunks[[i]] <- xml[inicio[i]:fim[i]]
}

chunks <- sapply(chunks, paste, collapse = " ")

I've tried read_xml(chunks[1]) %>% xml_nodes("text") %>% xml_text() (with the xml2 and magrittr packages loaded) and it worked.
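To process the whole 5.8GB file rather than the first 2000 lines, the same idea can be wrapped in a loop over a file connection, reading a fixed number of lines per iteration. This is only a sketch in base R: the function name read_pages and the carry-over handling for a page that is cut off at a chunk boundary are my assumptions, not part of the answer above.

```r
# Sketch: stream the dump through a file connection in fixed-size line
# chunks, carrying over any incomplete trailing <page> to the next
# iteration, and return one string per complete <page> element.
read_pages <- function(path, chunk_lines = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  leftover <- character(0)  # lines of a page cut off at the chunk boundary
  pages <- character(0)

  repeat {
    lines <- readLines(con, n = chunk_lines)
    xml <- c(leftover, lines)
    if (length(xml) == 0) break

    inicio <- grep("<page>", xml, fixed = TRUE)   # beginnings
    fim    <- grep("</page>", xml, fixed = TRUE)  # ends
    n <- min(length(inicio), length(fim))         # complete pages in this chunk

    for (i in seq_len(n)) {
      pages <- c(pages, paste(xml[inicio[i]:fim[i]], collapse = " "))
    }

    # keep an incomplete trailing page, if any, for the next iteration
    leftover <- if (length(inicio) > n) xml[inicio[n + 1]:length(xml)] else character(0)
    if (length(lines) == 0) break  # end of file reached
  }
  pages
}
```

Each element of the result can then be parsed exactly as in the answer, e.g. read_xml(pages[1]) %>% xml_nodes("text") %>% xml_text(), and appended to a data.frame (or written to disk) one chunk at a time so the full file never sits in RAM.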