wake_wake - 8 months ago
R Question

In R, how to extract two values from an XML file, loop over 5603 files, and write to a table

As I am rather new to R, I am trying to learn how I can extract two values from an XML file and loop over 5603 other (small, <2kb) XML files in my working directory.

I have been reading a lot of topics on 'looping', but find this rather confusing - especially because it seems that looping over XML files is different from looping over other files, correct?

I am using NSF Award data from 1975 onwards (available on http://www.nsf.gov/awardsearch/download.jsp ). The XML structure is available on http://www.nsf.gov/awardsearch/resources/Award.xsd .

For each XML file I want to write the "ZipCode" and "AwardAmount" to a table.

Running the following code I did retrieve the ZipCode and AwardAmount, but only from the very first file. How can I write a proper loop and write it to a table?

for (i in 1:length(xmlfiles)){
doc= xmlTreeParse("xmlfiles[i]", useInternal=TRUE)

Does anyone have any suggestions?

Answer Source

This might work for you. I got rid of the for loop and went with sapply.

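library(XML)  # provides xmlTreeParse and xmlValue
xmlfiles <- list.files(pattern = "\\.xml$")  # pattern is a regular expression, not a glob
txtfiles <- sub("\\.xml$", ".txt", xmlfiles)  # swap only the file extension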

txtfiles holds the new file names, one output file for each XML file.

sapply(seq_along(xmlfiles), function(i){

  # Parse one file and take the first ZipCode and AwardAmount found via XPath
  doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
  zipcode <- xmlValue(doc[["//ZipCode"]])
  amount <- xmlValue(doc[["//AwardAmount"]])
  DF <- data.frame(zip = zipcode, amount = amount)
  # Write a one-row table to the matching .txt file
  write.table(DF, quote = FALSE, row.names = FALSE, file = txtfiles[i])

})

Please let me know if there are any issues when you run it.
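
Since the question asks for a table, here is a minimal sketch of how the per-file results might be collected into one data frame and written once, instead of producing one text file per award. The helper names extract_award and collect_awards and the output file name awards.txt are my own, hypothetical choices and not part of the XML package. The sketch assumes the first ZipCode and the first AwardAmount in each file are the ones you want, and it records NA when a file has neither. Note also that the loop in the question passes the literal string "xmlfiles[i]" because of the quotes; the file name has to be passed unquoted, as below.

```r
library(XML)

# Hypothetical helper: returns a one-row data frame for a single award file.
extract_award <- function(path) {
  doc <- xmlParse(path)
  # xpathSApply returns an empty list when the XPath matches nothing
  zip <- xpathSApply(doc, "//ZipCode", xmlValue)
  amount <- xpathSApply(doc, "//AwardAmount", xmlValue)
  data.frame(
    file = basename(path),
    # keep the ZIP code as text so leading zeros are not lost
    zip = if (length(zip)) zip[1] else NA_character_,
    amount = if (length(amount)) as.numeric(amount[1]) else NA_real_,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: stacks the one-row results into a single table.
collect_awards <- function(paths) {
  do.call(rbind, lapply(paths, extract_award))
}

# Usage (assumes the XML files sit in the working directory):
# xmlfiles <- list.files(pattern = "\\.xml$")
# awards <- collect_awards(xmlfiles)
# write.table(awards, "awards.txt", quote = FALSE, row.names = FALSE, sep = "\t")
```

With 5603 small files, rbind over a list should be fast enough; if it ever becomes slow, that step is the one to swap out.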