achimneyswallow - 1 year ago
Scala Question

Reading gzipped XML in scala

When I tried to read an xml.gz file in Scala, I received the following error:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:701)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:567)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1896)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1761)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1799)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:156)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
at scala.xml.factory.XMLLoader$class.loadXML(XMLLoader.scala:41)
at scala.xml.XML$.loadXML(XML.scala:60)
at scala.xml.factory.XMLLoader$class.loadFile(XMLLoader.scala:50)
at scala.xml.X


I have the following code:

import scala.xml.XML
val xml = XML.loadFile("/home/vagrant/miniprojects/spark/allVotes/part-00380.xml.gz")


I have more than 2,000 xml.gz files to read in. What would be an efficient solution to this? Thank you very much!!

Answer Source

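A .xml.gz file is not XML at the outer layer; it is gzip-compressed data, which is why the parser rejects the first bytes as invalid UTF-8. Wrap the file stream in a GZIPInputStream so the content is decompressed as it is read: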

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.xml.XML

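// Decompress the gzip layer on the fly and hand the resulting stream to the XML parser.
def loadXmlGz(filename: String): scala.xml.Elem = {
  val in = new GZIPInputStream(new FileInputStream(filename))
  try XML.load(in) finally in.close() // always release the file handle
}

val xml = loadXmlGz("/home/vagrant/miniprojects/spark/allVotes/part-00380.xml.gz")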
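
The question also mentions more than 2,000 files. One possible way to apply the same idea to a whole directory is sketched below. The object name LoadAllXmlGz and the method loadAll are hypothetical names introduced here for illustration, and the sketch assumes every file ending in .xml.gz in the directory is a well-formed gzipped XML document. It returns an iterator, so only one document needs to be parsed at a time.

```scala
import java.io.{File, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.xml.{Elem, XML}

// Hypothetical helper object; the names are illustrative, not from the original answer.
object LoadAllXmlGz {

  // Decompress one gzip file on the fly and parse it as XML.
  def loadXmlGz(file: File): Elem = {
    val in = new GZIPInputStream(new FileInputStream(file))
    try XML.load(in) finally in.close()
  }

  // Lazily load every *.xml.gz file directly inside `dir`, in file-name order.
  // listFiles() returns null when `dir` is not a readable directory, hence the Option.
  def loadAll(dir: File): Iterator[(String, Elem)] = {
    val files = Option(dir.listFiles()).getOrElse(Array.empty[File])
    files
      .filter(f => f.isFile && f.getName.endsWith(".xml.gz"))
      .sortBy(_.getName)
      .iterator
      .map(f => f.getName -> loadXmlGz(f))
  }
}
```

It could be used along these lines, with the directory taken from the question:

```scala
LoadAllXmlGz
  .loadAll(new java.io.File("/home/vagrant/miniprojects/spark/allVotes"))
  .foreach { case (name, doc) => println(s"$name: root element <${doc.label}>") }
```

Because the iterator is lazy, each file is opened, parsed, and closed only when its element is consumed. Whether this is efficient enough for 2,000 files depends on their size; if the files are being processed in Spark, as the path suggests, distributing the list of file names across workers may be a better fit.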