achimneyswallow achimneyswallow - 3 months ago 15
Scala Question

Reading gzipped XML in scala

As I attempted to read a xml.gz file into Scala, I received the following error:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:701)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:567)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1896)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1761)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1799)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:156)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:812)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
at scala.xml.factory.XMLLoader$class.loadXML(XMLLoader.scala:41)
at scala.xml.XML$.loadXML(XML.scala:60)
at scala.xml.factory.XMLLoader$class.loadFile(XMLLoader.scala:50)
at scala.xml.X


I have the following code:

import scala.xml.XML
val xml = XML.loadFile("/home/vagrant/miniprojects/spark/allVotes/part-00380.xml.gz")


I have more than 2,000 xml.gz files to read in. What would be an efficient solution to this? Thank you very much!!

Answer

.xml.gz is not XML at the outer layer -- it's gzip. Use a GZIPInputStream to decompress this as it's being read:

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.xml.XML

def loadXmlGz(filename : String) = {
  XML.load(new GZIPInputStream(new FileInputStream(new java.io.File(filename))))
}

var xml = loadXmlGz("/home/vagrant/miniprojects/spark/allVotes/part-00380.xml.gz")