mariop - 1 month ago
Scala Question

Using parboiled2 to parse multiple lines instead of a String

I would like to use parboiled2 to parse multiple CSV lines instead of a single CSV String. The result would be something like:

val parser = new CSVRecordParser(fieldSeparator)
io.Source.fromFile("my-file").getLines().map(line => parser.record.run(line))


where CSVRecordParser is my parboiled2 parser for CSV records. The problem is that, from what I've tried, I cannot do this because parboiled2 parsers require the input in the constructor, not in the run method. So I can either create a new parser for each line, which seems wasteful, or find a way to pass each new input to an existing parser. I tried hacking the parser a bit by making the input a variable and wrapping the parser in another object:

object CSVRecordParser {

  private object CSVRecordParserWrapper extends Parser with StringBuilding {

    val textBase = CharPredicate.Printable -- '"'
    val qTextData = textBase ++ "\r\n"

    var input: ParserInput = _
    var fieldDelimiter: Char = _

    def record = rule { zeroOrMore(field).separatedBy(fieldDelimiter) ~> (Seq[String] _) }
    def field = rule { quotedField | unquotedField }
    def quotedField = rule {
      '"' ~ clearSB() ~ zeroOrMore((qTextData | '"' ~ '"') ~ appendSB()) ~ '"' ~ ows ~ push(sb.toString)
    }
    def unquotedField = rule { capture(zeroOrMore(textData)) }
    def textData = textBase -- fieldDelimiter

    def ows = rule { zeroOrMore(' ') }
  }

  def parse(input: ParserInput, fieldDelimiter: Char): Result[Seq[String]] = {
    CSVRecordParserWrapper.input = input
    CSVRecordParserWrapper.fieldDelimiter = fieldDelimiter
    wrapTry(CSVRecordParserWrapper.record.run())
  }
}


and then just call
CSVRecordParser.parse(input, separator)
when I want to parse a line. Besides being horrible, this doesn't work: I often get strange errors that seem related to previous uses of the parser. I know this is not how a parser should be written with parboiled2, so I am wondering what the best way is to achieve this with the library.

Answer

I've done this for CSV files of over 1 million records, in a project that requires high speed and low resource usage, and I find that instantiating a new parser for each line works well.

I tried this approach after noticing that the parboiled2 README describes the parsers as extremely lightweight.

I have not even needed to increase the JVM memory or heap limits from their defaults. Instantiating a parser for each line works very well.
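
As a rough illustration of what per-line instantiation could look like, here is a minimal sketch that adapts the grammar from the question. It assumes parboiled2 2.x is on the classpath. The names CsvRecordParser and parseLine are hypothetical, and the use of oneOrMore plus a trailing EOI is an assumption borrowed from the usual parboiled2 CSV example rather than something stated in the question.

```scala
import org.parboiled2._
import scala.util.Try

object CsvRecordParser {
  // Character classes do not depend on the input, so they are built once and shared.
  private val textBase  = CharPredicate.Printable -- '"'
  private val qTextData = textBase ++ "\r\n"

  // Hypothetical helper: builds a fresh parser instance for this one line.
  def parseLine(line: String, fieldDelimiter: Char): Try[Seq[String]] =
    new CsvRecordParser(line, fieldDelimiter).record.run()
}

// parboiled2 expects `input` as a constructor value, so each line gets its own instance.
class CsvRecordParser(val input: ParserInput, fieldDelimiter: Char)
    extends Parser with StringBuilding {
  import CsvRecordParser._

  // Unquoted text may not contain the delimiter chosen for this instance.
  private val textData = textBase -- fieldDelimiter

  // Assumption: require the whole line to be consumed (EOI).
  def record = rule { oneOrMore(field).separatedBy(fieldDelimiter) ~ EOI }
  def field  = rule { quotedField | unquotedField }
  def quotedField = rule {
    '"' ~ clearSB() ~ zeroOrMore((qTextData | '"' ~ '"') ~ appendSB()) ~ '"' ~ ows ~ push(sb.toString)
  }
  def unquotedField = rule { capture(zeroOrMore(textData)) }
  def ows = rule { zeroOrMore(' ') }
}
```

With this in place, the loop from the question becomes io.Source.fromFile("my-file").getLines().map(line => CsvRecordParser.parseLine(line, ',')), which yields one Try per line. No mutable state is shared between lines, which avoids the kind of leftover-state errors described in the question.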