algui91 - 1 year ago
Scala Question

Idiomatic way to parse a file in Scala

I am parsing a file in Scala, and I have two kinds of files to read:

A set of train sentences, with this form:

// ...

And a set of test sentences, with this form:

// ...

So far I've used Either to differentiate between the two kinds of formats:

    def readDataSet(file: String): Option[Vector[LabeledSentence]] = {

      def getSentenceType(s: Array[String]) = s.length match {
        case 3 => Left((s(0), s(1), s(2).toInt))
        case 4 => Right((s(0), s(1), s(2), s(3).toInt))
        case _ => Right(("EOS", "EOS", "EOS", -1))
      }

      val filePath = getClass.getResource(file).getPath

      Manage(Source.fromFile(filePath)) { source =>

        val parsedTuples = source getLines() map (s => s.split("\t"))

        // ..........

        // Go through each token in the file and construct a sentence
        for (s <- parsedTuples) {
          getSentenceType(s) match {
            // When reaching the end of the sentence, save it
            case Right(("EOS", "EOS", "EOS", -1)) =>
              sentences += new LabeledSentence(lex.result(), po.result(), dep.result())
            // if (isTrain) gold.clear()
            case Left(x) =>
              lex += x._1
              po += x._2
              dep += x._3
            case Right(x) =>
              lex += x._1
              po += x._2
              gold += x._3
              dep += x._4
          }
        }

        // ..........
      }
    }

Is there a better, more idiomatic way to write this code?

I have removed the parts of the code that are not important for this question. If you want to see the complete code, check my GitHub page.

Answer Source

You don't need Either. Just always use a 4-tuple:

    .map {
      case Array(a, b, c, d) => Some((a, b, c, d.toInt))
      case Array(a, b, d) => Some((a, b, "", d.toInt))
      case _ => None
    }.foldLeft((List.empty[LabeledSentence], List.empty[String], List.empty[String], List.empty[String], List.empty[Int])) {
      // Boundary line: close the current sentence and reset the accumulators
      case ((l, lex, po, gold, dep), None) =>
         (new LabeledSentence(lex.reverse, po.reverse, gold.reverse, dep.reverse) :: l, List(), List(), List(), List())
      // Token line: prepend each field to its accumulator
      case ((l, lex, po, gold, dep), Some((a, b, c, d))) =>
         (l, a :: lex, b :: po, c :: gold, d :: dep)
    }

You could make the last step a lot more elegant, if you rethought your approach to the lex, po, gold, dep stuff (make it a case class and/or combine with the LabeledSentence perhaps?).
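One way the case-class idea might look is sketched below. The names Token, Sentence, parseLine and parse are hypothetical and not from the original code. Sentence stands in for LabeledSentence, whose real constructor is not shown here. Unlike the fold above, this sketch skips repeated boundary lines instead of emitting empty sentences, and it flushes a trailing sentence that has no boundary after it.

```scala
object TokenParsing {
  // Hypothetical token type: one row of the file.
  // `gold` is the empty string for 3-column rows.
  final case class Token(lex: String, po: String, gold: String, dep: Int)

  // Hypothetical stand-in for LabeledSentence that keeps whole tokens together.
  final case class Sentence(tokens: List[Token])

  // Parse one tab-separated line. Anything that is not 3 or 4 columns
  // counts as a sentence boundary (None). Note that toInt throws on a
  // non-numeric last column; this sketch does not handle that case.
  def parseLine(line: String): Option[Token] = line.split("\t") match {
    case Array(a, b, c, d) => Some(Token(a, b, c, d.toInt))
    case Array(a, b, d)    => Some(Token(a, b, "", d.toInt))
    case _                 => None
  }

  def parse(lines: Iterator[String]): List[Sentence] = {
    val (done, pending) =
      lines.map(parseLine).foldLeft((List.empty[Sentence], List.empty[Token])) {
        // Boundary with nothing collected yet: ignore it
        case ((sentences, Nil), None) => (sentences, Nil)
        // Boundary: close the current sentence
        case ((sentences, current), None) => (Sentence(current.reverse) :: sentences, Nil)
        // Token: prepend to the sentence being built
        case ((sentences, current), Some(token)) => (sentences, token :: current)
      }
    // Flush a trailing sentence that was not followed by a boundary line
    val all = if (pending.isEmpty) done else Sentence(pending.reverse) :: done
    all.reverse
  }
}
```

With a single accumulator of tokens, the four parallel lists cannot drift out of sync, and the fold state shrinks to a pair.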

And also, you gotta cut down on using mutable containers; they make it a lot harder to understand what's going on. This is not Java ...
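As an illustration of getting by without mutable builders, here is a minimal sketch that groups items into runs using only immutable lists. The helper name splitOn and the idea of passing a boundary predicate are assumptions made for this example, not part of the original code.

```scala
import scala.annotation.tailrec

object Grouping {
  // Split a list into runs separated by elements matching `isBoundary`.
  // Boundary elements are dropped, and so are empty runs.
  def splitOn[A](items: List[A])(isBoundary: A => Boolean): List[List[A]] = {
    @tailrec
    def loop(rest: List[A], acc: List[List[A]]): List[List[A]] =
      rest.dropWhile(isBoundary) match {
        case Nil => acc.reverse
        case nonEmpty =>
          // Take everything up to the next boundary as one run
          val (run, tail) = nonEmpty.span(a => !isBoundary(a))
          loop(tail, run :: acc)
      }
    loop(items, Nil)
  }
}
```

For example, `Grouping.splitOn(List("a", "b", "", "c"))(_.isEmpty)` gives `List(List("a", "b"), List("c"))`. Each run could then be mapped to a sentence in a separate step.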