User1232187 - 3 months ago
Scala Question

How to correctly parse a text file with a RegexParser?

I want to parse the following test data. My parser works for three of the cases, so I think the problem is in my regex: if a line starts with a # and also has a trailing comment that starts with a #, parsing stops working. Can someone explain why? Here's my solution so far:

val testDate =
"""
|127.0.0.1 ads234.com
|#127.0.0.1 auto.search.msn.com # Microsoft uses this server to redirect
|#127.0.0.1 sitefinder.verisign.com # Verisign has joined the game
|#127.0.0.1 sitefinder-idn.verisign.com # of trying to hijack mistyped
|#127.0.0.1 s0.2mdn.net # This may interfere with some streaming
|#127.0.0.1 ad.doubleclick.net # This may interfere with www.sears.com
|127.0.0.1 media.fastclick.net # Likewise, this may interfere with some
|127.0.0.1 cdn.fastclick.net
""".stripMargin


I want to keep the leading # and the comment, if there is one.

object Example extends RegexParsers {
def comment: Parser[String] = """#.*""".r
def url: Parser[String] = """[A-Za-z0-9-\.\_\-]{1,65}(?<!-)\.+[A-Za-z]{2,7}""".r
def localhost: Parser[String] = """\b(\d{1,3}\.){3}\d{1,3}\b""".r
def pound: Parser[String] = "#".r
def port: Parser[String] = """:\d{3}""".r

def urlPort = url | url <~ port

def pos1 = localhost ~ urlPort ^^ {
case _ ~ dns => LineParsed("", dns, "")
}
def pos2 = pound ~ localhost ~ urlPort ^^ {
case p ~ _ ~ dns => LineParsed(p, dns, "")
}
def pos3 = localhost ~ urlPort ~ comment ^^ {
case _ ~ dns ~ com => LineParsed("", dns, com)
}
def pos4 = pound ~ localhost ~ urlPort ~ comment ^^ {
case p ~ _ ~ dns ~ com => LineParsed(p, dns, com)
}

def linePos = pos1 | pos2 | pos3 | pos4

def fullLine = repsep(linePos, """\W*""".r)
}


I got the following exception:

#127.0.0.1 auto.search.msn.com # Microsoft uses this server to redirect

^
java.lang.RuntimeException: No result when parsing failed

Answer

There are a few mistakes in your code. First, newline characters count as whitespace by default and are skipped silently, but you need to "see" them to separate entries correctly. So you need to redefine whiteSpace:

import scala.util.matching.Regex

object Example extends RegexParsers {
   override protected val whiteSpace: Regex = "[ \t]+".r

The fullLine parser is then written as:

   //allow several empty lines at the beginning and between entries
   def fullLine = rep("\n") ~> repsep(linePos, rep1("\n")) 

(Another option would be to split the input into lines beforehand and parse each line individually.)
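The split-first alternative could look roughly like the sketch below. It assumes the scala-parser-combinators module is on the classpath; `PerLineExample`, `parseLines`, and the shape of `LineParsed` are illustrative names and guesses, since the question never shows the class, and the port handling is left out for brevity. Because each line is parsed on its own, the default whitespace handling can stay untouched.

```scala
import scala.util.parsing.combinator.RegexParsers

object PerLineExample extends RegexParsers {
  // Assumed shape of the result class; the question does not show its definition.
  final case class LineParsed(pound: String, dns: String, comment: String)

  def comment: Parser[String] = """#.*""".r
  def url: Parser[String] = """[A-Za-z0-9-\.\_\-]{1,65}(?<!-)\.+[A-Za-z]{2,7}""".r
  def localhost: Parser[String] = """\b(\d{1,3}\.){3}\d{1,3}\b""".r
  def pound: Parser[String] = "#".r

  def linePos: Parser[LineParsed] = pound.? ~ localhost ~ url ~ comment.? ^^ {
    case p ~ _ ~ dns ~ com => LineParsed(p.getOrElse(""), dns, com.getOrElse(""))
  }

  // Split first, drop blank lines, then parse every line as a complete input.
  // A bad line yields a Left with the parser message instead of throwing.
  def parseLines(text: String): List[Either[String, LineParsed]] =
    text.split("\n").toList.map(_.trim).filter(_.nonEmpty).map { line =>
      parseAll(linePos, line) match {
        case Success(result, _)   => Right(result)
        case noSuccess: NoSuccess => Left(s"cannot parse '$line': ${noSuccess.msg}")
      }
    }
}
```

Matching on the parse result rather than calling `.get` also avoids the "No result when parsing failed" exception from the question when a line does not match.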

The next mistake is the way you combine parsers with |. To parse A optionally followed by B, don't write A | A ~ B: the right-hand alternative is never tried, because once the left-hand A succeeds, | stops there and no B is read. Write A ~ B.? instead:

  def urlPort = url <~ port.?  // But anyway, you'll never have a port in a hosts file!
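The difference between the two formulations can be checked in isolation. The following is a minimal sketch with simplified, made-up `host` and `port` regexes (not the ones from the question), again assuming scala-parser-combinators is available:

```scala
import scala.util.parsing.combinator.RegexParsers

object AlternationDemo extends RegexParsers {
  // Deliberately simplified patterns, for illustration only.
  def host: Parser[String] = """[a-z.]+""".r
  def port: Parser[String] = """:\d+""".r

  // Ordered choice: the left branch already succeeds on the host part,
  // so the right branch (host followed by a port) is never tried.
  def broken: Parser[String] = host | host <~ port

  // Sequence with an optional tail: the port is consumed when present.
  def fixed: Parser[String] = host <~ port.?
}
```

With these definitions, `parseAll(broken, "example.com:8080")` fails because `:8080` is left unconsumed, while `parseAll(fixed, ...)` accepts the input both with and without the port.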

In the same way, the four cases pos1 | pos2 | pos3 | pos4 can be simplified considerably:

  def linePos = pound.? ~ localhost ~ urlPort ~ comment.? ^^ {
     case p ~ _ ~ dns ~ com => LineParsed(p.getOrElse(""), dns, com.getOrElse(""))
  }

You can see here how the ? combinator gives you back an Option for p and com. I use getOrElse to fit the structure of LineParsed and keep the original behaviour of your code, but a more idiomatic Scala approach would be to keep them as Option fields in the LineParsed class.
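Keeping the Option values could look something like this. The field names and the `OptionExample` object are assumptions made for the sketch, since the original LineParsed is not shown, and the port handling is omitted:

```scala
import scala.util.parsing.combinator.RegexParsers

object OptionExample extends RegexParsers {
  // Hypothetical variant of LineParsed that stores the optional parts as Option.
  final case class LineParsed(pound: Option[String], dns: String, comment: Option[String])

  def comment: Parser[String] = """#.*""".r
  def url: Parser[String] = """[A-Za-z0-9-\.\_\-]{1,65}(?<!-)\.+[A-Za-z]{2,7}""".r
  def localhost: Parser[String] = """\b(\d{1,3}\.){3}\d{1,3}\b""".r
  def pound: Parser[String] = "#".r

  // The Option values produced by .? are passed through unchanged,
  // so no getOrElse("") placeholder is needed.
  def linePos: Parser[LineParsed] = pound.? ~ localhost ~ url ~ comment.? ^^ {
    case p ~ _ ~ dns ~ com => LineParsed(p, dns, com)
  }
}
```

Callers can then tell "no comment" apart from "empty comment" by matching on `None` and `Some(...)` instead of comparing against the empty string.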

Here is the final working code that parses your example:

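import scala.util.matching.Regex
import scala.util.parsing.combinator.RegexParsers

// LineParsed is the result class from the question (its definition is not shown there).
object Example extends RegexParsers {
  // Skip only spaces and tabs, so that "\n" stays visible to the parsers.
  override protected val whiteSpace: Regex = "[ \t]+".r
  def comment: Parser[String] = """#.*""".r
  def url: Parser[String] = """[A-Za-z0-9-\.\_\-]{1,65}(?<!-)\.+[A-Za-z]{2,7}""".r
  def localhost: Parser[String] = """\b(\d{1,3}\.){3}\d{1,3}\b""".r
  def pound: Parser[String] = "#".r
  def port: Parser[String] = """:\d{3}""".r
  def urlPort = url <~ port.?

  // One entry: optional leading #, the IP address, the host name, optional trailing comment.
  def linePos = pound.? ~ localhost ~ urlPort ~ comment.? ^^ {
    case p ~ _ ~ dns ~ com => LineParsed(p.getOrElse(""), dns, com.getOrElse(""))
  }

  // Allow several empty lines at the beginning and between entries.
  def fullLine = rep("\n") ~> repsep(linePos, rep1("\n"))
}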
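To run a parser like this over a whole file, something along the following lines should work. It is a self-contained sketch: `HostsFileDemo`, `parseHosts`, and the LineParsed definition are assumptions for the example, and the port handling is dropped. One addition compared with the code above is the trailing `<~ rep("\n")`: since newlines are no longer whitespace, `parseAll` would otherwise reject input that ends with a newline, as the test data in the question does.

```scala
import scala.util.matching.Regex
import scala.util.parsing.combinator.RegexParsers

object HostsFileDemo extends RegexParsers {
  // Assumed definition; the question does not show LineParsed.
  final case class LineParsed(pound: String, dns: String, comment: String)

  // Skip only spaces and tabs, so that "\n" stays visible to the parsers.
  override protected val whiteSpace: Regex = "[ \t]+".r

  def comment: Parser[String] = """#.*""".r
  def url: Parser[String] = """[A-Za-z0-9-\.\_\-]{1,65}(?<!-)\.+[A-Za-z]{2,7}""".r
  def localhost: Parser[String] = """\b(\d{1,3}\.){3}\d{1,3}\b""".r
  def pound: Parser[String] = "#".r

  def linePos: Parser[LineParsed] = pound.? ~ localhost ~ url ~ comment.? ^^ {
    case p ~ _ ~ dns ~ com => LineParsed(p.getOrElse(""), dns, com.getOrElse(""))
  }

  // Leading, separating, and trailing newlines are all consumed explicitly.
  def fullLine: Parser[List[LineParsed]] =
    rep("\n") ~> repsep(linePos, rep1("\n")) <~ rep("\n")

  // Match on the result instead of calling .get, which throws on failure.
  def parseHosts(text: String): Either[String, List[LineParsed]] =
    parseAll(fullLine, text) match {
      case Success(entries, _)  => Right(entries)
      case noSuccess: NoSuccess => Left(noSuccess.msg)
    }
}
```

Calling `HostsFileDemo.parseHosts(testDate)` with the data from the question would then return either the list of parsed entries or the parser's error message.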