BarneyW - 19 days ago
Scala Question

How to best combine Regex groups and streams to search a file

I would like to scan through a file, finding only the first instance of a regular expression and then return the values of the groups matched by the expression.

All my attempts so far seem very unwieldy and use the regular expression twice: once to find the target string and then again to get the groups. I also don't like the .* at the beginning and end of the regexp.

Can anyone suggest a more elegant way of doing this?

val DateRegexp = """.*(\d\d\d\d)-(\d\d)-(\d\d).*""".r
val lineWithDate = scala.io.Source.fromFile(filenameGC).getLines().find { _.matches(""".*(\d\d\d\d)-(\d\d)-(\d\d).*""") }
lineWithDate match {
  case Some(result) =>
    result match {
      case DateRegexp(year, month, day) =>
        println(year, month, day)
    }
  case None =>
    println("No date found in file")
}


After great input from Cyrille Corpet I now have...

val DateRegexp = """(\d\d\d\d)-(\d\d)-(\d\d)""".r.unanchored
scala.io.Source.fromFile(filenameGC).getLines().collectFirst {
  case DateRegexp(y, m, d) => println(y, m, d)
}

Answer

A Regex is already an extractor (a pattern, in the pattern-matching sense), so you can use it directly in a case statement:

fileString match {
  case DateRegexp(year, month, day) => println(year, month, day)
  case _ => println("No date found") // avoids a MatchError when nothing matches
}

However, because the leading .* in your pattern is greedy, it will capture the last occurrence of the date in your string, not the first.
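To make the greedy behaviour concrete, here is a minimal sketch using the anchored pattern from the question. The object name GreedyDemo and the helper extract are hypothetical, introduced only for illustration.

```scala
object GreedyDemo {
  // Anchored pattern with a leading and trailing .*, as in the question.
  val DateRegexp = """.*(\d\d\d\d)-(\d\d)-(\d\d).*""".r

  // Hypothetical helper: returns the captured groups if the whole line matches.
  def extract(line: String): Option[(String, String, String)] = line match {
    case DateRegexp(y, m, d) => Some((y, m, d))
    case _                   => None
  }
}
```

Calling GreedyDemo.extract("1987-05-18 2002-12-14") yields the groups of the second date, because the leading .* consumes as much of the line as it can before the groups are tried.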

Thankfully, you can remove the .* at the start and end of your pattern if you make it unanchored (meaning it no longer has to match the whole string). Without the greedy .*, you now catch the first occurrence:

val regex = """(\d\d\d\d)-(\d\d)-(\d\d)""".r.unanchored

"1987-05-18 2002-12-14" match {
  case regex(y, m, d) => (y.toInt, m.toInt, d.toInt) // (1987, 5, 18)
}

EDIT: I realized I have not addressed the first issue of the question, which is that you do not have a single String but a sequence of lines (getLines() returns an Iterator[String]). However, once you have the extractor for a line, you only have to apply it to the lines up to the first relevant one with collectFirst, which finds the first element that matches one of the given cases and applies that case to it:

(lines: List[String]).collectFirst{
  case regex(y, m, d) => println(y, m, d)
}
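Since the question asks to return the group values rather than print them, the pieces above might be combined roughly as follows. This is only a sketch: it assumes Scala 2.13 or later (for scala.util.Using), and the names FirstDate, firstDate and firstDateInFile are hypothetical, not part of the original answer.

```scala
import scala.io.Source
import scala.util.Using

object FirstDate {
  // Unanchored, so the date may appear anywhere in a line.
  private val DateRegexp = """(\d\d\d\d)-(\d\d)-(\d\d)""".r.unanchored

  // Returns the first date found in any line, or None if no line contains one.
  def firstDate(lines: Iterator[String]): Option[(Int, Int, Int)] =
    lines.collectFirst {
      case DateRegexp(y, m, d) => (y.toInt, m.toInt, d.toInt)
    }

  // Opens the file, scans it lazily line by line, and closes it afterwards.
  def firstDateInFile(path: String): Option[(Int, Int, Int)] =
    Using.resource(Source.fromFile(path))(src => firstDate(src.getLines()))
}
```

Because getLines() is lazy and collectFirst stops at the first match, the rest of the file is not read once a date is found. Keeping firstDate separate from the file handling also makes it easy to try out on an in-memory Iterator.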