blue-sky - 3 months ago
Scala Question

Matching hrefs within String using Scala

Using this regex

(<a[^>]+>.+?<\\/a>)

I'm attempting to print the matching links, so t1, t2 and t3 should be printed, but nothing is printed:

val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"
val re = "(<a[^>]+>.+?<\\/a>)".r
for (p <- re findAllIn str) p match {
  case re(b) => print(b)
}


Is the regex itself incorrect, or is it the way I am applying it?

Update:

Using the accepted answer, this will download all valid hrefs (those beginning with 'http') from a URL, in this case https://news.ycombinator.com/:

import scala.io.Source
import org.jsoup.Jsoup
// JavaConverters (explicit .asScala) replaces the deprecated JavaConversions
import scala.collection.JavaConverters._

object Main extends App {

  // Returns (href, link text) pairs for every link whose href starts with "http"
  def getHrefsFromPage(url: String): List[(String, String)] = {
    val doc = Jsoup.parse(Source.fromURL(url).mkString)
    val aTags = doc.select("a").asScala.toList
    val ts = for (t <- aTags) yield (t.attr("href"), t.text)
    ts.filter { case (href, _) => href.trim.startsWith("http") }
  }

  val hrefs = getHrefsFromPage("https://news.ycombinator.com/")

  hrefs.foreach(println)

}

Answer

Please read this SO Answer first.

Now, coming back to your question.

You need a reliable HTML parser library to parse HTML strings; a regex won't be enough in most non-trivial cases.

A regex won't get the job done because:

  • It is error prone. We make mistakes writing regexes all the time, and you are the only verifier and maintainer (a maintenance nightmare).
  • It is hard to maintain and document.
  • It is hard to test. You have to think of every possible input string for your regex and then write a test case for each.

Why an HTML parser is better:

  • It is less error prone: it has been verified by many contributors and users, unlike your regex, which only you use and verify.

  • It is documented on its own site and in its Javadoc.

  • HTML parsing is already tested in the library itself, so you can focus on testing your app's functionality or business use case.

  • It gives you CSS selectors and a DOM structure to select and manipulate the HTML. (This is the biggest benefit; you will need CSS selector support for any serious HTML work.)

For these reasons, I suggest using the Jsoup HTML parser. Below I describe its usage for your case.

First get the dependency, or just download the jar. The Maven dependency is:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
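
Since the question is about Scala, the same coordinates can also be declared in sbt. This is a minimal sketch that assumes an sbt build (the original only shows Maven); the group, artifact and version are copied from the Maven snippet above.

```scala
// build.sbt: single % because jsoup is a plain Java artifact (no Scala version suffix)
libraryDependencies += "org.jsoup" % "jsoup" % "1.9.2"
```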

Next, the imports:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

Now parse your HTML string:

val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"
val doc = Jsoup.parse(str)

This gives:

doc: org.jsoup.nodes.Document =
<html>
 <head></head>
 <body>
  tester
  <a href="t1">this is just test text</a>
  <a href="t2">\r\t\s</a>
  <a href="t3"></a>
 </body>
</html>

Notice that Jsoup built a full document structure from your string and closed the unclosed tags.

Getting all <a> tags

val aTags = doc.select("a")

Result:

aTags: org.jsoup.select.Elements =
<a href="t1">this is just test text</a>
<a href="t2">\r\t\s</a>
<a href="t3"></a>

Getting the string representation of all <a> tags

val aTagsString = aTags.toString

Result:

aTagsString: String =
<a href="t1">this is just test text</a>
<a href="t2">\r\t\s</a>
<a href="t3"></a>

Getting the first (0th) <a> tag

val firstATag = doc.select("a").get(0)

Result:

firstATag: org.jsoup.nodes.Element = <a href="t1">this is just test text</a>

Getting the string representation of the first <a> tag

val firstATagString = firstATag.toString

Result:

firstATagString: String = <a href="t1">this is just test text</a>

Getting the inner text of firstATag (the 0th <a> tag)

val firstATagInnerText = firstATag.text

Result:

firstATagInnerText: String = this is just test text
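
The question asked for the href values t1, t2 and t3 rather than the link text. Here is a minimal sketch of one way to read them, assuming the jsoup jar is on the classpath and the code is pasted into the REPL or run as a script. The name hrefs is only illustrative, and on Scala 2.13+ you may prefer scala.jdk.CollectionConverters over JavaConverters.

```scala
import org.jsoup.Jsoup
import scala.collection.JavaConverters._

// The string from the question: three <a> tags, none of them closed.
val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"
val doc = Jsoup.parse(str)

// "a[href]" selects only <a> elements that carry an href attribute;
// attr("href") returns the attribute value as written in the markup.
val hrefs: List[String] = doc.select("a[href]").asScala.map(_.attr("href")).toList

println(hrefs.mkString(",")) // t1,t2,t3
```

The selector does the filtering, so no regex is involved at any point.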

Notice: even though your tags were not closed, the parser worked fine, whereas your regex failed on this edge case because it requires a closing </a> that the string never contains.
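
To make that failure concrete, here is a small sketch that uses only scala.util.matching.Regex from the standard library. The pattern closedLink is the one from the question. The pattern hrefValue is a hypothetical alternative written for this illustration; it happens to work on this particular string, but it is not a substitute for a real parser.

```scala
import scala.util.matching.Regex

// The string from the question: three <a> tags, none of them closed.
val str = "tester<a href=\"t1\">this is just test text<a href=\"t2\">\\r\\t\\s<a href=\"t3\">"

// The question's pattern demands a literal "</a>" after each opening tag.
val closedLink: Regex = "(<a[^>]+>.+?<\\/a>)".r

// Hypothetical alternative: capture only the double-quoted href value of an opening tag.
val hrefValue: Regex = "<a\\s[^>]*href=\"([^\"]*)\"".r

val closedMatches: List[String] = closedLink.findAllIn(str).toList
val hrefs: List[String] = hrefValue.findAllMatchIn(str).map(_.group(1)).toList

println(closedMatches) // List()
println(hrefs)         // List(t1, t2, t3)
```

Even the second pattern breaks as soon as an href uses single quotes or no quotes at all, which is the kind of edge case the parser already handles.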
