Alberto Andeliero - 2 months ago
Scala Question

How to match an HTML tag

I have to parse a string like this:

foo <img ... > <strong>foo</strong> bar


and I need to replace the img tag with an empty string:

foo <strong>foo</strong> bar


I've tried with

<img.*>


but the result is

foo bar


How can I do this?

PS: the HTML string is malformed.

Answer

To match the taste of SO, this answer has three parts:

* Answer to your problem
* Official rant
* Cleaner solution

Answer to the problem

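The * quantifier is greedy, so <img.*> matches from <img up to the last > in the string, which is too much. Two solutions are possible:

1.) *? is the non-greedy version of "match all": <img.*?> stops at the first > after <img
2.) <[^>]+> matches everything within one pair of brackets; restricted to the img tag, that is <img[^>]*>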
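A minimal Scala sketch of both fixes, applied to the img tag only. The object and method names are hypothetical and the sample input is made up for illustration; only scala.util.matching.Regex from the standard library is used.

```scala
import scala.util.matching.Regex

// Hypothetical helper, not part of the original answer.
object StripImg {
  // Greedy: .* runs to the LAST '>' on the line (the pattern from the question).
  private val Greedy: Regex = "<img.*>".r
  // Fix 1: non-greedy, .*? stops at the FIRST '>' after <img.
  private val NonGreedy: Regex = "<img.*?>".r
  // Fix 2: negated character class, consumes anything that is not '>'.
  private val NegatedClass: Regex = "<img[^>]*>".r

  def stripGreedy(html: String): String = Greedy.replaceAllIn(html, "")
  def stripNonGreedy(html: String): String = NonGreedy.replaceAllIn(html, "")
  def stripNegatedClass(html: String): String = NegatedClass.replaceAllIn(html, "")

  def main(args: Array[String]): Unit = {
    val input = """foo <img src="a.png" > <strong>foo</strong> bar"""
    println(stripGreedy(input))       // the <strong> element is lost as well
    println(stripNonGreedy(input))    // only the img tag is removed
    println(stripNegatedClass(input)) // only the img tag is removed
  }
}
```

Both fixes leave the whitespace that surrounded the tag, so two spaces remain where the img tag was. The two patterns also differ on multi-line input: by default . does not match a line break, so <img.*?> misses a tag that spans several lines, while <img[^>]*> still matches it.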

Rant

Never parse HTML using regex. There are many subtle errors you can run into. See also this post: RegEx match open tags except XHTML self-contained tags

Cleaner solution

Parse the string with an XML parser backed by TagSoup: https://hackage.haskell.org/package/tagsoup. Here is an example that lets you treat HTML as an XML-like structure with Scala and TagSoup: https://github.com/daandi/spOCR/blob/master/src/main/scala/biz/neumann.parser/HTMLParser.scala
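A rough sketch of what this could look like. It is not taken from the linked repository. It assumes the Java TagSoup library (org.ccil.cowan.tagsoup) and scala-xml are on the classpath, and the object and method names are hypothetical.

```scala
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import scala.xml.{Elem, Node, NodeSeq, XML}
import scala.xml.transform.{RewriteRule, RuleTransformer}

// Hypothetical helper; requires the tagsoup and scala-xml dependencies.
object DropImgTags {
  // Rewrite rule that replaces every img element with nothing.
  private val dropImg: RewriteRule = new RewriteRule {
    override def transform(n: Node): Seq[Node] = n match {
      case e: Elem if e.label == "img" => NodeSeq.Empty
      case other                       => other
    }
  }

  // TagSoup repairs malformed HTML while parsing, so scala-xml
  // receives a well-formed tree (wrapped in html and body elements).
  def clean(html: String): Node = {
    val parser = new SAXFactoryImpl().newSAXParser()
    val doc = XML.withSAXParser(parser).loadString(html)
    new RuleTransformer(dropImg).apply(doc)
  }

  def main(args: Array[String]): Unit = {
    val cleaned = clean("""foo <img src="a.png" > <strong>foo</strong> bar""")
    println(cleaned)
  }
}
```

Because the parser returns a complete document, the result is the whole repaired tree rather than the original fragment; select the body element's children if only the fragment is needed.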
