Jonathan Aurry - 1 month ago
Java Question

How to parse a string in Java to get only some parts of it

I need to parse a string like this one:

"<img src=\"some_link\" height=\"200\" width=\"auto\" /><br><br\>"Lorem ipsum dolor si amet...\" Name<br>address<br>www.google.com<br>01 42 42 42 42"


I need everything after the img tag, but with each piece separate: the lorem ipsum part / the name part / the web link part / the phone number

I'm not really here for a code example, but for methods and techniques to do it. At first I wanted to just delete the img part and replace the br tags with \n, but it would be great to have each piece of information separate so that I can work on them.

EDIT:
I used jsoup as mentioned below and it works fine! Thanks

Answer

Because this is not just any string, but HTML, you should use an HTML parser (never ever attempt parsing HTML with regex).

jsoup is a popular choice for this in Java:

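    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.TextNode;

    String html = "<img src=\"some_link\" height=\"200\" width=\"auto\" /><br><br\\>\"Lorem ipsum dolor si amet...\" Name<br>address<br>www.google.com<br>01 42 42 42 42";
    Document doc = Jsoup.parse(html);

    // Visit every element and print each text node it directly contains
    for (Element e : doc.select("*")) {
        for (TextNode tn : e.textNodes()) {
            System.out.println(tn.text());
        }
    }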
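
Since the question asks for each piece separately, here is a rough sketch of one way to map those text nodes onto named fields once you have collected them into a list. It uses only plain string handling, no HTML parsing. The class name `ContactFields` and the field names are made up for illustration, and it assumes the loop above yields exactly four text nodes in document order: the quoted text followed by the name, then the address, the web link, and the phone number. If your real data deviates from that layout, the checks will throw rather than guess.

```java
import java.util.List;

// Hypothetical helper (not part of jsoup): holds the separated pieces.
final class ContactFields {
    final String quote;
    final String name;
    final String address;
    final String website;
    final String phone;

    private ContactFields(String quote, String name, String address,
                          String website, String phone) {
        this.quote = quote;
        this.name = name;
        this.address = address;
        this.website = website;
        this.phone = phone;
    }

    // Assumed layout of the text nodes, in document order:
    //   0: "quoted text" Name
    //   1: address
    //   2: web link
    //   3: phone number
    static ContactFields fromTextNodes(List<String> nodes) {
        if (nodes.size() != 4) {
            throw new IllegalArgumentException(
                "expected 4 text nodes, got " + nodes.size());
        }
        String first = nodes.get(0).trim();
        int closingQuote = first.lastIndexOf('"');
        // The first node must start with a quote and contain a second one.
        if (!first.startsWith("\"") || closingQuote <= 0) {
            throw new IllegalArgumentException(
                "first node is not of the form \"text\" Name: " + first);
        }
        String quote = first.substring(1, closingQuote);
        String name = first.substring(closingQuote + 1).trim();
        return new ContactFields(quote, name,
            nodes.get(1).trim(), nodes.get(2).trim(), nodes.get(3).trim());
    }
}
```

To use it, add each non-blank `tn.text()` to a `List<String>` instead of printing it, then call `ContactFields.fromTextNodes(list)` and read `quote`, `name`, `address`, `website` and `phone` from the result.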