sarath sarath - 1 month ago 18
Java Question

Regex to parse html source in JSoup

I am trying to fetch values from a web page's source. These are the Jsoup selectors I have:

e=d.select("li[id=result_48]");
e=d.select("div[id=result_48]");


These are the HTML tags:

<li id="result_48" data-asin="0781774047" class="s-result-item">
<div id="result_48" data-asin="0781774047" class="s-result-item">


Whatever tag appears in place of "li" or "div", I want to get the element with that id, so I want to use a regex in place of "li" or "div".

So the Jsoup selector should check for id=result_48, and if an element matches, I want its data. How can I do that?

Thanks in advance

Answer

Tested with the id attribute in different positions. I might have missed some cases, so test with your actual data. This assumes that the id value contains no spaces or quotes.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdExtractor {

    public static void main(String[] args) throws Exception {
        String[] lines = {
                "<li id=\"result_48\" data-asin=\"0781774047\" class=\"s-result-item\">",
                "<div id=\"result_48\" data-asin=\"0781774047\" class=\"s-result-item\">",
                "<div data-asin=\"0781774047\" id=\"result_48\" class=\"s-result-item\">",
                "<div data-asin=\"0781774047\" class=\"s-result-item\" id=\"result_48\">" };
        for (String str : lines) {
            System.out.println(extractId(str));
        }
    }

    private static String extractId(String line) {
        String regex = "";
        // match "<li" or "<div", then anything up to a whitespace followed by id="
        // (the whitespace keeps attributes such as data-id from matching)
        regex = regex + "(?:[<](?:li|div)).*\\sid=\"";
        // capture the id inside the quotes (no spaces or quotes allowed)
        regex = regex + "([^\\s\"]+)";
        // match any remaining characters up to the closing ">
        regex = regex + "(?:.*\">)";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(line);
        // matches() requires the whole line to fit the pattern
        if (matcher.matches()) {
            return matcher.group(1);
        }
        return null;
    }
}
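
Since the question asks for "whatever comes in place of li or div", the same regex idea can be loosened to accept any tag name. The following is a minimal sketch under the same assumptions as above (double-quoted id, no spaces or quotes in the value); the class name AnyTagIdExtractor and the pattern are illustrative and not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnyTagIdExtractor {

    // "<", any tag name, any attributes (lazily), then whitespace + id="value".
    // Group 1 is the tag name, group 2 is the id value.
    private static final Pattern ID_IN_ANY_TAG = Pattern.compile(
            "<([A-Za-z][A-Za-z0-9]*)\\b[^>]*?\\sid=\"([^\\s\"]+)\"");

    /** Returns the id of the first opening tag in the line that has one, or null. */
    public static String extractId(String line) {
        Matcher matcher = ID_IN_ANY_TAG.matcher(line);
        // find() is used instead of matches() so the tag may sit anywhere in the line
        return matcher.find() ? matcher.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractId("<li id=\"result_48\" data-asin=\"0781774047\">"));
        System.out.println(extractId("<span class=\"s-result-item\" id=\"result_48\">"));
        // data-id is not an id attribute, so this line yields null
        System.out.println(extractId("<div data-id=\"other\" class=\"s-result-item\">"));
    }
}
```

If the document is already parsed with Jsoup, a regex may not be needed at all: a selector without a tag name, such as d.select("[id=result_48]") or d.select("#result_48"), matches on the id regardless of the element type.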