Jeremy Hunts - 2 days ago
Java Question

Split raw HTML String into lines again in Jsoup

I extracted the raw HTML of a website, but it all ended up in one string. I want to split it into lines, just like "view page source" in Google Chrome.

This is my code.

String url = "https://stratechery.com/2016/how-google-cloud-platform-is-challenging-aws/";
//crawl(url," more Complete Footwear.txt",9000);

System.out.println(br2nl(url));
Document doc = Jsoup.connect(url)
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post();
String rawhtml = doc.toString();
String[] lines = rawhtml.split("\"" + " ");


I tried to split the "rawhtml" string on a quote followed by a space, but that sequence appears all over every line, so it split everywhere.

Tim
Answer

I think you might be missing the point of Jsoup.

You don't have to parse the HTML yourself line by line; Jsoup has methods for that. The HTML is already parsed in the Jsoup Document you created, so you can access its elements one by one or in groups. For what is possible, take a look at the official docs: https://jsoup.org/cookbook/
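As a rough illustration of what "accessing elements" can look like, here is a minimal sketch that parses a small inline HTML string instead of fetching a page. The class name ElementAccessSketch, the helper extractLinks, and the sample HTML are made up for this example; they are not from the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementAccessSketch {

  // Hypothetical helper: collects the href value of every <a> that has one.
  static List<String> extractLinks(String html) {
    Document doc = Jsoup.parse(html);
    List<String> hrefs = new ArrayList<>();
    // select() takes a CSS-style selector and returns the matching elements.
    for (Element link : doc.select("a[href]")) {
      hrefs.add(link.attr("href"));
    }
    return hrefs;
  }

  public static void main(String[] args) {
    // Sample markup, assumed for illustration only.
    String html = "<html><head><title>Demo</title></head>"
        + "<body><p>See <a href=\"https://jsoup.org/cookbook/\">the cookbook</a>.</p></body></html>";

    // The parsed Document exposes the page title directly.
    System.out.println(Jsoup.parse(html).title());
    System.out.println(extractLinks(html));
  }
}
```

With a fetched page you would call the same methods on the Document returned by Jsoup.connect(url).get().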

That said, to answer your question: to split the whole HTML string on newlines, you could do this:

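import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupTest {

  public static void main(String[] args) throws IOException {

    String url = "https://stratechery.com/2016/how-google-cloud-platform-is-challenging-aws/";

    // Fetch and parse the page with a plain GET request.
    Document doc = Jsoup.connect(url)
        .userAgent("Mozilla")
        .get();

    // doc.toString() returns the document's HTML as one string; split it on newlines.
    for (String s : doc.toString().split("\\n")) {
      System.out.println(s);
    }
  }
}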
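One caveat: doc.toString() gives Jsoup's own re-serialized, pretty-printed HTML, so the lines may not match what "view page source" shows. If you want the response as the server sent it, Jsoup.connect(url).execute().body() returns the unparsed body as a String. Either way the splitting step is plain Java. The sketch below uses only the standard library, with a made-up class name (LineSplitSketch) and sample string, and splits on the regex \R so that Windows-style \r\n line endings are handled too (Java 8 or later).

```java
import java.util.Arrays;
import java.util.List;

public class LineSplitSketch {

  // Splits text on any line terminator (\n, \r\n, \r, ...) via the regex \R.
  static List<String> splitLines(String text) {
    return Arrays.asList(text.split("\\R"));
  }

  public static void main(String[] args) {
    // Stand-in for a fetched page; with Jsoup this could be
    // Jsoup.connect(url).execute().body() or doc.toString().
    String rawHtml = "<html>\r\n<body>\n<p>Hi</p>\n</body>\n</html>";

    for (String line : splitLines(rawHtml)) {
      System.out.println(line);
    }
  }
}
```

Splitting on "\\n" alone, as in the answer above, leaves a trailing \r on each line when the source uses \r\n, which is why \R is used here.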