jethow - 26 days ago
Java Question

Split up jSoup scraping result

I am scraping this link with the jsoup library in Java. My code works, and I want to ask how to split up the elements I get.

Here is my source:

    package javaapplication1;

    import java.io.IOException;
    import java.sql.SQLException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class coba {

        public static void main(String[] args) throws SQLException {
            MasukDB db = new MasukDB();
            try {
                Document doc = null;
                for (int page = 1; page < 2; page++) {
                    doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
                    System.out.println("title : " + doc.select(".entry-title>a").text() + "\n");
                    System.out.println("link : " + doc.select(".entry-title>a").attr("href") + "\n");
                    System.out.println("body : " + String.join("", doc.select(".entry-content p").text()) + "\n");
                    System.out.println("date : " + doc.select(".entry-date>a").text() + "\n");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }


In the result, every page of the website becomes one line. How can I split it up? And how do I get the link for every article? I think my CSS selector for the link is still wrong.
Thanks.

Answer Source
    doc.select(".entry-title>a").text()

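This searches the entire document and returns a list of every matching link on the page. Calling text() on that list returns the combined text of all the matched elements as one string, which is why each page comes out as a single line, and calling attr("href") on it returns the attribute of the first matched element only. What you probably want is to select each article first and then get the pertinent data from within each one.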
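To make the difference concrete, here is a minimal sketch that runs the same selector against a small inline HTML string instead of the live site. The markup, the class name SelectDemo, and the helper method names are made up for illustration and are not the real hackaday.com page; only the jsoup calls (Jsoup.parse, select, text, attr) are taken from the library.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectDemo {

    // Hypothetical markup for illustration only; not the real hackaday.com HTML.
    static final String HTML =
            "<article><h1 class=\"entry-title\"><a href=\"/first\">First post</a></h1></article>"
          + "<article><h1 class=\"entry-title\"><a href=\"/second\">Second post</a></h1></article>";

    // Elements.text() merges the text of every matched link into one string.
    static String combinedTitles(String html) {
        return Jsoup.parse(html).select(".entry-title>a").text();
    }

    // Elements.attr() returns the attribute of the first matched link only.
    static String firstLink(String html) {
        return Jsoup.parse(html).select(".entry-title>a").attr("href");
    }

    // Iterating the matched list yields one title per link.
    static List<String> titlesOneByOne(String html) {
        List<String> titles = new ArrayList<>();
        for (Element link : Jsoup.parse(html).select(".entry-title>a")) {
            titles.add(link.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(combinedTitles(HTML));  // First post Second post
        System.out.println(firstLink(HTML));       // /first
        System.out.println(titlesOneByOne(HTML));  // [First post, Second post]
    }
}
```

Testing selectors against a saved or inline snippet like this is also a cheap way to check them without requesting the site on every run.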

    // additionally needs: java.text.MessageFormat,
    // org.jsoup.nodes.Element, org.jsoup.select.Elements
    Document doc;
    for (int page = 1; page < 2; page++) {

        doc = Jsoup.connect("http://hackaday.com/page/" + page).get();

        // get the list of articles on the page
        Elements articles = doc.select("main#main article");

        // iterate over the article list
        for (Element article : articles) {

            // find the article header, which includes the title and date
            Element header = article.select("header.entry-header").first();
            if (header == null) {
                continue; // first() returns null when nothing matches
            }

            // find and scrape the title and link from the header
            Element headerTitle = header.select("h1.entry-title > a").first();
            if (headerTitle == null) {
                continue; // skip articles without a title link
            }
            String title = headerTitle.text();
            String link = headerTitle.attr("href");

            // find and scrape the date from the header
            String date = header.select("div.entry-meta > span.entry-date > a").text();

            // find and scrape every paragraph in the article content
            // you will probably want to refine the logic here further,
            // as there may be paragraphs you don't want to include
            String body = article.select("div.entry-content p").text();

            // view the results, one line per article
            System.out.println(
                    MessageFormat.format(
                            "title={0} link={1} date={2} body={3}",
                            title, link, date, body));
        }
    }

See CSS Selectors for more examples of how to scrape this kind of data.
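As a rough illustration of how selector choice changes what gets scraped, the sketch below compares the descendant form used for the body above with the stricter child form, and resolves a relative link to an absolute one. The markup, the base URL http://example.com, and the class and method names are assumptions for the example, not the real page; "abs:href" is jsoup's attribute prefix for resolving a URL against the document's base URI.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectorExamples {

    // Hypothetical markup and base URL, for illustration only.
    static final String HTML =
            "<article>"
          + "<h1 class=\"entry-title\"><a href=\"/2016/01/01/post\">A post</a></h1>"
          + "<div class=\"entry-content\"><p>Intro.</p>"
          + "<blockquote><p>Quoted.</p></blockquote></div>"
          + "</article>";
    static final String BASE_URI = "http://example.com";

    // Descendant selector: matches every <p> nested at any depth in the div.
    static int descendantParagraphs() {
        return Jsoup.parse(HTML, BASE_URI).select("div.entry-content p").size();
    }

    // Child selector: matches only <p> elements directly inside the div.
    static int childParagraphs() {
        return Jsoup.parse(HTML, BASE_URI).select("div.entry-content > p").size();
    }

    // Attribute selector plus "abs:href": the link resolved against BASE_URI.
    static String absoluteLink() {
        Element link = Jsoup.parse(HTML, BASE_URI)
                .select("h1.entry-title > a[href]").first();
        return link == null ? "" : link.attr("abs:href");
    }

    public static void main(String[] args) {
        System.out.println(descendantParagraphs()); // 2
        System.out.println(childParagraphs());      // 1
        System.out.println(absoluteLink());         // http://example.com/2016/01/01/post
    }
}
```

Switching the body selector to the child form is one way to leave out paragraphs nested in quotes or embeds, if that turns out to be the refinement you need.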