Alvaro Gomez - 15 days ago
Java Question

Get JSON elements from the web with Apache Flink

After reading several Apache Flink documentation pages (official documentation, dataartisans) as well as the examples provided in the official repository, I keep seeing examples that use an already-downloaded file as the streaming data source, always connecting to localhost.

I am trying to use Apache Flink to download JSON files that contain dynamic data. I would like to set the URL where the JSON file is served as the input source of Apache Flink, instead of downloading the file with another system and then processing the downloaded copy with Apache Flink.

Is it possible to establish this network connection with Apache Flink?

Answer

You can define the URLs you want to download as your input DataStream and then download the documents from within a MapFunction. The following code demonstrates this:

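// Imports needed by this snippet:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// The statements below go inside a method declared with "throws Exception".
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Each element of the input stream is a URL to download.
DataStream<String> inputURLs = env.fromElements("http://www.json.org/index.html");

inputURLs.map(new MapFunction<String, String>() {
    @Override
    public String map(String s) throws Exception {
        StringBuilder builder = new StringBuilder();

        // try-with-resources closes the reader and the underlying stream, even on failure.
        // UTF-8 is an assumption here; adjust it if the server uses another encoding.
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new URL(s).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                builder.append(line).append('\n');
            }
        }
        // An IOException propagates, so a failed download is not emitted as a partial document.

        return builder.toString();
    }
}).print();

env.execute("URL download job");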
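
One thing to watch with the approach above: URL.openStream() uses no connect or read timeout by default, so an unresponsive server can stall the map operator indefinitely. A minimal sketch of a download helper with explicit timeouts follows, using only the JDK. The class name UrlFetcher, the timeout values, and the UTF-8 encoding are assumptions made for illustration; none of them come from Flink.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Flink): downloads the body of a URL as text.
final class UrlFetcher {

    // Assumed timeout values; tune them for the real endpoint.
    private static final int CONNECT_TIMEOUT_MS = 5_000;
    private static final int READ_TIMEOUT_MS = 10_000;

    private UrlFetcher() {
    }

    static String fetch(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        connection.setConnectTimeout(CONNECT_TIMEOUT_MS);
        connection.setReadTimeout(READ_TIMEOUT_MS);

        StringBuilder body = new StringBuilder();
        // UTF-8 is assumed; the reader and its stream are closed automatically.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back per line.
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }
}
```

With a helper like this on the classpath, the body of map(String s) could shrink to a single line, return UrlFetcher.fetch(s);, and the download logic can be tested without starting a Flink job.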