runnerpaul runnerpaul - 5 months ago 23
Java Question

Spark output to multiple files

I'm learning Spark by working through some of the examples in Learning Spark: Lightning Fast Data Analysis and then adding my own developments in.

I created this class to get a look at basic transformations and actions.

/**
* Find errors in a log file
*/

package com.oreilly.learningsparkexamples.mini.java;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class FindErrors {
public static void main(String args[]){
String inputFile = args[0];
String outputFile = args[1];
//Create a Spark context
SparkConf conf = new SparkConf().setAppName("findErrors");
JavaSparkContext sc = new JavaSparkContext(conf);
//Load input data
JavaRDD<String> input = sc.textFile(inputFile);
//Split up into words
JavaRDD<String> errorsRDD = input.filter(
new Function<String, Boolean>() {
public Boolean call(String x) {
return x.contains("error");
}
});
//Transform into word and count
//errorsRDD.saveAsTextFile(outputFile);

JavaRDD<String> warningsRDD = input.filter(
new Function<String, Boolean>() {
public Boolean call(String x) {
return x.contains("warning");
}
});

JavaRDD<String> badLinesRDD = errorsRDD.union(warningsRDD);

badLinesRDD.saveAsTextFile(outputFile);

System.out.println("I had " + badLinesRDD.count() + " concerning lines.");
System.out.println("Here are 10 examples:");
for(String line: badLinesRDD.take(10)){
System.out.println(line);
}

}
}


Here's the command I used to run it:

$SPARK_HOME/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.java.FindErrors ./target/learning-spark-mini-example-0.0.1.jar ../files/fake_logs/log1.log ./errorLog


Here's the contents of the log file:

66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /favicon.ico HTTP/1.1" 200 1713 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET / HTTP/1.1" 200 18785 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET /jobmineimg.php?q=m HTTP/1.1" 200 222 "http://www.holdenkarau.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET /jobmineimg.php?q=m HTTP/1.1" 200 222 "http://www.holdenkarau.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"


One thing I noticed is that the output creates several files rather than the one file I expected.

The files are:

_SUCCESS


part-00000
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"

part-00001
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"

part-00002


part-00003
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"


It appears as if a file is created for each "grouping" of warnings/errors. What is the blank file for though?

Also, is this likely to be something that's in my code which I haven't found yet or is it a feqature of Spark?

Answer Source

This is a feature. With saveAsTextFile Spark writes a single output file per partition, no matter if it contains data or not. Since you apply filter some input partitions, which originally contained data, can end up empty. Hence the empty files.