PShah PShah - 1 year ago 90
Java Question

Counting occurrences of words separated by a delimiter in Java Spark

I am a beginner in Java Spark. I am trying to get the total count of words separated by a delimiter (suppose '|').

The contents of the input file are:

hello | java | this | is | spark

But instead of 5, I am getting 33 as the output. Can anyone please suggest how I can rewrite the function below? Thank you!

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class wordcount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Search").setMaster("local");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        JavaRDD<String> inputFile = sparkContext.textFile("src/main/resources/inputFile");
        JavaRDD<String> words = inputFile.flatMap(
            new FlatMapFunction<String, String>() {
                public Iterable<String> call(String line) {
                    return Arrays.asList(line.split("|"));
                }
            });
        long wordCount = words.count();
    }
}

Answer Source

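split() accepts a regular expression, and in a regular expression an unescaped | is the alternation (OR) operator. So split("|") splits the line on "empty string" OR "empty string", which matches between every pair of characters. You are therefore counting the characters of the line, not the words.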

This is a Java problem, not a problem with Spark.

You can fix that with split("\\|") (escape the delimiter), but you may want to consume the whitespace around the delimiter as well, so use split("\\s*\\|\\s*").
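
To see the difference without running Spark at all, here is a minimal plain-Java sketch that applies each pattern to the sample line from the question. The class name SplitDemo and the helper countTokens are made up for this illustration, and Pattern.quote is shown as an alternative way to treat the delimiter literally; it is not part of the original answer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {
    // Hypothetical helper: number of tokens produced by splitting `line` on `regex`.
    static int countTokens(String line, String regex) {
        return line.split(regex).length;
    }

    public static void main(String[] args) {
        String line = "hello | java | this | is | spark";

        // Unescaped "|" is an alternation of two empty patterns,
        // so the line is split into roughly one token per character.
        System.out.println(countTokens(line, "|"));

        // Escaped pipe: five tokens, but the surrounding spaces are kept.
        System.out.println(Arrays.toString(line.split("\\|")));

        // Pattern.quote escapes the delimiter for you (same result as "\\|").
        System.out.println(countTokens(line, Pattern.quote("|")));

        // Also consume whitespace around the pipe: five clean words.
        System.out.println(Arrays.toString(line.split("\\s*\\|\\s*")));
    }
}
```

In the Spark job from the question, the same pattern would replace the argument of line.split(...) inside call(). Note that this pattern does not trim whitespace at the very start or end of a line, so you may want to call trim() on the line first if your input can contain it.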