Xinlin Feng Xinlin Feng - 9 months ago 47
Scala Question

How to count occurrences of a word inside a Array in scala when using spark?

In spark, I have a RDD which is :

textRDD[Text]


Here "Text" is a class with:

(String, Array[String])


I wish to know in this RDD how many strings in the array are the same as my target string, so is there any way to count the number of that?

The first thing I have is something like:

val count = textRDD.aggregate(0)(seqOp, (acc1, acc2) => acc1 + acc2))


But I have no idea about how to design seqOp in this case.

For example:

textRDD = sc.parallelize(
List(
("s1", Array("this", "is", "a", "sentence", "about", math)),
("s2", Array("math", "is", "an", "english", "word", "in" "english")),
("s3", Array("computer", "science", "is", "a", "science", "with", "math"))
)


How can I count the num of "math"?

Answer Source

You can count the occurrences of a String s in an Array[String] a with the count method:

a.count(_ == s)

For the seqOp in your case above, you would need to add the accumulator:

val count = textRDD.aggregate(0)((acc, pair) => acc + pair._2.count(_ == pair._1), (acc1, acc2) => acc1 + acc2))

I'm not sure if that's what you want exactly, this aggregate would count the sum of all the occurrences of the keys in the arrays, not separated by key.

If you want the occurrences per key, you could use aggregateByKey:

textRDD.map(x => (x._1, (x._1,x._2)))
  .aggregateByKey(0)((acc, pair) => acc+ pair._2.count(_ == pair._1),(a1, a2) => a1 + a2)