Carlos Argueta - 15 days ago
Scala Question

Scala Spark count regex matches in a file

I am learning Spark+Scala and I am stuck with this problem. I have one file that contains many sentences, and another file with a large number of regular expressions. Both files have one element per line.

What I want is to count how many times each regex has a match in the whole sentences file. For example if the sentences file (after becoming an array or list) was represented by

["hello world and hello life", "hello i m fine", "what is your name"]
and the regex file by
["hello \\w+", "what \\w+ your", ...]
then I would like the output to be something like:
[("hello \\w+", 3),("what \\w+ your",1), ...]

My code is like this:

object PatternCount_v2 {
  def main(args: Array[String]) {
    // The text where we will find the patterns
    val inputFile = args(0)
    // The list of patterns
    val inputPatterns = args(1)
    val outputPath = args(2)

    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Load the text file
    val textFile = sc.textFile(inputFile).cache()
    // Load the patterns
    val patterns = Source.fromFile(inputPatterns).getLines.map(line => line.r).toList

    val patternCounts = textFile.flatMap(line => {
      patterns.foreach(
        pattern => {
          (pattern, pattern.findAllIn(line).length)
        }
      )
    })
  }
}

But the compiler complains:

[screenshot of the compiler error; image not available]

If I change the flatMap to just map, the code runs but returns a bunch of empty tuples: () () () ()

Please help! This is driving me crazy.


As far as I can see, there are two issues here:

  1. You should use map instead of foreach. foreach returns Unit: it performs an action with a potential side effect on each element of a collection, and it doesn't return a new collection. map, on the other hand, transforms a collection into a new one by applying the supplied function to each element. That is also why switching the outer flatMap to map gave you a bunch of (): every line was mapped to the Unit returned by foreach.

  2. You're missing the part where you aggregate the results of flatMap to get the actual count per "key" (pattern). This can be done easily with reduceByKey.
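
To make the first point concrete, here is a minimal sketch in plain Scala (no Spark needed) that runs both calls on one line of text. The object name ForeachVsMap and the sample values are only for illustration; the sentence and patterns are borrowed from the question's example.

```scala
object ForeachVsMap {
  // Sample data reused from the question's example.
  val line = "hello world and hello life"
  val patterns = List("hello \\w+".r, "what \\w+ your".r)

  // foreach runs the function for its side effects and discards each result,
  // so the whole expression has type Unit.
  val viaForeach: Unit =
    patterns.foreach(p => (p.toString, p.findAllIn(line).length))

  // map keeps each result and builds a new list of (pattern, count) pairs.
  val viaMap: List[(String, Int)] =
    patterns.map(p => (p.toString, p.findAllIn(line).length))

  def main(args: Array[String]): Unit = {
    println(viaForeach) // prints "()", the Unit value
    println(viaMap)     // two pairs, with counts 2 and 0 for this line
  }
}
```

The tuple built inside foreach is simply thrown away, which is why the compiler has nothing collection-like to hand to flatMap.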

Altogether - this does what you need:

val patternCounts = textFile
  .flatMap(line => patterns.map(pattern => (pattern, pattern.findAllIn(line).length)))
  .reduceByKey(_ + _)
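
If you want to sanity-check the logic without a cluster, the sketch below mirrors the same two steps on ordinary Scala lists, using the example data from the question. It is a local stand-in, not Spark code: groupBy plus sum plays the role of reduceByKey(_ + _), and the object name PatternCountSketch is made up for the example. One deliberate difference is that it keys on the pattern string rather than the Regex object. As far as I know, scala.util.matching.Regex does not define value-based equality, so separately deserialized copies of the same pattern might not be merged as one key; treat that as my assumption and test it on your setup.

```scala
import scala.util.matching.Regex

object PatternCountSketch {
  // Sample data from the question, standing in for the two input files.
  val sentences = List("hello world and hello life", "hello i m fine", "what is your name")
  val patternStrings = List("hello \\w+", "what \\w+ your")

  // Compile each pattern once and keep its source string next to it as the key.
  val patterns: List[(String, Regex)] = patternStrings.map(s => (s, s.r))

  // Step 1 (flatMap): one (patternString, matchesInThisLine) pair per line and pattern.
  val perLine: List[(String, Int)] =
    sentences.flatMap { line =>
      patterns.map { case (key, regex) => (key, regex.findAllIn(line).length) }
    }

  // Step 2 (local stand-in for reduceByKey(_ + _)): sum the counts for each key.
  val totals: Map[String, Int] =
    perLine.groupBy(_._1).map { case (key, pairs) => (key, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    totals.foreach(println)
}
```

For the sample data this gives 3 for "hello \\w+" and 1 for "what \\w+ your", matching the output asked for in the question. On an RDD the same shape would be the flatMap shown above followed by reduceByKey(_ + _), with the pattern string as the first element of each pair.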