oscarm oscarm - 1 year ago 180
Scala Question

How to create a bigram from a text file with frequency count in Spark/Scala?

I want to take a text file and create a bigram of all words not separated by a dot ".", removing any special characters. I'm trying to do this using Spark and Scala.

This text:

Hello my Friend. How are

you today? bye my friend.

Should produce the following:

hello my, 1

my friend, 2

how are, 1

you today, 1

today bye, 1

bye my, 1

Answer Source

For each of the lines in the RDD, start by splitting based on '.'. Then tokenize each of the resulting substrings by splitting on ' '. Once tokenized, remove special characters with replaceAll and convert to lowercase. Each of these sublists can be converted with sliding to an iterator of string arrays containing bigrams.

Then, after flattening and converting the bigram arrays to strings with mkString as requested, get a count for each one with groupBy and mapValues.

Finally, flatten, reduce, and collect the (bigram, count) tuples from the RDD.
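The final flatten-and-reduce step can be sketched without Spark. The snippet below is only an illustration using plain Scala collections as a stand-in for the RDD operations; the object name MergeCounts and the method merge are hypothetical and not part of the answer's code. It assumes each line has already been turned into its own map of (bigram, count) pairs.

```scala
object MergeCounts {
  // Hypothetical helper: sums per-line bigram counts into one global map,
  // mimicking on local collections what a per-key reduce does on an RDD.
  def merge(perLine: Seq[Map[String, Int]]): Map[String, Int] =
    perLine
      .flatten        // every (bigram, count) pair from every line
      .groupBy(_._1)  // gather the pairs that share a bigram
      .map { case (bigram, pairs) => bigram -> pairs.map(_._2).sum }
}
```

With the sample text, "my friend" is counted once in each line, so merging the two per-line maps gives it a total of 2 while every other bigram stays at 1.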

val rdd = sc.parallelize(Array("Hello my Friend. How are",
                               "you today? bye my friend."))


rdd.map{

    // Split each line into substrings by periods
    _.split('.').map{ substrings =>

        // Trim substrings and then tokenize on spaces
        substrings.trim.split(' ').

        // Remove non-alphanumeric characters, using Shyamendra's
        // clean replacement technique, and convert to lowercase
        map{_.replaceAll("""\W""", "").toLowerCase()}.

        // Find bigrams
        sliding(2)
    }.

    // Flatten, and map the bigrams to concatenated strings
    flatMap{identity}.map{_.mkString(" ")}.

    // Group the bigrams and count their frequency
    groupBy{identity}.mapValues{_.size}

}.

// Reduce to get a global count, then collect
flatMap{identity}.reduceByKey(_+_).collect.

// Format and print
foreach{x=> println(x._1 + ", " + x._2)}

you today, 1
hello my, 1
my friend, 2
how are, 1
bye my, 1
today bye, 1    
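
One edge case worth knowing about: sliding(2) returns a single shorter window when a sentence has fewer than two tokens, so a one-word sentence would show up as a one-word "bigram". The sketch below is a possible guard, written with plain Scala collections so it can be tried in a REPL without Spark. The object SentenceBigrams and its bigrams method are hypothetical names introduced here for illustration, and the two filter calls are an addition that the answer's code does not have.

```scala
object SentenceBigrams {
  // Hypothetical helper: bigrams of a single sentence that has already
  // been split off on '.'.
  def bigrams(sentence: String): List[String] =
    sentence.trim.split(' ')
      .map(_.replaceAll("""\W""", "").toLowerCase)
      .filter(_.nonEmpty)      // drop tokens that were only punctuation or extra spaces
      .sliding(2)
      .filter(_.length == 2)   // discard the short window left by sentences with < 2 tokens
      .map(_.mkString(" "))
      .toList
}
```

For example, SentenceBigrams.bigrams("Hello my Friend") gives List("hello my", "my friend"), while a one-word sentence such as "Hi" gives an empty list instead of a stray "hi" entry.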