Shihao Xu - 16 days ago
Scala Question

Scala - Spark Word Count, why sliding not working

I want to count the frequency of each bigram.

So I wrote

val intputFile = "bible+shakes.nopunc"
val sentences = sc.textFile(intputFile)

val bigrams = sentences.map(sentence => sentence.trim.split(' ')).flatMap( wordList =>
for (i <- List.range(0, (wordList.length - 2))) yield ((wordList(i), wordList(i + 1)), 1))

val bigrams2 = sentences.map(sentence => sentence.trim.split(' ')).flatMap( wordList =>
wordList.sliding(2).map{case Array(x, y) => ((x,y), 1)})

And they seem to have the same type.

scala> bigrams
res11: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[7] at flatMap at <console>:28

scala> bigrams2
res12: org.apache.spark.rdd.RDD[((String, String), Int)] = MapPartitionsRDD[11] at flatMap at <console>:28

However, when I call collect on each:

scala> bigrams.collect
res13: Array[((String, String), Int)] = Array(((holy,bible),1), ((bible,authorized),1), ((authorized,king),1), ((king,james),1), ((james,version),1), ((version,textfile),1), ((in,the),1), ((the,beginning),1), ((beginning,god),1), ((god,created),1), ((created,the),1), ((the,heaven),1), ((heaven,and),1), ((and,the),1), ((and,the),1), ((the,earth),1), ((earth,was),1), ((was,without),1), ((without,form),1), ((form,and),1), ((and,void),1), ((void,and),1), ((and,darkness),1), ((darkness,was),1), ((was,upon),1), ((upon,the),1), ((the,face),1), ((face,of),1), ((of,the),1), ((the,deep),1), ((deep,and),1), ((and,the),1), ((the,spirit),1), ((spirit,of),1), ((of,god),1), ((god,moved),1), ((moved,upon),1), ((upon,the),1), ((the,face),1), ((face,of),1), ((of,the),1), ((and,god),1), ((god,said),1), ((...

scala> bigrams2.collect
16/10/05 10:17:52 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 20)
scala.MatchError: [Ljava.lang.String;@3224ea91 (of class [Ljava.lang.String;)
at $line27.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2$$anonfun$apply$1.apply(<console>:29)

res25: Array[((String, String), Int)] = Array(((holy,bible),1), ((bible,authorized),1), ((authorized,king),1), ((king,james),1), ((james,version),1))

The second version throws an error when it is evaluated.

Why, and how can I fix it? I prefer the second, more concise way.


The problem with your wordList.sliding(2).map{case Array(x, y) => ((x,y), 1)} is that {case Array(x, y) => ((x,y), 1)} is a partial function which only knows how to deal with an input that matches the pattern Array(x, y).

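Your map therefore throws a MatchError on any window that has just one element. sliding(2) produces such a window whenever a line splits into a single word, and a blank line splits into an array holding one empty string. You should change it to something like the following: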

wordList.sliding(2).flatMap {
  case Array(x, y) => Some(((x, y), 1))
  case _ => None
}

Here, flatMap flattens the Options, so the result contains only valid bigrams.
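
To see the failure mode without a Spark cluster, here is a minimal sketch that runs the same per-line logic on plain Scala collections. The object name BigramSketch and the helper bigramsOf are hypothetical names used only for illustration. It also uses collect with the partial function, which is an alternative to the flatMap/Option form above and not something the original code used: collect applies the function only where it is defined.

```scala
object BigramSketch {
  // Hypothetical helper: turns one line into ((w1, w2), 1) pairs.
  def bigramsOf(line: String): List[((String, String), Int)] =
    line.trim.split(' ')
      .sliding(2)
      // collect skips every window the partial function is not defined for,
      // i.e. windows that do not have exactly two elements
      .collect { case Array(x, y) => ((x, y), 1) }
      .toList

  def main(args: Array[String]): Unit = {
    // A blank line splits into one empty string, so sliding(2) yields a
    // single window of length 1, which `case Array(x, y)` cannot match.
    println("".trim.split(' ').sliding(2).map(_.length).toList)

    println(bigramsOf("in the beginning")) // two bigrams
    println(bigramsOf(""))                 // no bigrams, and no MatchError
  }
}
```

Assuming the RDD from the question, the same body should drop into the Spark version as flatMap(wordList => wordList.sliding(2).collect { case Array(x, y) => ((x, y), 1) }), since flatMap accepts the resulting iterator.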