AngryPanda - 1 month ago
Scala Question

Splitting strings in Apache Spark using Scala

I have a dataset which contains lines in the following format (tab-separated):

Title<\t>Text


Now, for every word in Text, I want to create a (Word, Title) pair.
For instance:

ABC Hello World


gives me

(Hello, ABC)
(World, ABC)


Using Scala, I wrote the following:

val file = sc.textFile("s3n://file.txt")
val title = file.map(line => line.split("\t")(0))
val wordtitle = file.map(line => (line.split("\t")(1).split(" ").map(word => (word, line.split("\t")(0)))))


But this gives me the following output:

[Lscala.Tuple2;@2204b589
[Lscala.Tuple2;@632a46d1
[Lscala.Tuple2;@6c8f7633
[Lscala.Tuple2;@3e9945f3
[Lscala.Tuple2;@40bf74a0
[Lscala.Tuple2;@5981d595
[Lscala.Tuple2;@5aed571b
[Lscala.Tuple2;@13f1dc40
[Lscala.Tuple2;@6bb2f7fa
[Lscala.Tuple2;@32b67553
[Lscala.Tuple2;@68d0b627
[Lscala.Tuple2;@8493285


How do I solve this?

Further details

What I want to achieve is to count the number of Words that occur in a Text for a particular Title.

The subsequent code that I have written is:

val wordcountperfile = file.map(line => (line.split("\t")(1).split(" ").flatMap(word => word), line.split("\t")(0))).map(word => (word, 1)).reduceByKey(_ + _)


But it does not work. Please feel free to share your input on this. Thanks!

Answer

So... in Spark you work with distributed data structures called RDDs (Resilient Distributed Datasets). They provide functionality similar to Scala's collection types. The output you are seeing is simply the default toString of a Java array: your map produces an RDD[Array[(String, String)]] (one array of pairs per line), so printing each element shows an array reference like [Lscala.Tuple2;@2204b589. What you want is flatMap, which flattens those per-line arrays into a single RDD of pairs.
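
As a quick plain-Scala illustration of the map vs flatMap difference (the same distinction applies to RDDs; the two-element list here is just a made-up example):

List( "a b", "c d" ).map( _.split( " " ) )     // a List of two Array[String] values - still nested
List( "a b", "c d" ).flatMap( _.split( " " ) ) // List( a, b, c, d ) - flattened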

val fileRdd = sc.textFile("s3n://file.txt")
// RDD[ String ]

val splitRdd = fileRdd.map( line => line.split("\t") )
// RDD[ Array[ String ] ]

val yourRdd = splitRdd.flatMap( arr => {
  val title = arr( 0 )
  val text = arr( 1 )
  val words = text.split( " " )
  words.map( word => ( word, title ) )
} )
// RDD[ ( String, String ) ]

// Now, if you want to print this...
yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )
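
// Note: on a cluster, foreach runs on the executors, so the println output above
// ends up in the executor logs rather than on the driver console. To inspect a
// few elements from the driver you could bring them back first, e.g.
// (a sketch; take( 20 ) is just an illustrative limit):
yourRdd.take( 20 ).foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )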

// if you want to count ( this count is for non-unique words ):
val countRdd = yourRdd
  .groupBy( { case ( word, title ) => title } )  // group by title
  .map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
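
And if what you actually want is the number of times each word occurs for a particular title (rather than the total number of words per title, which is what countRdd gives you), a minimal sketch building on yourRdd could look like this (assuming one count per (title, word) pair is what you are after):

val wordCountPerTitle = yourRdd
  .map( { case ( word, title ) => ( ( title, word ), 1 ) } )  // key by (title, word)
  .reduceByKey( _ + _ )                                       // sum the 1s for every key
// RDD[ ( ( String, String ), Int ) ]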