Pedro Rodgers - 3 months ago
Scala Question

Spark Scala - Split columns into multiple rows

Following the question that I posted here:

Spark Mllib - Scala

I have another doubt... Is it possible to transform a dataset like this:

2,1,3
1
3,6,8


Into this:

2,1
2,3
1,3
1
3,6
3,8
6,8


Basically I want to discover all the pairwise relationships between the movies. Is it possible to do this?

My current code is:

val input = sc.textFile("PATH")
val raw = input.lines.map(_.split(",")).toArray
val twoElementArrays = raw.flatMap(_.combinations(2))
val result = twoElementArrays ++ raw.filter(_.length == 1)

Answer

Given that input is a multi-line string.

scala> val raw = input.lines.map(_.split(",")).toArray
raw: Array[Array[String]] = Array(Array(2, 1, 3), Array(1), Array(3, 6, 8))

The following approach discards one-element arrays (the 1 in your example).

scala> val twoElementArrays = raw.flatMap(_.combinations(2))
twoElementArrays: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8))

This can be fixed by appending the filtered raw collection.

scala> val result = twoElementArrays ++ raw.filter(_.length == 1)
result: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8), Array(1))

I believe the order within each combination is not relevant here.
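For reference, the whole pipeline runs end-to-end on a plain multi-line string with no Spark required; the input value below is just a stand-in for your file contents:

```scala
// Self-contained version of the steps above; `input` stands in for the
// multi-line string read from the file.
val input = "2,1,3\n1\n3,6,8"

// One Array[String] of movie ids per line.
// (`lines` was renamed `linesIterator` in Scala 2.13 to avoid the JDK 11 clash.)
val raw = input.linesIterator.map(_.split(",")).toArray

// All unordered pairs within each line; one-element lines yield no pairs.
val twoElementArrays = raw.flatMap(_.combinations(2))

// Re-append the one-element lines that combinations(2) dropped.
val result = twoElementArrays ++ raw.filter(_.length == 1)

result.foreach(a => println(a.mkString(",")))
// prints: 2,1  2,3  1,3  3,6  3,8  6,8  1 (one per line)
```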


Update: SparkContext.textFile returns an RDD of lines, so the same approach can be plugged in as:

val raw = rdd.map(_.split(","))
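To sketch how the same steps chain together on the RDD without standing up a SparkContext here, a local Seq of lines can stand in for the RDD (map, flatMap and filter read the same either way); in a real job the first line would be sc.textFile("PATH"):

```scala
// Local sketch: a Seq of lines stands in for the RDD returned by
// sc.textFile("PATH"); the transformation chain is identical.
val rddStandIn = Seq("2,1,3", "1", "3,6,8")

val raw = rddStandIn.map(_.split(","))            // split each line into movie ids
val pairs = raw.flatMap(_.combinations(2))        // all unordered pairs per line
val result = pairs ++ raw.filter(_.length == 1)   // keep the single-movie lines too

result.foreach(a => println(a.mkString(",")))
```

On an actual RDD you would finish with result.collect() to bring the pairs back to the driver, or saveAsTextFile to write them out.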