Rana Rana - 1 month ago 9
Scala Question

Apache Spark RDD Split "|"

I am trying to produce a formatted CSV file from a pipe ("|") delimited file using Apache Spark. The input file contains:


apple|ball|cat

Blacktown| Bela vista| Greenacre

x|y|z


I am trying with:

val name = sc.textFile("input.txt")
val split=name.map(line=>line.split("|")).map( x => (x(0),x(2)) )
split.foreach(println)


Output:


(x,y)

(a,p)

(B,a)


My required output is:


(apple,cat)

(Blacktown, Greenacre)

(x,z)

Answer

The argument to the split function is a regular expression, so if you want to split on a literal pipe it has to be escaped:

line.split("\\|")

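otherwise it is interpreted as an alternation between two empty patterns, which matches between every character. It is also better to validate the input, so that lines without exactly three fields are skipped instead of failing on x(2):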

name.map(_.split("\\|")).collect {
  case Array(x, _, y) => (x, y)
}
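
To see the difference without a Spark cluster, here is a minimal sketch on a plain Scala collection. The object name PipeSplitDemo and its helper methods are made up for illustration and are not part of Spark; the sample lines are the ones from the question. The same map and collect calls should behave the same way on an RDD of strings, although that is not exercised here.

```scala
object PipeSplitDemo {
  // Sample lines copied from the question's input file.
  val lines: Seq[String] = Seq(
    "apple|ball|cat",
    "Blacktown| Bela vista| Greenacre",
    "x|y|z"
  )

  // Unescaped "|" is a regex alternation of two empty patterns,
  // so the string is cut between every character (Java 8+ behaviour).
  def splitUnescaped(line: String): Array[String] = line.split("|")

  // Escaped pipe: the regex now matches the literal '|' character.
  def splitEscaped(line: String): Array[String] = line.split("\\|")

  // Scala's Char overload of split treats the separator literally.
  def splitChar(line: String): Array[String] = line.split('|')

  // Keep the first and third field of rows that have exactly three fields;
  // rows with any other field count are silently dropped by collect.
  def firstAndLast(rows: Seq[String]): Seq[(String, String)] =
    rows.map(_.split("\\|")).collect {
      case Array(first, _, last) => (first, last)
    }

  def main(args: Array[String]): Unit = {
    // First three pieces per line, showing the character-by-character split.
    lines.foreach(l => println(splitUnescaped(l).take(3).mkString("[", ", ", "]")))
    // Tuples built from the correctly split fields.
    firstAndLast(lines).foreach(println)
  }
}
```

Note that the fields are not trimmed, so the second tuple keeps the leading space of " Greenacre", which matches the required output in the question. A line such as "a|b" has only two fields and is dropped rather than throwing an ArrayIndexOutOfBoundsException.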