Captcha Captcha - 4 months ago 52
Scala Question

Remove pipe delimiter from data using Spark

I am new to Spark. I am using Scala to read a pipe-delimited file and save it to HDFS without the pipe delimiters. For that I have written this code:

object WordCount {
def main(args: Array[String])
{
val textfile = sc.textFile("/user/cloudera/xxxx/xxxx")
val word = textfile.map( l => l.split("|"))
word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")
}
}


When I execute it I don't get any errors, but in HDFS I get the data below:

[Ljava.lang.String;@10ed847f
[Ljava.lang.String;@4316ebe
[Ljava.lang.String;@495d7e18
[Ljava.lang.String;@19017f49
[Ljava.lang.String;@314b9e72
[Ljava.lang.String;@5b8f67a6
[Ljava.lang.String;@23ddf240
[Ljava.lang.String;@404b5a25
[Ljava.lang.String;@130b541d
[Ljava.lang.String;@4cbf45af
[Ljava.lang.String;@21780b86
[Ljava.lang.String;@503c9b94
[Ljava.lang.String;@3b0a3ab3


I don't know what I am doing wrong.
Please help.

Answer

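That's because you are splitting each string into an Array of Strings. saveAsTextFile writes each element using its toString, and for a JVM array that is just the type name and a hash code, which is the [Ljava.lang.String;@... output you are seeing. To save as a text file, you'll need to join the fields back into a single string, e.g. with mkString(",") if you wish to concatenate with a comma. But I don't see any purpose in splitting only to join again.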
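To make the difference concrete, here is a minimal sketch in plain Scala (no Spark needed). The object name ArrayToStringDemo and the sample line are made up for illustration, since the real file contents are not shown in the question.

```scala
object ArrayToStringDemo {
  // Hypothetical sample line; the actual data is not shown in the question.
  val sampleLine: String = "a|b|c"

  // Split on a literal pipe. The Char overload of split does not use a regex.
  def fields(line: String): Array[String] = line.split('|')

  // Join the fields back into one string so it is written as readable text.
  def joined(line: String, sep: String = ","): String =
    fields(line).mkString(sep)

  def main(args: Array[String]): Unit = {
    // An Array's toString is the JVM default: type name plus a hash code.
    println(fields(sampleLine).toString)
    // mkString yields the actual contents.
    println(joined(sampleLine))
  }
}
```

In the Spark job, the equivalent would presumably be something like textfile.map(_.split('|').mkString(",")), assuming textfile is the RDD of lines from the question.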

If you want to replace the pipe separator with a comma, you can use _.replaceAll("\\|", ",") instead and save the result:

val word = textfile.map(_.replaceAll("\\|",",").replaceFirst(",","").trim)
word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")

PS: You can replace the comma with anything you want, e.g. a space, a word, etc.
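One thing to watch in the snippet above: replaceFirst(",", "") removes the first comma wherever it occurs, so the chain presumably assumes that every line begins with a pipe. The sketch below, in plain Scala with made-up sample lines, shows what happens in both cases. The helper stripLeading is a hypothetical alternative, not part of the original answer.

```scala
object LeadingPipeDemo {
  // The chain from the answer, applied to a single line.
  def answerChain(line: String): String =
    line.replaceAll("\\|", ",").replaceFirst(",", "").trim

  // Hypothetical alternative: drop a pipe only when it is the first character,
  // then turn the remaining pipes into commas.
  def stripLeading(line: String): String =
    line.stripPrefix("|").replaceAll("\\|", ",").trim

  def main(args: Array[String]): Unit = {
    // Line with a leading pipe: the first comma is the leading one.
    println(answerChain("|a|b|c"))
    // Line without a leading pipe: the first two fields get merged.
    println(answerChain("a|b|c"))
    // The alternative leaves the fields intact in both cases.
    println(stripLeading("|a|b|c"))
    println(stripLeading("a|b|c"))
  }
}
```

If your lines do not start with a pipe, dropping the replaceFirst call is probably the simplest fix.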

So why does the pipe need to be escaped?

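String.split expects a regular expression argument (as does replaceAll). An unescaped | is parsed as a regex alternation meaning "empty string or empty string", which matches between every pair of characters, so the line is broken apart at every character instead of at the pipes. That isn't what you mean.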
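The effect is easy to check in plain Scala. The sketch below uses a made-up sample line and a hypothetical object name, PipeEscapeDemo, and compares the unescaped split with three ways of splitting on a literal pipe.

```scala
import java.util.regex.Pattern

object PipeEscapeDemo {
  // Made-up sample line for illustration.
  val line: String = "a|b|c"

  // Unescaped: "|" is a regex alternation between two empty patterns.
  def unescaped(s: String): Array[String] = s.split("|")

  // Escaped: the backslash makes the pipe a literal character in the regex.
  def escaped(s: String): Array[String] = s.split("\\|")

  // Pattern.quote builds a regex that matches the given text literally.
  def quoted(s: String): Array[String] = s.split(Pattern.quote("|"))

  // Scala's Char overload of split does not interpret a regex at all.
  def byChar(s: String): Array[String] = s.split('|')

  def main(args: Array[String]): Unit = {
    // Print each result with brackets around every element.
    println(unescaped(line).mkString("[", "][", "]"))
    println(escaped(line).mkString("[", "][", "]"))
    println(quoted(line).mkString("[", "][", "]"))
    println(byChar(line).mkString("[", "][", "]"))
  }
}
```

The last three variants should all yield the three fields a, b and c. Which one to use is mostly a matter of taste; the Char overload avoids regex escaping altogether.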