
Selecting a particular column using Spark

I have a file in HDFS which is pipe (|) separated, and I am trying to extract the 6th column using Scala. For that I have written the code below:

object WordCount {
  def main(args: Array[String]) {
    val textfile = sc.textFile("/user/cloudera/xxx/xxx")
    val word = textfile.filter(x => x.length > 0).map(_.replaceAll("\\|", ",").trim)
    val keys = word.map(a => a(5))
    keys.saveAsTextFile("/user/cloudera/xxx/Sparktest")
  }
}


But the result I am getting in HDFS is not what I want.

The original data was:

MSH|^~\&|RQ|BIN|SMS|BIN|2019||ORU^R01|120330003918|J|2.2
PID|1|xxxxx|xxxx||TEST|Rooney|19761202|M|MR^^M^MR^MD^11|7|0371 HOES LANE^0371


Now the result I get in HDFS is:

\
T
I
,
1
N
\
T
I
,
1
N
\
T
I


I want my result to be:

BIN
TEST


I don't know what I am doing wrong. Please help.

Answer

You're replacing | with ,, but you're never splitting on the comma, so word still has type RDD[String], not RDD[Array[String]] as you seem to expect. a => a(5) therefore indexes into each line as a sequence of chars and returns its 6th character, which is the result you're seeing.
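For example, indexing a String in Scala returns a Char, not a field. A quick REPL check on the first sample line (truncated, after the pipe-to-comma replace) shows the difference:

val line = "MSH,^~\\&,RQ,BIN,SMS,BIN,2019"
line(5)             // => '~'  (the 6th character)
line.split(',')(5)  // => "BIN" (the 6th field)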

Not sure why you'd replace the pipes with commas in the first place; you can just split on the pipe directly:

val word = textfile.filter(x => x.length > 0).map(_.split('|'))
val keys = word.map(a => a(5).trim)
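This gives BIN and TEST for the two sample lines. Note that split('|') takes a Char and splits on the literal pipe; the String overload split("|") would be interpreted as a regular expression, in which | means alternation, so it would have to be escaped as split("\\|").

For completeness, here is a self-contained sketch of the full job (the SparkContext setup and the object name are assumptions; in the spark-shell, sc already exists and only the last three lines of main are needed):

import org.apache.spark.{SparkConf, SparkContext}

object ExtractColumn {
  def main(args: Array[String]): Unit = {
    // Assumption: running as a standalone app, so the context is created here
    val conf = new SparkConf().setAppName("ExtractColumn")
    val sc = new SparkContext(conf)

    val textfile = sc.textFile("/user/cloudera/xxx/xxx")
    // Split each non-empty line on the literal pipe and keep the 6th field
    val word = textfile.filter(x => x.length > 0).map(_.split('|'))
    val keys = word.filter(_.length > 5)  // skip lines with fewer than 6 fields
                   .map(a => a(5).trim)
    keys.saveAsTextFile("/user/cloudera/xxx/Sparktest")

    sc.stop()
  }
}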