LearningSlowly - 3 months ago
Scala Question

Scala RDD String manipulation

I have an RDD called name.

scala> name
res6: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[24] at map at <console>:37


I can inspect it using
name.foreach(println)


name5000005125651330
name5000005125651331
name5000005125651332
name5000005125651333


I wish to create a new RDD that removes the "name" characters from the beginning of each record and returns the remaining digits as Long values.

Desired outcome:

5000005125651330
5000005125651331
5000005125651332
5000005125651333


I have tried the following:

val name_clean = name.filter(_ != "name")


However, this returns the records unchanged:

name5000005125651330
name5000005125651331
name5000005125651332
name5000005125651333

Answer

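Each entry in the RDD is a string such as "name5000005125651330", which is never equal to "name". The predicate _ != "name" is therefore true for every record, so filter keeps them all. In any case, filter only selects records; it never changes their contents.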
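To see this without a cluster, here is a minimal sketch on a plain Scala Seq standing in for the RDD. The names records and filtered are made up for illustration; the sample values come from the question.

```scala
// Local stand-in for the RDD contents (sample values from the question).
val records = Seq("name5000005125651330", "name5000005125651331")

// No element is exactly equal to "name", so the predicate holds for every
// element and filter returns the collection with nothing removed or changed.
val filtered = records.filter(_ != "name")
```

The result has the same elements as the input, which matches what the question observed.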

What you need is map, which applies a function to every entry and returns a new RDD of the results. Here the function should drop the first 4 characters of the string and convert the rest to a Long.

Putting that together, we get:

name.map(_.drop(4).toLong)
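The per-record logic can be tried on an ordinary collection first. This is only a sketch: sample, sampleIds and stripped are illustrative names, and a local Seq is assumed in place of the RDD. It also shows stripPrefix as a variant that does not hard-code the prefix length.

```scala
// Local stand-in for the RDD (sample values from the question).
val sample = Seq("name5000005125651330", "name5000005125651331")

// Same function as passed to map on the RDD: drop 4 characters, parse as Long.
val sampleIds: Seq[Long] = sample.map(_.drop(4).toLong)

// Variant: stripPrefix removes "name" only if the string actually starts with it.
val stripped: String = "name5000005125651330".stripPrefix("name") // "5000005125651330"
```

Note that toLong throws a NumberFormatException if what remains is not a valid number.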

If you are not certain that every record starts with "name", check for it first. What you need then depends on what you want to do with records that lack the prefix, but something like the following simply discards them:

name.filter(_.startsWith("name")).map(_.drop(4).toLong)
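One caveat: a record such as "nameXYZ" passes the startsWith check, and toLong then throws. If that can happen in your data, a possible approach is to parse into an Option and flatten. The sketch below uses a hypothetical helper parseId and a local Seq called mixed with made-up bad records; it is an illustration under those assumptions, not the only way to do it.

```scala
import scala.util.Try

// Hypothetical helper: Some(id) for "name" followed by a valid Long,
// None for a missing prefix or a non-numeric remainder.
def parseId(record: String): Option[Long] =
  if (record.startsWith("name")) Try(record.drop(4).toLong).toOption
  else None

// Local stand-in for an RDD that contains some malformed records.
val mixed = Seq("name5000005125651330", "other42", "nameXYZ")

// flatMap keeps the Some values and silently drops the None ones.
val parsed: Seq[Long] = mixed.flatMap(r => parseId(r))
```

The same function should be usable with flatMap on the RDD, since Spark's flatMap is commonly called with functions that return an Option.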