Harish Harish - 4 years ago 433
Python Question

Getting the string at the nth position using Python in Apache Spark

lines = sc.textFile(fileName)


I am trying to get the string at positions 10:20 from every line to do some processing. Since lines is an RDD, it's giving a syntax error saying there is no getItem.

Answer Source

Remember, lines is an RDD (collection) of Strings, so you need to call something (substring) on each element. To get the result of a function call on each member of the RDD, map is your friend.

Python (courtesy of @zero323):

lines.map(lambda line: line[10:20])
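
To see exactly which characters a slice picks, here is a small plain-Python sketch that needs no Spark; the sample string is made up for illustration. Python slices exclude the end index, so line[10:20] covers indices 10 through 19 (the same range as Scala's substring(10, 20)), whereas line[10:21] also includes index 20. Slicing a line that is too short returns a shorter, possibly empty, string rather than raising an error.

```python
# Made-up sample line: indices 0-9 are digits, index 10 onward are letters.
sample = "0123456789abcdefghijklmnop"

ten_chars = sample[10:20]          # indices 10..19 (end index is exclusive)
eleven_chars = sample[10:21]       # indices 10..20
short_result = "too short"[10:20]  # slicing past the end does not raise

print(ten_chars)
print(eleven_chars)
print(repr(short_result))
```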

Scala:

lines.map(line => line.substring(10, 20))

This returns another RDD. Transformations such as map are lazy, so nothing runs until you call an action (e.g. returning a result or writing to a file); you can chain further transformations before that action, and the action is what triggers the job.
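
As a rough sketch of how the transformation and an action fit together, assuming pyspark is installed and a SparkContext named sc already exists as in the question. The helper names extract_field and first_fields are hypothetical, and take is just one possible action.

```python
def extract_field(line, start=10, end=20):
    """Return the characters at indices start..end-1 of one line (hypothetical helper)."""
    return line[start:end]


def first_fields(sc, file_name, n=5):
    """Assumes sc is an existing SparkContext; returns up to n extracted fields."""
    lines = sc.textFile(file_name)      # RDD of strings; nothing is read yet
    fields = lines.map(extract_field)   # lazy transformation, returns a new RDD
    return fields.take(n)               # action: this is what triggers the job
```

Calling first_fields(sc, fileName) would return a plain Python list of at most n strings. To write the results out instead, saveAsTextFile on the mapped RDD is the analogous action.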
