I am trying to form a list of strings from the header of a CSV file in PySpark. The header in the CSV file is in unicode format. I wrote this code, which reads the header line, but it doesn't form a list with the individual values from the header:
read_file = (sc.textFile('file:///file1.csv')
             .zipWithIndex()
             .filter(lambda (line, rownum): rownum == 0)
             .map(lambda (line, rownum): line))
data = (read_file
        .map(lambda line: line.split(","))
        .filter(lambda line: len(line) >= 1))

Instead of a flat list of header values, data.collect() gives me a nested list:

[[u'header1', u'header2', u'header3', u'header4', u'header5']]
Easy enough to fix your specific problem: use flatMap instead of map:

data = read_file.flatMap(lambda l: l.split(","))

Taking the first element of the result, as in data.collect()[0], is also a solution.
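The map/flatMap difference can be illustrated without Spark at all; this is a minimal sketch using plain Python list comprehensions, with a hypothetical one-line header as input:

```python
# A single header line, as an RDD of one element would contain it.
lines = ["header1,header2,header3"]

# map: exactly one output element per input element,
# so splitting produces a list of lists.
mapped = [line.split(",") for line in lines]
print(mapped)        # [['header1', 'header2', 'header3']]

# flatMap: each input element may expand to several output
# elements, which are concatenated into one flat sequence.
flat_mapped = [field for line in lines for field in line.split(",")]
print(flat_mapped)   # ['header1', 'header2', 'header3']
```

Spark's RDD.map and RDD.flatMap follow the same semantics, just distributed across partitions.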
However, the way you are currently doing it, you iterate over the whole file just to discard every line but the first one. I would recommend using .take(1) on the RDD instead:
first_line = sc.textFile('test.csv').take(1)[0]
header = first_line.split(",")
This second solution is a lot faster on long files.
Also note that your filter function does not currently serve any purpose: str.split always returns at least one element, so len(line) >= 1 is always true and you could just leave the filter out.
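A quick check in plain Python (no Spark needed) confirms that split never returns an empty list, even on an empty string, so the filter can never drop anything:

```python
# Splitting a normal line gives one element per field.
print("a,b,c".split(","))  # ['a', 'b', 'c']

# Even an empty string splits into a one-element list,
# so len(result) >= 1 always holds.
print("".split(","))       # ['']
```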