user2966197 - 6 months ago
Python Question

Form a list of strings from the header of a CSV file in PySpark

I am trying to form a list of strings from the header of a CSV file in PySpark. The header in the CSV file is in Unicode format. I wrote this code, which reads the header, but it doesn't form a list of the individual values from the header:

def filter(line):
    return line

read_file = sc.textFile('file:///file1.csv').zipWithIndex().filter(lambda (line, rownum): rownum == 0).map(lambda (line, rownum): line)

data = (read_file
        .map(lambda line: line.split(","))
        .filter(lambda line: len(line) >= 1)
        .map(filter))
print data.collect()

The output I see looks like this:

[[u'header1', u'header2', u'header3', u'header4', u'header5']]

while I want it to be a flat list:

[u'header1', u'header2', u'header3', u'header4', u'header5']

How can I correct it and form the list?


Your specific problem is easy to fix: use flatMap instead of map, so that the list returned by split is flattened into individual elements:

data = read_file.flatMap(lambda l: l.split(","))

Taking the first element of the collected result, as in data.collect()[0], is also a solution.
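To illustrate the difference between the two operations, here is a minimal plain-Python sketch that mimics map and flatMap on a local list. It does not use Spark at all; header_lines is a made-up stand-in for the one-line RDD from the question.

```python
from itertools import chain

# Hypothetical stand-in for the RDD that holds only the header line.
header_lines = ["header1,header2,header3"]

# map: one output element per input element, so the split list stays nested.
mapped = [line.split(",") for line in header_lines]

# flatMap: each list returned by split is unpacked into the output.
flat_mapped = list(chain.from_iterable(line.split(",") for line in header_lines))

print(mapped)
print(flat_mapped)
```

With a single input line, mapped[0] and flat_mapped hold the same values, which is why indexing the collected result with [0] also gives the flat list.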

However, the way you are currently doing it, you iterate over the whole file only to discard every line but the first. I would recommend using .take(1) on the RDD instead:

first_line = sc.textFile('test.csv').take(1)

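This second solution is a lot faster on long files, because take(1) only reads as much of the file as it needs to return one line. Note that take(1) returns a one-element list, so the header line itself is first_line[0].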
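Putting the pieces together, a rough sketch of reading just the header into a flat list could look like the following. The helper names split_header and read_header are made up for illustration, sc is assumed to be an existing SparkContext, and the code assumes Python 3. It uses first() rather than take(1) so that the line comes back as a string, and the csv module instead of a plain split so that quoted fields containing commas stay intact.

```python
import csv


def split_header(line):
    """Split one CSV header line into a list of column names."""
    # csv.reader copes with quoted fields that contain commas,
    # which a plain line.split(",") would break apart.
    return next(csv.reader([line]))


def read_header(sc, path):
    """Return the header of the CSV file at `path` as a flat list of strings.

    `sc` is assumed to be an already-created SparkContext.
    """
    # first() returns the first line itself; take(1) would return [line].
    first_line = sc.textFile(path).first()
    return split_header(first_line)
```

Usage would then be something like header = read_header(sc, 'file:///file1.csv'), with the splitting done on the driver because only a single line is involved.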

Also note that your filter function currently serves no purpose; you could just leave out .map(filter).