user2966197 - 4 months ago
Python Question

Form a list of strings from the header of a CSV file in PySpark

I am trying to form a list of strings from the header of a CSV file in PySpark. The header values in the file are unicode strings. I wrote the code below, which reads the header, but it produces a nested list instead of a flat list of the individual header values:

def filter(line):
    return line

read_file = sc.textFile('file:///file1.csv').zipWithIndex().filter(lambda (line, rownum): rownum == 0).map(lambda (line, rownum): line)


data = (read_file
        .map(lambda line: line.split(","))
        .filter(lambda line: len(line) >= 1)
        .map(filter))

print data.collect()


The output I see looks like this:

[[u'header1', u'header2', u'header3', u'header4', u'header5']]


whereas I want it to be:
['header1','header2','header3','header4','header5']


How can I correct it and form the list?

H2O
Answer

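Your specific problem is easy to fix: use flatMap instead of map when splitting the line, so that the split values are flattened into the RDD instead of being kept as one nested list.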

data = read_file.flatMap(lambda l: l.split(","))

Taking element [0] of the collected result, as in data.collect()[0], is also a solution.
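To make the difference between map and flatMap concrete, here is a minimal plain-Python sketch that needs no Spark installation. The list `rows` is a hypothetical stand-in for the one-element RDD that holds the header line; the list comprehension plays the role of map, and `itertools.chain.from_iterable` plays the role of flatMap.

```python
from itertools import chain

# Hypothetical stand-in for the one-element RDD holding only the header line.
rows = ["header1,header2,header3,header4,header5"]

# map-like: one output element per input element, so the result is nested.
mapped = [line.split(",") for line in rows]

# flatMap-like: each split result is flattened into the output sequence.
flat_mapped = list(chain.from_iterable(line.split(",") for line in rows))

print(mapped)       # nested: a list containing one list of header names
print(flat_mapped)  # flat: the header names themselves

# Taking [0] of the nested result gives the same flat list.
print(mapped[0] == flat_mapped)
```

This also shows why indexing the nested result with [0] is equivalent here: there is exactly one inner list, and it holds the header values.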

However, the way you are currently doing it, you iterate over the whole file only to discard every line but the first one. I would recommend using .take(1) on the RDD instead.

first_line = sc.textFile('test.csv').take(1)
first_line[0].split(",")

This second solution is a lot faster on long files, because take(1) stops reading once it has one line instead of scanning the whole file.
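The cost difference between the two approaches above can be sketched with a plain-Python analogy. This is not how Spark executes jobs internally (partitions and scheduling are ignored); the helper `count_reads` and the fake `source` data are hypothetical and only serve to count how many lines each strategy touches.

```python
from itertools import islice

def count_reads(lines, counter):
    """Yield lines one by one, recording how many were consumed."""
    for line in lines:
        counter["read"] += 1
        yield line

# Hypothetical file contents: one header line followed by 999 data lines.
source = ["h1,h2,h3"] + ["1,2,3"] * 999

# Index-and-filter style: every line is visited to find rownum == 0.
full_scan = {"read": 0}
header_a = [line
            for rownum, line in enumerate(count_reads(source, full_scan))
            if rownum == 0]

# take(1) style: stop as soon as one line has been produced.
early_stop = {"read": 0}
header_b = list(islice(count_reads(source, early_stop), 1))

print(full_scan["read"], early_stop["read"])  # 1000 1
print(header_a == header_b)                   # True
```

Both strategies return the same header line, but the first one touches all 1000 lines while the second touches only one.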

Also note that your filter function does not currently serve any purpose: it returns its argument unchanged, so you can simply leave out .map(filter).
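One caveat about the split(",") calls above: they assume that no header value contains a comma inside quotes. If that might happen in your file, a possible alternative is to parse the header line with Python's standard csv module. The sketch below uses a made-up `header_line` in place of first_line[0].

```python
import csv

# Hypothetical header line, standing in for first_line[0] after take(1).
header_line = 'id,"last, first",email'

# Naive split breaks the quoted field into two pieces.
naive = header_line.split(",")

# csv.reader accepts any iterable of strings and respects the quoting.
parsed = next(csv.reader([header_line]))

print(naive)   # ['id', '"last', ' first"', 'email']
print(parsed)  # ['id', 'last, first', 'email']
```

For headers without quoted commas, both approaches give the same list, so plain split(",") remains fine for the file in the question.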