a.moussa - 3 months ago
Python Question

Convert RDD of LabeledPoint to DataFrame toDF() Error

I have a DataFrame df in which each row is a string of 13 comma-separated values. I want to build df2, a DataFrame of LabeledPoint objects: the first value is the label and the other twelve are the features. I use split and select to turn each 13-value string into an array of 13 values, then map to create the LabeledPoint. The error occurs when I call toDF() to convert the RDD to a DataFrame:

df2 = df.select(split(df[0], ',')).map(lambda x: LabeledPoint(float(x[0]),x[-12:])).toDF()


org.apache.spark.SparkException: Job aborted due to stage failure:

When I look at the stack trace, I find:
IndexError: tuple index out of range.

To test, I ran:

display(df.select(split(df[0], ',')))


I get my 13 values in an array for each row:

["2001.0","0.884123733793","0.610454259079","0.600498416968","0.474669212493","0.247232680947","0.357306088914","0.344136412234","0.339641227335","0.600858840135","0.425704689024","0.60491501652","0.419193351817"]


Any idea?

Answer

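The error comes from the indexing. Each row x has a single column that holds the whole array, so x[0] is the array itself, not its first value. Replace x[0] with x[0][0] and x[-12:] with x[0][-12:]: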

df2 = df.select(split(df[0], ',')).map(lambda x: LabeledPoint(float(x[0][0]), x[0][-12:])).toDF()
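To see why the extra index is needed, here is a minimal plain-Python sketch that runs without Spark. It assumes each row behaves like a one-element tuple whose only field is the array produced by split (PySpark Row objects are tuple subclasses). The helper name to_label_and_features is hypothetical and not part of Spark; it only illustrates the indexing.

```python
# Plain-Python stand-in for one row of df.select(split(df[0], ',')).
# Assumption: the row has a single column holding the array of 13 strings.
values = [
    "2001.0", "0.884123733793", "0.610454259079", "0.600498416968",
    "0.474669212493", "0.247232680947", "0.357306088914", "0.344136412234",
    "0.339641227335", "0.600858840135", "0.425704689024", "0.60491501652",
    "0.419193351817",
]
row = (values,)  # one column, so the row has length 1


def to_label_and_features(x):
    """Hypothetical helper: split a one-column row into (label, features)."""
    arr = x[0]                                # the array stored in the only column
    label = float(arr[0])                     # first value is the label
    features = [float(v) for v in arr[-12:]]  # last twelve values are the features
    return label, features


label, features = to_label_and_features(row)

print(len(row))          # number of columns in the row
print(type(row[0]))      # row[0] is the whole array, not a single value
print(label, len(features))
```

Note that row[-12:] on such a row does not raise an error: slicing a one-element tuple just returns that one element, so the original lambda was passing the whole array wrapped in a tuple rather than twelve features. In the Spark job, the same indexing would go inside the lambda, with LabeledPoint(label, features) as the return value.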