a.moussa - 1 year ago
Python Question

Convert RDD of LabeledPoint to DataFrame toDF() Error

I have a DataFrame df in which each row contains 13 comma-separated values in a single string. I want to get a DataFrame df2 that contains LabeledPoint objects: the first value is the label and the other twelve are the features. I use split and select to turn the 13-value string into an array of 13 values, then map to create the LabeledPoint. The error occurs when I call toDF() to convert the RDD to a DataFrame:

df2 = df.select(split(df[0], ',')).map(lambda x: LabeledPoint(float(x[0]),x[-12:])).toDF()

org.apache.spark.SparkException: Job aborted due to stage failure:

When I look at the stack trace, I find:
IndexError: tuple index out of range

To test, I executed:

display(df.select(split(df[0], ',')))

I get my 13 values in an array for each row.


Any ideas?

Answer Source

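The error comes from the index: x[0] should be replaced by x[0][0]. Each row returned by select has a single column that holds the split array, so x[0] is the whole array and x[0][0] is its first value. So: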

df2 = df.select(split(df[0], ',')).map(lambda x: LabeledPoint(float(x[0][0]), x[0][-12:])).toDF()
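
To see why the extra index is needed, here is a minimal sketch in plain Python that runs without Spark. A plain tuple stands in for a pyspark Row (Row is a tuple subclass), and the sample values and the parse_row helper are made up for illustration; they are not part of the original question.

```python
# A plain tuple stands in for a pyspark Row here (Row is a tuple subclass).
# The 13 values below are hypothetical sample data.
values = "1.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2"

# select(split(...)) yields rows with ONE column that holds the whole array.
row = (values.split(","),)


def parse_row(row):
    """Return (label, features) from a one-column row holding the split array."""
    parts = row[0]                              # the array of 13 strings
    label = float(parts[0])                     # first value is the label
    features = [float(v) for v in parts[-12:]]  # the other twelve values
    return label, features


label, features = parse_row(row)
print(len(row), label, len(features))
```

The row itself has length 1, so indexing it directly only ever reaches the array, never the individual values. In the Spark version, the same label and features would be passed to LabeledPoint inside the map.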