genomics-geek - 6 days ago
Python Question

PySpark can't convert RDD of dicts to DataFrame. Error: can not accept object in type <class 'pyspark.sql.types.Row'>

I am currently using Spark 1.4.1 and can't convert a dict with a nested dict to a Spark DataFrame. I convert the nested dict to a Row, but it seems my schema is not accepted.

Here is the code to reproduce my error:

from pyspark.sql import Row, SQLContext, types as pst
sqlContext = SQLContext(sc)

example_dict = Row(**{"name": "Mike", "data": Row(**{"age": 10, "like": True})})

example_rdd = sc.parallelize([example_dict])

nested_fields = [pst.StructField("age", pst.IntegerType(), True),
                 pst.StructField("like", pst.BooleanType(), True)]

schema = pst.StructType([
    pst.StructField("data", pst.StructType(nested_fields), True),
    pst.StructField("name", pst.StringType(), True)
])

df = sqlContext.createDataFrame(example_rdd, schema)

TypeError: StructType(List(StructField(age,IntegerType,true),StructField(like,BooleanType,true))) can not accept object in type <class 'pyspark.sql.types.Row'>


I am not sure why I receive this error. Here are the rdd and schema objects:

>>> example_rdd.first()
Row(data=Row(age=10, like=True), name='Mike')

>>> schema
StructType(List(StructField(data,StructType(List(StructField(age,IntegerType,true),StructField(like,BooleanType,true))),true),StructField(name,StringType,true)))


I am not sure if I am missing something, but it appears that the schema matches the object. Is there a reason why Spark 1.4.1 will not accept a Row within a Row?

As a note: this is not an issue in Spark 2.0.2, but unfortunately I am on a shared resource using Spark 1.4.1, so I need to find a workaround for the time being :(. Any help would be appreciated, thanks in advance!

Answer

This happens because a Row is not an accepted input type for a StructType in Spark 1.4. The accepted types are:

pst._acceptable_types[pst.StructType]
(tuple, list)

and Spark makes a naive check:

type(obj) not in _acceptable_types[_type]

which obviously won't work for a Row object: Row is a subclass of tuple, but its exact type is neither tuple nor list, so the membership test fails. The correct condition, equivalent to what current versions do, would be:

isinstance(obj, _acceptable_types[_type])
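The difference between the two checks can be seen without Spark at all; here a collections.namedtuple stands in for pyspark.sql.Row (both are tuple subclasses), as a minimal sketch:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which is also a tuple subclass.
Row = namedtuple("Row", ["age", "like"])
obj = Row(age=10, like=True)

acceptable = (tuple, list)

# The naive Spark 1.4 check ignores inheritance, so it rejects Row:
print(type(obj) in acceptable)      # False
# isinstance accepts any tuple subclass, including Row:
print(isinstance(obj, acceptable))  # True
```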

If you want to use nested columns, you can use a plain Python tuple instead:

Row(**{"name": "Mike", "data": (10, True)})

or

((10, True), "Mike")
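If the RDD already contains Rows, one possible workaround is to flatten them into plain tuples before calling createDataFrame. The helper name to_tuple below is hypothetical, and the demo again uses namedtuples as stand-ins for Rows:

```python
from collections import namedtuple

def to_tuple(obj):
    """Recursively convert Row-like tuples into plain tuples.

    pyspark.sql.Row subclasses tuple, so a generic tuple check
    catches it; the result passes Spark 1.4's type(obj) test.
    """
    if isinstance(obj, tuple):
        return tuple(to_tuple(v) for v in obj)
    return obj

# Quick check with namedtuples standing in for Rows:
Data = namedtuple("Data", ["age", "like"])
Person = namedtuple("Person", ["data", "name"])
row = Person(data=Data(age=10, like=True), name="Mike")
print(to_tuple(row))  # ((10, True), 'Mike')
```

With something like this in place, `sqlContext.createDataFrame(example_rdd.map(to_tuple), schema)` should work on 1.4.1, assuming the schema's field order matches the Row's (Rows built from kwargs sort their fields alphabetically).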