I am trying to write an RDD whose structure looks like
(int, ListofList, ListofListofList)
Something like this:
(49807360, [[111206019,'ABC','XYZ:RDC' , 'RDC' , 123] , [111206019,'ABC','XYZ:RDC' , 'RDC' , 123]] , [[[111206019,'ABC','XYZ:RDC' , 'RDC' , 123] , 111206019,'ABC','XYZ:RDC' , 'RDC' , 123]] , [[111206019,'ABC','XYZ:RDC' , 'RDC' , 123],[111206019,'ABC','XYZ:RDC' , 'RDC' , 123]])
But when I write it out as JSON, the string values come out as null:
{"user":49807360,"history":[[111206019,null,null,null,123], [111206019,null,null,null,123]],"collection":...}
This is the code I use to write it:
rdd.toDF().write.json(ouput_file_path,"overwrite","gzip")
This happens because you use a DataFrame as an intermediate step. Spark SQL doesn't support heterogeneous arrays, so values which don't match the inferred type (array<bigint>) are replaced by NULL.
If you really want to go this way and support heterogeneous structures, you should either use tuples, which are mapped to Spark SQL structs, or not depend on schema inference and provide the desired schema explicitly:
schema = ... # type: StructType
spark.createDataFrame(rdd, schema)
with schema (JSON representation) similar to:
{'fields': [{'metadata': {}, 'name': '_1', 'nullable': True, 'type': 'long'},
            {'metadata': {},
             'name': '_2',
             'nullable': True,
             'type': {'containsNull': True,
                      'elementType': {'fields': [{'metadata': {},
                                                  'name': '_1',
                                                  'nullable': True,
                                                  'type': 'long'},
                                                 {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
                                                 {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
                                                 {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
                                                 {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
                                      'type': 'struct'},
                      'type': 'array'}},
            {'metadata': {},
             'name': '_3',
             'nullable': True,
             'type': {'fields': [{'metadata': {},
                                  'name': '_1',
                                  'nullable': True,
                                  'type': {'fields': [{'metadata': {},
                                                       'name': '_1',
                                                       'nullable': True,
                                                       'type': 'long'},
                                                      {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
                                                      {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
                                                      {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
                                                      {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
                                           'type': 'struct'}},
                                 {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'long'},
                                 {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
                                 {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
                                 {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'string'},
                                 {'metadata': {}, 'name': '_6', 'nullable': True, 'type': 'long'}],
                      'type': 'struct'}},
            {'metadata': {},
             'name': '_4',
             'nullable': True,
             'type': {'containsNull': True,
                      'elementType': {'fields': [{'metadata': {},
                                                  'name': '_1',
                                                  'nullable': True,
                                                  'type': 'long'},
                                                 {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
                                                 {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
                                                 {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
                                                 {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
                                      'type': 'struct'},
                      'type': 'array'}}],
 'type': 'struct'}
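Writing that schema out by hand is error-prone, so here is a rough sketch of both options in plain Python. The helper names (field, struct, array, event, to_struct_friendly) are made up for illustration and are not part of PySpark. The first part builds the same JSON representation from small pieces; the second converts the mixed-type inner lists to tuples so that schema inference has a chance to see structs instead of arrays. The rule used for the conversion (a list of lists stays an array, anything else becomes a tuple) is an assumption about your data, so adjust it if your records are shaped differently.

```python
def field(name, dtype):
    """One struct field in the JSON form shown above (hypothetical helper)."""
    return {"metadata": {}, "name": name, "nullable": True, "type": dtype}


def struct(*types):
    """A struct whose fields are named _1, _2, ... in order."""
    return {"type": "struct",
            "fields": [field("_%d" % i, t) for i, t in enumerate(types, 1)]}


def array(element_type):
    """An array type that allows null elements."""
    return {"type": "array", "elementType": element_type, "containsNull": True}


# The five-value record that repeats throughout the sample data.
event = struct("long", "string", "string", "string", "long")

# Same shape as the JSON representation above.
schema_json = struct(
    "long",
    array(event),
    struct(event, "long", "string", "string", "string", "long"),
    array(event),
)


def to_struct_friendly(value):
    """Turn mixed-type lists into tuples; keep lists of rows as lists.

    Assumption: a list whose elements are all lists/tuples is a collection
    of rows (array); any other list is a single heterogeneous row (struct).
    """
    if isinstance(value, (list, tuple)):
        converted = [to_struct_friendly(v) for v in value]
        if all(isinstance(v, (list, tuple)) for v in value):
            return converted
        return tuple(converted)
    return value


# With PySpark available, usage would look roughly like this (untested here):
#   from pyspark.sql.types import StructType
#   df = spark.createDataFrame(rdd, StructType.fromJson(schema_json))
# or, relying on inference with tuples instead of an explicit schema:
#   df = rdd.map(to_struct_friendly).toDF()
```

Either way, check df.printSchema() before writing, since that shows immediately whether the inner records ended up as structs or as arrays.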