user1411335 - 4 years ago
JSON Question

strings getting converted to null when writing JSON representation of RDD

I am trying to write an RDD which is structured like

(int , ListofList , ListofListofList)

Something like this

(49807360, [[111206019,'ABC','XYZ:RDC' , 'RDC' , 123] , [111206019,'ABC','XYZ:RDC' , 'RDC' , 123]] , [[[111206019,'ABC','XYZ:RDC' , 'RDC' , 123] , 111206019,'ABC','XYZ:RDC' , 'RDC' , 123]] , [[111206019,'ABC','XYZ:RDC' , 'RDC' , 123],[111206019,'ABC','XYZ:RDC' , 'RDC' , 123]])


When I print this in RDD form, I see the data correctly. When I use the built-in library to write it in JSON format, I get null values in place of the strings.

{"user":49807360,"history":[[111206019,null,null,null,123], [111206019,null,null,null,123]],"collection":...}


The line of code I am using to serialize the RDD to JSON is

rdd.toDF().toJSON().saveAsTextFile(ouput_file_path)

I have also tried

rdd.toDF().write.json(ouput_file_path,"overwrite","gzip")


The above code was run on Spark version 2.0.0.

Answer Source

This happens because you use a DataFrame as an intermediate step. Spark SQL doesn't support heterogeneous arrays, so each inner list is inferred as array<bigint> from its first element, and values which don't match that inferred type (here, the strings) are replaced by NULL.

If you really want to go this way and support heterogeneous structures, you should either use tuples, which are mapped to Spark SQL structs, or stop depending on schema inference and provide the desired schema explicitly:

schema = ...  # type: StructType
spark.createDataFrame(rdd, schema)

with a schema (JSON representation) similar to:

{'fields': [{'metadata': {}, 'name': '_1', 'nullable': True, 'type': 'long'},
  {'metadata': {},
   'name': '_2',
   'nullable': True,
   'type': {'containsNull': True,
    'elementType': {'fields': [{'metadata': {},
       'name': '_1',
       'nullable': True,
       'type': 'long'},
      {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
      {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
      {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
      {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
     'type': 'struct'},
    'type': 'array'}},
  {'metadata': {},
   'name': '_3',
   'nullable': True,
   'type': {'fields': [{'metadata': {},
      'name': '_1',
      'nullable': True,
      'type': {'fields': [{'metadata': {},
         'name': '_1',
         'nullable': True,
         'type': 'long'},
        {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
        {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
        {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
        {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
       'type': 'struct'}},
     {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'long'},
     {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
     {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
     {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'string'},
     {'metadata': {}, 'name': '_6', 'nullable': True, 'type': 'long'}],
    'type': 'struct'}},
  {'metadata': {},
   'name': '_4',
   'nullable': True,
   'type': {'containsNull': True,
    'elementType': {'fields': [{'metadata': {},
       'name': '_1',
       'nullable': True,
       'type': 'long'},
      {'metadata': {}, 'name': '_2', 'nullable': True, 'type': 'string'},
      {'metadata': {}, 'name': '_3', 'nullable': True, 'type': 'string'},
      {'metadata': {}, 'name': '_4', 'nullable': True, 'type': 'string'},
      {'metadata': {}, 'name': '_5', 'nullable': True, 'type': 'long'}],
     'type': 'struct'},
    'type': 'array'}}],
 'type': 'struct'}
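For the first option mentioned above (tuples instead of lists), a minimal sketch of the conversion could look like the following. The helper name lists_to_tuples is hypothetical and not part of Spark, and the rule it uses (a list whose elements have more than one Python type becomes a tuple) is only a heuristic that assumes records shaped like the ones in the question.

```python
def lists_to_tuples(value):
    """Recursively turn mixed-type lists into tuples.

    Lists whose elements all share one type are kept as lists (array-like);
    lists with mixed element types become tuples (struct-like).
    Hypothetical helper, not a Spark API.
    """
    if isinstance(value, (list, tuple)):
        items = [lists_to_tuples(v) for v in value]
        # Existing tuples stay tuples; heterogeneous lists are promoted to tuples.
        if isinstance(value, tuple) or len({type(v) for v in items}) > 1:
            return tuple(items)
        return items
    return value


# Shortened sample row modelled on the data in the question.
record = [111206019, 'ABC', 'XYZ:RDC', 'RDC', 123]
row = (49807360, [record, record])

converted = lists_to_tuples(row)
print(converted)
```

With rows in this shape, something like rdd.map(lists_to_tuples).toDF() should let schema inference see each inner record as a struct rather than an array. A record whose values all happen to share one type would still be left as a list by this heuristic, so the explicit schema remains the more predictable option.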