
Reading JSON with Apache Spark - `corrupt_record`

I have a `json` file, `nodes`, that looks like this:

[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]


I am able to read and manipulate this record with Python.

I am trying to read this file in `scala` through the `spark-shell`.

From this tutorial, I can see that it is possible to read `json` via `sqlContext.read.json`:

val vfile = sqlContext.read.json("path/to/file/nodes.json")


However, this results in a `corrupt_record` error:

vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]


Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident it is not corrupt but sound `json`.

Answer

Spark cannot read a top-level JSON array into records, so you have to pass:

{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1} 
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2} 
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3} 
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}

As it's described in the tutorial you're referring to:

Let's begin by loading a JSON file, where each line is a JSON object
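With the file reformatted this way (saved, say, as a hypothetical nodes.jsonl with one object per line), the same read call should give you a proper schema instead of _corrupt_record. A minimal sketch:

// Hypothetical path; the file now contains one JSON object per line
val vfile = sqlContext.read.json("path/to/file/nodes.jsonl")

vfile.printSchema()
// root
//  |-- index: long (nullable = true)
//  |-- point: array (nullable = true)
//  |    |-- element: double (containsNull = true)
//  |-- toid: string (nullable = true)

vfile.show()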

The reasoning is quite simple. Spark expects the file to contain many JSON entities, one per line, so that it can distribute their processing (roughly speaking, per entity). That's why it expects to parse an entity at the top level but gets an array, which cannot be mapped to a record because there is no name for such a column. Basically (but not precisely), Spark sees your array as one row with one column and fails to find a name for that column.
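You can see this yourself by inspecting the fallback column. A minimal sketch, assuming the original array-style nodes.json from the question:

val bad = sqlContext.read.json("path/to/file/nodes.json")

bad.printSchema()
// root
//  |-- _corrupt_record: string (nullable = true)

// The raw lines Spark could not parse end up in _corrupt_record
bad.select("_corrupt_record").show(false)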

To shed more light on it, here is a quote from the official docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

This format is often called JSONL (JSON Lines). Basically it's an alternative to CSV.
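As a side note (a sketch only, and it assumes Spark 2.2 or later, where `spark` is the SparkSession provided by the spark-shell), newer versions can also parse the original array-style file directly via the multiLine option:

// multiLine tells Spark to parse the whole file as one JSON document
// (the top-level array) instead of expecting one object per line.
val vfile = spark.read
  .option("multiLine", true)
  .json("path/to/file/nodes.json")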
