asdvbnm asdvbnm - 1 month ago 7
JSON Question

JSON - compare a field across many records and resolve type discrepancies

In one record (JSON object), we have

{
"name": "ID",
"value": "4260567,4260556"
}


In another record, we have

{
"name": "ID",
"value": {}
}


My issue is that this dataset cannot be queried in Apache Drill because of the different types being used for one field - strings in one case and objects in others. I also can't replace {} with "" across the board, because there are other fields that are actually supposed to be objects that this would cause the same issue with.

My idea is to write some code to load in a batch of data, and go through field by field crosschecking records and correct these type mismatches. For example, if it sees that most of the entries in this ID field are strings and only some are empty objects, it'll convert those to empty strings. If it sees that some values are arrays/lists and other values in the same field are single objects, it'll convert those single objects to arrays/lists. Things like that.

However that's a pretty big undertaking for just one dataset. Are there any other ways for me to resolve this?

Answer

There is an experimental UNION setting you can try, exec.enable_union_type:

ALTER SESSION SET `exec.enable_union_type` = true;

Details

I'm using Drill 1.6. I used the following test data:

[
    {
        "name": "foo",
        "value": "4260567,4260556"
    },
    {
        "name": "bar",
        "value": {}
    },
    {
        "name": "baz",
        "value": ["one", "two", "three"]
    }
]

And ran a simple query

SELECT * FROM dfs.`/tmp/drill-sample.json`;

Which results in this error

Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you are using a ValueWriter of type NullableVarCharWriterImpl.

File /tmp/drill-sample.json Record 2 Fragment 0:0

Fix

ALTER SESSION SET `exec.enable_union_type` = true;

Now the JSON parsing works

0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/drill-sample.json`;
+-------+------------------------+
| name  |         value          |
+-------+------------------------+
| foo   | 4260567,4260556        |
| bar   | {}                     |
| baz   | ["one","two","three"]  |
+-------+------------------------+
3 rows selected (1.106 seconds)