Justin Nafe Justin Nafe - 1 month ago 14
Java Question

How do I create a schema for Hive to parse deeply nested json (Azure Application Insights output) using SerDe?

I'm trying to create a schema for hive to parse json, however, I am having trouble creating the schema when the json doc is in the following structure:

{
"context": {
"custom": {
"dimensions": [{
"action": "GetFilters"
},
{
"userId": "12345678"
}]
}
}
}


I am using the Hadoop emulator for Azure's HDInsights on windows (8.1) and am using java (1.8.0_73). I compiled the SerDe successfully with Maven. I would think that the following would work:

add jar ../lib/json-serde-1.1.9.9-Hive1.2-jar-with-dependencies.jar;

DROP TABLE events;
CREATE EXTERNAL TABLE events (
context STRUCT<custom:STRUCT<dimensions:array<STRUCT<action:string>,STRUCT<userId:string>>>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/json/event';


When I take out the nested ARRAY>, then the schema parses ok, but with it in, I get the following exception:


MismatchedTokenException(282!=9)
at org.antlr.runtime.BaseRecognizer.recoverFromMismatchedToken(BaseRecog
nizer.java:617)
at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HivePa
rser.java:34909)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(Hi
veParser.java:33113)
at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java
:36331)
at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:35334
)
at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:35
054)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonType(HivePa
rser.java:34914)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameColonTypeList(Hi
veParser.java:33085)
at org.apache.hadoop.hive.ql.parse.HiveParser.structType(HiveParser.java
:36331)
at org.apache.hadoop.hive.ql.parse.HiveParser.type(HiveParser.java:35334
)
at org.apache.hadoop.hive.ql.parse.HiveParser.colType(HiveParser.java:35
054)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameType(HiveParser.
java:34754)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnNameTypeList(HivePar
ser.java:32951)
at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveP
arser.java:4544)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.ja
va:2144)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.j
ava:1398)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:
1036)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:19
9)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:16
6)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:409)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:2
68)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793
)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
sorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 2:69 missing > at ',' near 'STRUCT' in column specif
ication
line 2:76 mismatched input '<' expecting : near 'STRUCT' in column specification

hive>


Answer

I ended up getting it to work by removing the nested STRUCTs in the ARRAY STRUCT, but I have to access the values with [#]. For example, the following builds the schema:

DROP TABLE events;
CREATE EXTERNAL TABLE events (
  context STRUCT<custom:STRUCT<dimensions:ARRAY<STRUCT<action:string,userId:string>>>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/json/event';

Then I can access items such as the userId like so:

SELECT context.custom.dimensions.userId[1] FROM events;

It works, but is not as readable as I would prefer.