PuRaK PuRaK - 21 days ago 4
JSON Question

Parsing complex nested JSON in Pig

I want to parse a Billionaires JSON dataset into Pig.The JSON file can be found here.

Here is what each entry has:

{
"wealth": {
"worth in billions": 1.2,
"how": {
"category": "Resource Related",
"from emerging": true,
"industry": "Mining and metals",
"was political": false,
"inherited": true,
"was founder": true
},
"type": "privatized and resources"
},
"company": {
"sector": "aluminum",
"founded": 1993,
"type": "privatization",
"name": "Guangdong Dongyangguang Aluminum",
"relationship": "owner"
},
"rank": 1372,
"location": {
"gdp": 0.0,
"region": "East Asia",
"citizenship": "China",
"country code": "CHN"
},
"year": 2014,
"demographics": {
"gender": "male",
"age": 50
},
"name": "Zhang Zhongneng"
}


Attempt 1

I tried loading this data using the following command in grunt :


billionaires = LOAD 'billionaires.json' USING JsonLoader('wealth:
(worth in billions:double, how: (category:chararray, from
emerging:chararray, industry:chararray, was political:chararray,
inherited:chararray, was founder:chararray), type:chararray), company:
(sector:chararray,founded:int,type:chararray,name:chararray,relationship:chararray),rank:int,location:(gdp:double,region:chararray,citizenship:chararray,country
code:chararray), year:int, demographics: (gender:chararray,age:int),
name:chararray');


This however gives me the error:


ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: mismatched input 'in' expecting RIGHT_PAREN


Attempt 2

Next I tried using Twitter's elephantbird project's loader called
com.twitter.elephantbird.pig.load.JsonLoader
. Here is the code for this UDF. This is what I did:

billionaires = LOAD 'billionaires.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
names = foreach billionaires generate json#'name' AS name;
dump names;


Now this runs and I get no errors! But nothing gets displayed. I get an output like:


Input(s): Successfully read 0 records (1445335 bytes) from:
"hdfs://localhost:9000/user/purak/billionaires.json"

Output(s): Successfully stored 0 records in:
"hdfs://localhost:9000/tmp/temp-1399280624/tmp-477607570"

Counters: Total records written : 0 Total bytes written : 0 Spillable
Memory Manager spill count : 0 Total bags proactively spilled: 0 Total
records proactively spilled: 0

Job DAG: job_1478889184960_0005


What am I doing wrong here?

Answer

This is probably not the best way of doing this but this is what I end up doing :

  1. Remove spaces from the field names : I replaced fields like "worth in billions", "from emerging" etc with "worth_in_billions", "from_emerging" etc in the json dataset.

  2. Comma delimited json to newline delimited json : The json file that I had was of the form [{"_comment":"first entry" ...},{"_comment":"second entry" ...}]. But JsonLoader in Pig takes each newline to be a new entry. To make the json file newline delimited instead of comma I used js which is a command-line JSON processor. Install it using sudo apt-get install js and run cat billionaires.json | jq -c ".[]" > newBillionaires.json.

  3. The newBillionaires.json file now has each entry on new line. Now load this file into Pig using:

    copyFromLocal /home/purak/Desktop/newBillionaires.json /user/purak

billionaires = LOAD 'newBillionaires.json' USING JsonLoader('name:chararray, demographics: (age:int,gender:chararray),year:int,location:(country_code:chararray,citizenship:chararray,region:chararray,gdp:double),rank:int,company: (relationship:chararray,name:chararray,type:chararray,founded:int,sector:chararray), wealth:(type:chararray,how:(was_founder:chararray,inherited:chararray,was_political:chararray,industry:chararray, from_emerging:chararray,category:chararray),worth_in_biilions:double)');

Note: Using js reversed the order of fields in each entry. Hence in the load command all fields are in a reversed order compared to the load command in the question.

  1. You can now unnest each tuples using :

billionairesFinal = foreach billionaires generate name, demographics.age as age, demographics.gender as gender, year, location.country_code as countryCode, location.citizenship as citizenship, location.region as region, location.gdp as gdp, rank, company.relationship as companyRelationship, company.name as companyName, company.type as companyType, company.founded as companyFounded, company.sector as companySector, wealth.type as wealthType, wealth.how.was_founder as wasFounder, wealth.how.inherited as inherited, wealth.how.was_political as wasPolitical, wealth.how.industry as industry, wealth.how.from_emerging as fromEmerging, wealth.how.category as category, wealth.worth_in_biilions as worthInBillions;

  1. Check the structure once using describe billionairesFinal; :

billionairesFinal: {name: chararray,age: int,gender: chararray,year: int,countryCode: chararray,citizenship: chararray,region: chararray,gdp: double,rank: int,companyRelationship: chararray,companyName: chararray,companyType: chararray,companyFounded: int,companySector: chararray,wealthType: chararray,wasFounder: chararray,inherited: chararray,wasPolitical: chararray,industry: chararray,fromEmerging: chararray,category: chararray,worthInBillions: double}

This was the intended structure of data that I wanted in Pig! Now I can go on and analyse the dataset :)