Sam Elamin Sam Elamin - 1 year ago 120
JSON Question

Json spark to redshift

I have a file that has multiple nested jsons, I would like topush to redshift, i figured to start off with I can just save the entire row as one column string then save

I am new to both spark and scala so excuse the ignorance please, but it seems that reading the file as json will automatically infer a schema which I cant save to redshift because the connector im using cant save the schema for nested jsons

So my question is this, how can I read in a file that has a nested json, save each row completely in a table with one column "message"

If it helps here is a sample json that I want to break down, maybe to different tables?

Assuming I have a sample json as

"OrderId": "foo",
"LegacyOrderId": 123,
"NotifiyCustomer": true,
"OrderResolutionStatus": "OkByBox",
"CustomerId": 123,
"RestaurantId": 132111,
"OrderContainer": {
"Id": "foo",
"LegacyId": 123,
"ApplicationInfo": {
"UserAgent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13F69 Safari/601.1",
"ClientIp": "",
"JeFeature": "sadsad",
"ApplicationName": "Resadasdsab",
"ApplicationVersion": "1.0.10444.0"
"Order": {
"foo": "Fasdsaasdsdse. ",
"foo": "asdasdsads",
"fooDate": "2016-07-06T10:58:40.3579409+00:00",
"InitialDueDate": "2016-07-06T12:10:00+00:00",
"InitialDueDateWithUtcOffset": "2016-07-06T13:10:00+01:00",
"DueDate": "2016-07-06T12:10:00+00:00",
"DueDateWithUtcOffset": "2016-07-06T13:10:00+01:00",
"foo": true,
"foo": true
"RestaurantInfo": {
"Id": "34421159",
"Name": "Casdadasdase",
"PhoneNumber": "0fdsfddsf2",
"AddressLines": ["5asdasddasad"],
"City": "London",
"Postcode": "SE1 3BR",
"Longitude": 0.012155,
"Latitude": 51.920357,
"DispatchMethod": "JctV3",
"SeoName": "caadssadasdas6",
"TempOffline": false,
"Offline": false
"PaymentInfo": {
"OrderId": "foo",
"PaymentLines": [{
"Type": "Cash",
"Value": 18.8,
"CardFee": 0.0
"DriverTipValue": 0.0,
"Total": 18.8,
"TotalComplementary": 0.0,
"PaidDate": "2016-07-06T10:58:51.7863535+00:00"
"CustomerInfo": {
"Id": "7604asdsa879",
"Email": "",
"Name": "Lysadsadasi",
"Address": "3sadsadsadsdsaasde",
"City": "London",
"Postcode": "E11 3sG",
"PhoneNumber": "asdsads",
"TimeZone": "GMT Standard Time",
"PreviousJeOrderCount": 22,
"PreviousRestuarantOrderCount": 0
"BasketInfo": {
"BasketId": "asdsdas",
"MenuId": 119949,
"Items": [{
"ProductId": 1392310,
"ProductTypeId": 366,
"MenuCardNumber": "96",
"Name": "Casdasdsaef",
"Description": "In batter<br />\r\n<strong>Spicy</strong>",
"Synonym": "",
"UnitPrice": 52.5,
"CombinedPrice": 52.5,
"MealParts": [],
"OptionalAccessories": [],
"RequiredAccessories": [],
"Discounts": [],
"MultiBuyDiscounts": []
}, {
"ProductId": 12233,
"ProductTypeId": 44218,
"MenuCardNumber": "418",
"Name": "Aasdsaddsadsadsadsa",
"Description": "",
"Synonym": "",
"UnitPrice": 7.5,
"CombinedPrice": 7.5,
"MealParts": [],
"OptionalAccessories": [],
"RequiredAccessories": [],
"Discounts": [],
"MultiBuyDiscounts": []
}, {
"ProductId": 13917482,
"ProductTypeId": 365,
"MenuCardNumber": "129",
"Name": "Szasdsadsaadssad",
"Description": "asdsaddsadssaddas",
"Synonym": "",
"UnitPrice": 5.8,
"CombinedPrice": 5.8,
"MealParts": [],
"OptionalAccessories": [],
"RequiredAccessories": [],
"Discounts": [],
"MultiBuyDiscounts": []
"Discounts": [],
"SubTotal": 18.8,
"ToSpend": 0.0,
"MultiBuyDiscount": 0.0,
"Discount": 0.0,
"DeliveryCharge": 0.0,
"Total": 18.8
"IsLocked": true
"Id": "c35d48c1-268f-4e46-81d9-700afc22f2d9",
"TimeStamp": "2016-07-06T10:59:22.982145Z",
"RaisingComponent": "Leasdsadasdasw",
"Tenant": "uk"

Answer Source

You're correct that Redshift requires the JSON structure to be isomorphic to the top level keys in the JSON source. There are a number of other limitations, e.g., can't load nested arrays, can't load more than 2MB per file, etc.

My suggestion (and the solution I have just implemented personally) is:

  • Load the data into a Spark dataframe"com.databricks.spark.csv")
  • Let Spark infer the schema if needed .option("inferSchema", "true")
  • Manipulate it as a sequence of dataframes val nextDf = df.[filter|explode|select|…etc]
  • Write the final flat dataframe(s) to Redshift finalDf.write.format("com.databricks.spark.redshift").option(…
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download