Question

Creating aggregate metrics from JSON logs in Apache Spark

I am getting started with Apache Spark.
I have a requirement to convert a JSON log into a flattened set of metrics, which could equally be represented as a simple CSV row.

For example:

"orderId":1,
"orderData": {
"customerId": 123,
"orders": [
{
"itemCount": 2,
"items": [
{
"quantity": 1,
"price": 315
},
{
"quantity": 2,
"price": 300
},

]
}
]
}


This can be considered a single JSON log. I want to convert it into the following, where totalValue is the sum of quantity × price over all items (1 × 315 + 2 × 300 = 915) and units is the total quantity (1 + 2 = 3):

orderId,customerId,totalValue,units
1,123,915,3


I was going through the Spark SQL documentation and can use it to get hold of individual values, e.g. "select orderId, orderData.customerId from Order", but I am not sure how to compute the sum of all the item prices and the total number of units.
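For reference, this is roughly how far I have got (a minimal sketch; the file name orders.json and the table name Order are just placeholders for my setup):

>>> df = sqlContext.read.json("orders.json")   # placeholder path to the JSON logs
>>> df.registerTempTable("Order")
>>> sqlContext.sql("SELECT orderId, orderData.customerId FROM Order").show()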

What is the best practice for doing this with Apache Spark?

Answer

Try:

>>> import json
>>> from pyspark.sql.functions import col, explode, sum  # this sum is the column aggregate, shadowing the built-in
>>> doc = {"orderData": {"orders": [{"items": [{"quantity": 1, "price": 315}, {"quantity": 2, "price": 300}], "itemCount": 2}], "customerId": 123}, "orderId": 1}
>>> df = sqlContext.read.json(sc.parallelize([json.dumps(doc)]))  # one JSON string per record
>>> metrics = df.select("orderId", "orderData.customerId", explode("orderData.orders").alias("order")) \
...     .withColumn("item", explode("order.items")) \
...     .groupBy("orderId", "customerId") \
...     .agg(sum("item.quantity").alias("units"),
...          sum(col("item.quantity") * col("item.price")).alias("totalValue"))
>>> metrics.show()
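If you also want the flattened metrics persisted as CSV, as the question suggests, something along these lines should work. This is just a sketch: it assumes the spark-csv package (com.databricks:spark-csv) is on the classpath, and order_metrics is a placeholder output directory. On Spark 2.x you can use the built-in CSV writer (df.write.csv(...)) instead.

>>> metrics.write \
...     .format("com.databricks.spark.csv") \
...     .option("header", "true") \
...     .save("order_metrics")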