Scala Question

Spark 1.6 Scala: create data rows

I have the following code.

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val baseDF = sqlContext.read.json(fileFullPath)


My JSON has two fields of interest: productid and quantity. Here is what the data looks like:

{
  "sales": {
    "saledate": "17Mar2008",
    "sale": [{
      "productid": 1,
      "quantity": 10
    }, {
      "productid": 2,
      "quantity": 1
    }, {
      "productid": 3,
      "quantity": 3
    }, {
      "productid": 4,
      "quantity": 5
    }]
  }
}


I want to transform this into a Spark RDD or DataFrame with two columns, productid and quantity, but with one row per unit of quantity.

In the above example, product 1 has 10 rows, product 2 has 1 row, product 3 has 3 rows, and product 4 has 5 rows, for a total of 19 rows, i.e. # rows = sum(quantity).

Any help appreciated. I am using Spark 1.6.2 and Scala.

Answer

This should do the trick:

import org.apache.spark.sql.functions._

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

val baseDF = sqlContext.read.json(fileFullPath)

// Turn a quantity n into a list containing n copies of n; exploding that list yields one row per unit sold.
val listFromQuantity = udf { quantity: Int => List.fill(quantity)(quantity) }

baseDF
  .select(explode($"sales.sale").as("sale"))                    // one row per element of the nested "sale" array
  .select(
    $"sale.productId".as("productId"),
    explode(listFromQuantity($"sale.quantity")).as("quantity")  // duplicate each row `quantity` times
  )
  .show()

Which returns:

+---------+--------+
|productId|quantity|
+---------+--------+
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        2|       1|
|        3|       3|
|        3|       3|
|        3|       3|
|        4|       5|
|        4|       5|
|        4|       5|
|        4|       5|
|        4|       5|
+---------+--------+

If you'd like to have a single unit quantity in the second column (i.e. the value 1 instead of 5), replace List.fill(quantity)(quantity) with List.fill(quantity)(1).
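
Since the question mentions that either an RDD or a DataFrame is fine, here is a minimal flatMap-based sketch of the same idea at the RDD level, under the same assumptions as above (same baseDF, same schema, one row emitted per unit sold):

import org.apache.spark.sql.functions.explode

// Flatten the nested "sale" array into one row per sale entry.
val salesDF = baseDF
  .select(explode($"sales.sale").as("sale"))
  .select($"sale.productid", $"sale.quantity")

// Emit one (productid, 1) pair per unit sold; Spark's JSON reader infers integers as LongType.
val expandedRDD = salesDF.rdd.flatMap { row =>
  val productId = row.getLong(0)
  val quantity = row.getLong(1).toInt
  Seq.fill(quantity)((productId, 1))
}

val expandedDF = expandedRDD.toDF("productid", "quantity")
expandedDF.show()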