
Map for each value of an array in a Spark Row

I have a JSON data set in the following format, with one entry per line:

{ "sales_person_name" : "John", "products" : ["apple", "mango", "guava"]}
{ "sales_person_name" : "Tom", "products" : ["mango", "orange"]}
{ "sales_person_name" : "John", "products" : ["apple", "banana"]}
{ "sales_person_name" : "Steve", "products" : ["apple", "mango"]}
{ "sales_person_name" : "Tom", "products" : ["mango", "guava"]}


I want to know who sold the most mangoes, and so on.
To do this, I want to load the file into a DataFrame and emit a (key, value) pair of (product, name) for each product in the array of each transaction.

var df = spark.read.json("s3n://sales-data.json")
df.printSchema()
root
 |-- sales_person_name: string (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: string (containsNull = true)

df.select("sales_person_name", "products").show()
+-----------------+--------------------+
|sales_person_name|            products|
+-----------------+--------------------+
|             John|[apple, mango, gu...|
|              Tom|     [mango, orange]|
|             John|     [apple, banana]|

df.select("products", "sales_person_name")
  .map(r => (r(1), r(0)))
  .show() // This is where I am stuck.


I am not able to figure out the right way to explode() the products array in r(0) so that each of its values is emitted once alongside the r(1) value. Can anyone suggest a way? Thanks!

Answer
import scala.collection.mutable

val exploded = df.explode("products", "product") { a: mutable.WrappedArray[String] => a }
val result = exploded.drop("products")
result.show()

prints:

+-----------------+-------+
|sales_person_name|product|
+-----------------+-------+
|             John|  apple|
|             John|  mango|
|             John|  guava|
|              Tom|  mango|
|              Tom| orange|
|             John|  apple|
|             John| banana|
|            Steve|  apple|
|            Steve|  mango|
|              Tom|  mango|
|              Tom|  guava|
+-----------------+-------+
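
Note that DataFrame.explode is deprecated since Spark 2.0. Below is a minimal alternative sketch, assuming Spark 2.x, that uses the explode column function from org.apache.spark.sql.functions instead, followed by a groupBy to answer the original "who sold the most mangoes" question (the exploded2 and counts names are just for illustration):

import org.apache.spark.sql.functions.{explode, desc}

// One output row per (sales_person_name, product) pair.
val exploded2 = df.select(
  df("sales_person_name"),
  explode(df("products")).as("product")
)

// Count sales per (product, seller) and sort descending,
// so the top seller of each product comes first.
val counts = exploded2
  .groupBy("product", "sales_person_name")
  .count()
  .orderBy(desc("count"))

counts.show()

Filtering on product = "mango" before the count then gives the mango-specific ranking directly.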