Asked by lazywiz - 1 year ago
Scala Question

Map for each value of an array in a Spark Row

I have a JSON data set in the following format, one entry per line.

{ "sales_person_name" : "John", "products" : ["apple", "mango", "guava"]}
{ "sales_person_name" : "Tom", "products" : ["mango", "orange"]}
{ "sales_person_name" : "John", "products" : ["apple", "banana"]}
{ "sales_person_name" : "Steve", "products" : ["apple", "mango"]}
{ "sales_person_name" : "Tom", "products" : ["mango", "guava"]}

I want to know who sold the most mangoes, and so on.
Hence I want to load the file into a DataFrame and emit a (key, value) pair of (product, name) for each value in the products array, for each transaction.

var df = sqlContext.read.json("s3n://sales-data.json")
df.printSchema()

root
 |-- sales_person_name: string (nullable = true)
 |-- products: array (nullable = true)
 |    |-- element: string (containsNull = true)

df.select("sales_person_name", "products").show()

+-----------------+------------------+
|sales_person_name|          products|
+-----------------+------------------+
|             John| [mango, apple,...|
|              Tom|[mango, orange,...|
|             John| [apple, banana...|
+-----------------+------------------+

var resultMap = df.select("products", "sales_person_name")
  .map(r => (r(1), r(0)))
  .show() // This is where I am stuck.

I cannot figure out the right way to explode() r(0) so that each of its values is emitted once, paired with the value of r(1). Can anyone suggest a way? Thanks!

Answer Source
import scala.collection.mutable

// explode() emits one output row per element of the "products" array,
// binding each element to the new "product" column; the identity function
// here just passes the array through.
val exploded = df.explode("products", "product") { a: mutable.WrappedArray[String] => a }
val result = exploded.drop("products")
result.show()
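For reference: Dataset.explode as used above is deprecated since Spark 2.0; the current idiom is the explode function from org.apache.spark.sql.functions. Below is a minimal runnable sketch (the question's five sample rows are inlined in place of the s3n:// JSON read, and the local SparkSession setup is an assumption for standalone use), ending with the "who sold the most mangoes" aggregation the question asks for:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder.master("local[1]").appName("sales").getOrCreate()
import spark.implicits._

// The question's five sample rows, inlined so the sketch runs without S3.
val df = Seq(
  ("John",  Seq("apple", "mango", "guava")),
  ("Tom",   Seq("mango", "orange")),
  ("John",  Seq("apple", "banana")),
  ("Steve", Seq("apple", "mango")),
  ("Tom",   Seq("mango", "guava"))
).toDF("sales_person_name", "products")

// explode() emits one row per array element: (sales_person_name, product).
val exploded = df.select($"sales_person_name", explode($"products").as("product"))

// Who sold the most mangoes: count mango rows per seller, highest first.
exploded.filter($"product" === "mango")
  .groupBy("sales_person_name").count()
  .orderBy($"count".desc)
  .show()
```

On the sample data this puts Tom first with a count of 2; the same groupBy works for any product.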


+-----------------+-------+
|sales_person_name|product|
+-----------------+-------+
|             John|  apple|
|             John|  mango|
|             John|  guava|
|              Tom|  mango|
|              Tom| orange|
|             John|  apple|
|             John| banana|
|            Steve|  apple|
|            Steve|  mango|
|              Tom|  mango|
|              Tom|  guava|
+-----------------+-------+