prakash prakash - 1 year ago 74
Scala Question

How to get data of previous row in Apache Spark

Find previous month sale of each city from Spark Data frame

|City| Month |Sale|
+----+----------- +----- +
| c1| JAN-2017| 49 |
| c1| FEB-2017| 46 |
| c1| MAR-2017| 83 |
| c2| JAN-2017| 59 |
| c2| MAY-2017| 60 |
| c2| JUN-2017| 49 |
| c2| JUL-2017| 73 |
+----+-----+----+-------


Required solution is

|City| Month |Sale |previous_sale|
+----+-----+-------+-------------+--------
| c1| JAN-2017| 49| NULL |
| c1| FEB-2017| 46| 49 |
| c1| MAR-2017| 83| 46 |
| c2| JAN-2017| 59| NULL |
| c2| MAY-2017| 60| 59 |
| c2| JUN-2017| 49| 60 |
| c2| JUL-2017| 73| 49 |
+----+-----+----+-------------+-----------


Please help me

Answer Source

You can use lag function to get the previous value

If you want to sort by month you need to convert to proper date. For "JAN-2017" to "01-01-2017" something like this.

import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(
  ("c1", "JAN-2017", 49),
("c1", "FEB-2017", 46),
("c1", "MAR-2017", 83),
("c2", "JAN-2017", 59),
("c2", "MAY-2017", 60),
("c2", "JUN-2017", 49),
("c2", "JUL-2017", 73)
)).toDF("city", "month", "sales")

val window = Window.partitionBy("city").orderBy("month")

df.withColumn("previous_sale", lag($"sales", 1, null).over(window)).show

Output:

+----+--------+-----+----+
|city|   month|sales| previous_sale|
+----+--------+-----+----+
|  c1|FEB-2017|   46|null|
|  c1|JAN-2017|   49|  46|
|  c1|MAR-2017|   83|  49|
|  c2|JAN-2017|   59|null|
|  c2|JUL-2017|   73|  59|
|  c2|JUN-2017|   49|  73|
|  c2|MAY-2017|   60|  49|
+----+--------+-----+----+

You can use this UDF to create a default date like 01/month/year which will be used so sort with date even if it has different year

val fullDate = udf((value :String )=>
{
  val months = List("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
  val splited = value.split("-")
  new Date(splited(1).toInt, months.indexOf(splited(0)) + 1, 1)
})

df.withColumn("month", fullDate($"month")).show()

Hope this hepls!

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download