a.moussa - 21 days ago
Scala Question

Sort Spark DataFrame's column by date

I have a DataFrame like that:

+---+-------------------+
| C1|                 C2|
+---+-------------------+
| A |21/12/2015-17-14-12|
| A |21/12/2011-20-12-44|
| B |11/02/2015-15-31-11|
| B |09/04/2015-13-38-05|
| C |11/06/2013-23-04-35|
+---+-------------------+


The second column is a timestamp in the format dd/MM/yyyy-HH-mm-ss. I would like to sort the rows by that timestamp, like this:

+---+-------------------+
| C1|                 C2|
+---+-------------------+
| A |21/12/2011-20-12-44|
| C |11/06/2013-23-04-35|
| B |11/02/2015-15-31-11|
| B |09/04/2015-13-38-05|
| A |21/12/2015-17-14-12|
+---+-------------------+


Perhaps I have to use a UDF? Do you have any idea?

Answer

This can be achieved in three steps:

  • conversion of your dates into timestamps (using the first UDF),
  • sorting by those timestamps,
  • conversion of the timestamps back to your format (using the second UDF).
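
Before wiring these steps into Spark, they can be sanity-checked in plain Scala on your sample rows. This sketch uses the same `SimpleDateFormat` pattern and is only meant to verify the date handling in isolation:

```scala
import java.text.SimpleDateFormat
import java.sql.Timestamp

val fmt = new SimpleDateFormat("dd/MM/yyyy-HH-mm-ss")

val raw = Seq(
  ("A", "21/12/2015-17-14-12"),
  ("A", "21/12/2011-20-12-44"),
  ("B", "11/02/2015-15-31-11"),
  ("B", "09/04/2015-13-38-05"),
  ("C", "11/06/2013-23-04-35")
)

// step 1: parse each custom string into a Timestamp
val parsed = raw.map { case (c1, c2) => (c1, new Timestamp(fmt.parse(c2).getTime)) }
// step 2: sort by the timestamps (epoch milliseconds)
val sorted = parsed.sortBy(_._2.getTime)
// step 3: format the timestamps back into the original representation
val restored = sorted.map { case (c1, ts) => (c1, fmt.format(ts)) }

restored.foreach(println)
```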

You should try something like this:

import java.text.SimpleDateFormat
import java.sql.Timestamp
import java.util.Date

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val sparkSession = ...
import sparkSession.implicits._

val input = Seq(
  ("A", "21/12/2015-17-14-12"),
  ("A", "21/12/2011-20-12-44"),
  ("B", "11/02/2015-15-31-11"),
  ("B", "09/04/2015-13-38-05"),
  ("C", "11/06/2013-23-04-35")
).toDF("C1", "C2")

// the custom pattern: day/month/year-hours-minutes-seconds
val dateFormat = new SimpleDateFormat("dd/MM/yyyy-HH-mm-ss")

val customToTimestamp = udf { value: String =>
  // parse the custom string into a java.sql.Timestamp
  val date = dateFormat.parse(value)

  new Timestamp(date.getTime)
}

val timestampToCustom = udf { value: Timestamp =>
  // render the timestamp back in the original format
  val date = new Date(value.getTime)

  dateFormat.format(date)
}

val output = input
  .select($"C1", customToTimestamp($"C2").as("C2"))
  .sort($"C2")
  .select($"C1", timestampToCustom($"C2").as("C2"))

output.show()

Which outputs:

+---+-------------------+
| C1|                 C2|
+---+-------------------+
|  A|21/12/2011-20-12-44|
|  C|11/06/2013-23-04-35|
|  B|11/02/2015-15-31-11|
|  B|09/04/2015-13-38-05|
|  A|21/12/2015-17-14-12|
+---+-------------------+
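
As a side note, if your Spark version is 1.5 or later, the ordering can also be obtained without any UDF, using the built-in unix_timestamp function with a custom pattern. This sketch assumes the input DataFrame from above; C2 keeps its original string form, so no conversion back is needed:

```scala
import org.apache.spark.sql.functions.unix_timestamp

// sort by the parsed epoch seconds; C2 itself is left untouched
val sortedOutput = input.sort(unix_timestamp($"C2", "dd/MM/yyyy-HH-mm-ss"))

sortedOutput.show()
```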