Georg Heiler - 20 days ago
Scala Question

Spark time difference

From Spark 1.5.x on, some convenience methods were added to deal with time.

df.select(datediff($"A", $"B")).show


But this returns the difference in days. Is there any option to convert it to a different interval e.g. years?

Manually I used

df.select(year($"A") - year($"B")).show


which is good enough. But it would be better if the date difference were calculated more accurately, e.g. including months and days, like the age calculation displayed at http://howtodoinjava.com/for-fun-only/java-code-to-calculate-age-from-date-of-birth/

Is a UDF the go-to solution for current Spark 2.0.2, or am I missing an option to convert the difference in days to e.g. years?

Answer

Timestamp type can be converted to its Unix timestamp in seconds. If this resolution is good enough for you, the rest is just a bunch of arithmetic operations. You can compute the difference in seconds as:

import org.apache.spark.sql.Column

def secondsBetween(col1: Column, col2: Column) =
  col2.cast("timestamp").cast("bigint") - col1.cast("timestamp").cast("bigint")

and re-sample according to your needs, for example:

def minutesBetween(col1: Column, col2: Column) =
  (secondsBetween(col1, col2) / 60).cast("bigint")
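The same pattern extends to coarser intervals. A minimal sketch (the helper names `daysBetween` and `yearsBetween` are illustrative, and the 365.25-day average year is an approximation, not a calendar computation):

```scala
import org.apache.spark.sql.Column

// Approximate resampling built on secondsBetween.
def daysBetween(col1: Column, col2: Column): Column =
  (secondsBetween(col1, col2) / (60 * 60 * 24)).cast("bigint")

// 365.25 days per year is an average; calendar years vary in length.
def yearsBetween(col1: Column, col2: Column): Column =
  (secondsBetween(col1, col2) / (60 * 60 * 24 * 365.25)).cast("bigint")
```

which you would use like `df.select(yearsBetween($"B", $"A")).show`.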

Obviously this doesn't handle all the subtleties of date and time processing like daylight saving time or leap seconds, so if these are important in your domain, a UDF with a proper time-processing library is recommended.
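For the exact years/months/days breakdown asked about in the question, one sketch of such a UDF uses `java.time.Period` (available on Java 8, which Spark 2.0.x can run on); the name `calendarDiff` and the tuple result layout are illustrative:

```scala
import java.sql.Timestamp
import java.time.Period
import org.apache.spark.sql.functions.udf

// Exact calendar difference as (years, months, days), in the spirit of the
// linked age calculator; Period.between accounts for month lengths and leap years.
val calendarDiff = udf((start: Timestamp, end: Timestamp) => {
  val p = Period.between(
    start.toLocalDateTime.toLocalDate,
    end.toLocalDateTime.toLocalDate)
  (p.getYears, p.getMonths, p.getDays)
})
```

Applied as `df.select(calendarDiff($"B", $"A").as("diff"))`, this yields a struct column with the three components.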
