shakedzy - 4 years ago
Scala Question

Spark: computationally efficient way to compare dates?

I have a huge data set which needs to be filtered by date (dates are stored as strings in yyyy-MM-dd format). Which of the following options is the most computationally efficient way to do that (and why)?

df.filter("unix_timestamp(dt_column,'yyyy-MM-dd') >= unix_timestamp('2017-02-03','yyyy-MM-dd')")


OR

df.filter("cast(dt_column as date) >= cast('2017-02-03' as date)")

Answer Source

As dt_column is already in yyyy-MM-dd format, there is no need to cast it or run unix_timestamp on it again. Internally, Spark compares all date types lexicographically as strings (as of Spark 2.1); there is no date type at the low level when the comparison happens.
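In other words (a minimal sketch, not from the original answer), because yyyy-MM-dd strings sort in the same order as the dates they represent, the filter can be written against the raw string column with no conversion at all:

// String comparison works because yyyy-MM-dd sorts in date order
df.filter("dt_column >= '2017-02-03'")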

Now, cast('2017-02-03' as date) and unix_timestamp('2017-02-03','yyyy-MM-dd') should not cause a performance issue, since they operate on a constant. I'd recommend using the Dataset functions rather than SQL expression strings, so that syntax issues are caught at compile time:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DataTypes

// These two should behave the same
df.filter(df("dt_column") >= lit("2017-02-03"))

df.filter(df("dt_column") >= lit("2017-02-03").cast(DataTypes.DateType))

Both cast and unix_timestamp produce a comparable value from a string, but unix_timestamp lets you specify the input format, so it can parse dates stored in formats other than the default. Apart from that, there shouldn't be any difference in terms of performance.
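For example (a hypothetical sketch, not from the original answer, assuming a column dt_other stored as dd/MM/yyyy strings), unix_timestamp can parse a non-default format that a plain cast to date would not handle:

// Parse a non-ISO format on the left, an ISO constant on the right
df.filter("unix_timestamp(dt_other, 'dd/MM/yyyy') >= unix_timestamp('2017-02-03', 'yyyy-MM-dd')")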
