I have a huge dataset that needs to be filtered by date (dates are stored as `yyyy-MM-dd` strings). Which of the following options is the most computationally efficient way to do that, and why?
df.filter("unix_timestamp(dt_column,'yyyy-MM-dd') >= unix_timestamp('2017-02-03','yyyy-MM-dd')")
df.filter("cast(dt_column as date) >= cast('2017-02-03' as date)")
`dt_column` is already in `yyyy-MM-dd` format, so there is no need to run `unix_timestamp` on it again. Internally, Spark simply does a lexicographic comparison on the strings here (as of Spark 2.1); no `date` type is involved at the low level when the comparison happens.
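To see why a plain string comparison is safe here: `yyyy-MM-dd` is fixed-width and zero-padded, so lexicographic order matches chronological order. A minimal sketch (the sample dates and the local `SparkSession` setup are my own illustration, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// yyyy-MM-dd is fixed-width and zero-padded, so string order == date order.
val df = Seq("2017-01-31", "2017-02-03", "2017-12-01").toDF("dt_column")

// A plain string comparison filters correctly with no date parsing at all.
df.filter(col("dt_column") >= "2017-02-03").show()
// keeps 2017-02-03 and 2017-12-01
```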
`cast('2017-02-03' as date)` and `unix_timestamp('2017-02-03','yyyy-MM-dd')` should not cause a performance issue, since each is a constant that is evaluated once. I'd recommend using the Dataset functions instead of SQL expression strings, so that syntax issues are caught at compile time:
```scala
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DataTypes

// These two should be equivalent
df.filter(df("dt_column") >= lit("2017-02-03"))
df.filter(df("dt_column") >= lit("2017-02-03").cast(DataTypes.DateType))
```
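If you want to confirm that both variants end up with the same comparison under the hood, you can inspect the plans Catalyst generates (a sketch; the exact `explain` output depends on your Spark version):

```scala
// Compare the parsed/optimized/physical plans for each filter.
df.filter(df("dt_column") >= lit("2017-02-03")).explain(true)
df.filter(df("dt_column") >= lit("2017-02-03").cast(DataTypes.DateType)).explain(true)
```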
`cast` and `unix_timestamp` both produce dates from strings, but `unix_timestamp` gives you the option to parse dates in different formats. Apart from that, there shouldn't be any difference in terms of performance.
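For example, `cast` only understands the default `yyyy-MM-dd` layout, while `unix_timestamp` accepts an explicit pattern. A sketch, where the `dd/MM/yyyy` column `other_dt` is hypothetical:

```scala
import org.apache.spark.sql.functions.{col, lit, unix_timestamp}

// cast works only when the string already uses the default yyyy-MM-dd layout:
df.filter(col("dt_column").cast("date") >= lit("2017-02-03").cast("date"))

// unix_timestamp can parse other layouts, e.g. a hypothetical dd/MM/yyyy column:
df.filter(
  unix_timestamp(col("other_dt"), "dd/MM/yyyy") >=
  unix_timestamp(lit("2017-02-03"), "yyyy-MM-dd")
)
```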