Dmitry Polonskiy - 23 days ago
Python Question

Reducing a DateTime Object in PySpark

I have two DataFrames. One has datetime as DATE=datetime.date(2014, 2, 1) and the other has datetime as pickup_time=datetime.datetime(2014, 2, 9, 14, 51). I am unable to join the two DataFrames in PySpark because the second column also carries hours/minutes/seconds. Is the correct approach to reformat the datetime column that has the extra time component, or is there a way to join the DataFrames that disregards the hours/minutes/seconds? How would I go about doing this?

Answer

You can cast the datetime column to a date inside the join condition, for example:

>>> df1.first()
Row(date=datetime.date(2016, 11, 11))
>>> df2.first()
Row(date=datetime.datetime(2016, 11, 11, 21, 8))
>>> df1.join(df2, df1.date == df2.date.cast('date')).first()
Row(date=datetime.date(2016, 11, 11), date=datetime.datetime(2016, 11, 11, 21, 8))