curtisp curtisp - 23 days ago 21
Python Question

Spark DataFrame TimestampType - how to get Year, Month, Day values from field?

I have Spark DataFrame with take(5) top rows as follows:

[Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]


It's schema is defined as:

elevDF.printSchema()

root
|-- date: timestamp (nullable = true)
|-- hour: long (nullable = true)
|-- value: double (nullable = true)


How do I get the Year, Month, Day values from the 'date' field?

Answer

You can use simple map as with any other RDD:

elevDF = sqlContext.createDataFrame(sc.parallelize([
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
        Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))

(elevDF
 .map(lambda (date, hour, value): (date.year, date.month, date.day))
 .collect())

and the result is:

[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]

Btw: datetime.datetime stores an hour anyway so keeping it separately seems to be a waste of memory.

Since Spark 1.5 you can use a number of date processing functions

import datetime
from pyspark.sql.functions import year, month, dayofmonth

elevDF = sc.parallelize([
    (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])

elevDF.select(year("date").alias('year'), month("date").alias('month'), dayofmonth("date").alias('day')).show()
# +----+-----+---+
# |year|month|day|
# +----+-----+---+
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# +----+-----+---+