Sai Wai Maung Sai Wai Maung - 1 month ago 34
Scala Question

Scala: Spark SQL to_date(unix_timestamp) returning NULL

Spark Version: spark-2.0.1-bin-hadoop2.7
Scala: 2.11.8


I am loading a raw csv into a DataFrame. In csv, although the column is support to be in date format, they are written as 20161025 instead of 2016-10-25. The parameter
date_format
includes string of column names that need to be converted to yyyy-mm-dd format.

In the following code, I first loaded the csv of Date column as StringType via the
schema
, and then I check if the
date_format
is not empty, that is there are columns that need to be converted to
Date
from
String
, then cast each column using
unix_timestamp
and
to_date
. However, in the
csv_df.show()
, the returned rows are all
null
.

def read_csv(csv_source:String, delimiter:String, is_first_line_header:Boolean,
schema:StructType, date_format:List[String]): DataFrame = {
println("|||| Reading CSV Input ||||")

var csv_df = sqlContext.read
.format("com.databricks.spark.csv")
.schema(schema)
.option("header", is_first_line_header)
.option("delimiter", delimiter)
.load(csv_source)
println("|||| Successfully read CSV. Number of rows -> " + csv_df.count() + " ||||")
if(date_format.length > 0) {
for (i <- 0 until date_format.length) {
csv_df = csv_df.select(to_date(unix_timestamp(
csv_df(date_format(i)), "yyyy-­MM-­dd").cast("timestamp")))
csv_df.show()
}
}
csv_df
}


Returned Top 20 rows:

+-------------------------------------------------------------------------+


|to_date(CAST(unix_timestamp(prom_price_date, YYYY-­MM-­DD) AS TIMESTAMP))|
+-------------------------------------------------------------------------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-------------------------------------------------------------------------+

Why am I getting all
null
?

Answer

To convert yyyyMMdd to yyyy-MM-dd you can:

spark.sql("""SELECT DATE_FORMAT(
  CAST(UNIX_TIMESTAMP('20161025', 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'
)""")

with functions:

date_format(unix_timestamp(col, "yyyyMMdd").cast("timestamp"), "yyyy-MM-dd")
Comments