smeeb - 1 month ago
Scala Question

Adding StringType column to existing Spark DataFrame and then applying default values

Scala 2.10 here, using Spark 1.6.2. I have a similar (but not the same) question as this one; however, the accepted answer there is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark, so I can't reproduce it or make sense of it. More importantly, that question is limited to adding a new column to an existing dataframe, whereas I need to add a column as well as a value for all existing rows in the dataframe.

So I want to add a column to an existing Spark DataFrame, and then apply an initial ('default') value for that new column to all rows.

val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)

jsonDF.show()


When I run that, I get the following output (via .show()):

+----+--------+
| x| y|
+----+--------+
|true|not true|
+----+--------+


Now I want to add a new field to jsonDF, after it's created and without modifying the json string, such that the resultant DF would look like this:

+----+--------+----+
| x| y| z|
+----+--------+----+
|true|not true| red|
+----+--------+----+


Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".

From that other question I have pieced the following pseudo-code together:

val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)

//jsonDF.show()

val newDF = jsonDF.withColumn("z", jsonDF("col") + 1) // fails: jsonDF has no column named "col"

newDF.show()


But when I run this, I get a runtime AnalysisException from that .withColumn(...) call:

org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)


I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?

Answer

You can use the lit function. First you have to import it:

import org.apache.spark.sql.functions.lit

and then use it as shown below:

jsonDF.withColumn("z", lit("red"))

The type of the new column will be inferred automatically (StringType here, since "red" is a String).
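Putting it together with the question's own setup, a minimal end-to-end sketch (assuming a Spark 1.6-style sparkContext and sqlContext, as in the question):

```scala
import org.apache.spark.sql.functions.lit

// Build the original single-row DataFrame from the JSON string
val json: String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)

// Add a "z" column holding the literal value "red" for every row;
// lit(...) wraps the Scala value in a Column, which withColumn requires
val newDF = jsonDF.withColumn("z", lit("red"))

newDF.show()
// +----+--------+---+
// |   x|       y|  z|
// +----+--------+---+
// |true|not true|red|
// +----+--------+---+
```

Note that jsonDF is not modified in place; withColumn returns a new DataFrame, which is why the result is captured in newDF.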