smeeb - 1 year ago
Scala Question

Adding StringType column to existing Spark DataFrame and then applying default values

Scala 2.10 here, using Spark 1.6.2. I have a similar (but not the same) question as this one; however, the accepted answer there is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark, so I can't reproduce it or make sense of it. More importantly, that question is limited to adding a new column to an existing DataFrame, whereas I need to add a column and also set a value for it on all existing rows in the DataFrame.

So I want to add a column to an existing Spark DataFrame, and then apply an initial ('default') value for that new column to all rows.

val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)

When I run that I get the following as output (via jsonDF.show()):

+----+--------+
|   x|       y|
+----+--------+
|true|not true|
+----+--------+

Now I want to add a new field to jsonDF, after it's created and without modifying the json string, such that the resultant DF would look like this:

+----+--------+---+
|   x|       y|  z|
+----+--------+---+
|true|not true|red|
+----+--------+---+

Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".

From that other question I have pieced the following pseudo-code together:

val json : String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)


val newDF = jsonDF.withColumn("z", jsonDF("col") + 1)

But when I run this, I get a runtime exception on that withColumn call:

org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)

I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?

Answer Source

You can use the lit function. First you have to import it:

import org.apache.spark.sql.functions.lit

and then use it as shown below:

jsonDF.withColumn("z", lit("red"))

The type of the column will be inferred automatically.
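Putting the question and answer together, a minimal end-to-end sketch might look like the following. This assumes a Spark 1.6-style environment where a sparkContext and a sqlContext are already in scope (e.g. in the spark-shell); it is an illustration, not a tested standalone program.

import org.apache.spark.sql.functions.lit

// Build a one-row DataFrame from a JSON string, as in the question.
val json: String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)

// lit("red") wraps the constant in a Column expression, and
// withColumn applies it to every row -- so there is no separate
// "default value" API needed.
val newDF = jsonDF.withColumn("z", lit("red"))

newDF.show()

Note that withColumn returns a new DataFrame rather than modifying jsonDF in place, which is why the result is bound to newDF.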
