Feynman27 - 1 year ago
Scala Question

Applying a function (mkString) to an entire column in Spark dataframe, error if column name has "."

I'm attempting to apply a function to a column of a Spark dataframe in Scala. The column is of String type, and I'd like to join the space-separated tokens in each string with an "_" delimiter (e.g. "A B" --> "A_B"). I'm doing this with:

import org.apache.spark.sql.functions.udf

val converter: (String => String) = (arg: String) => {arg.split(" ").mkString("_")}
val myUDF = udf(converter)
val newDF = oldDF
  .withColumn("TEST", myUDF(oldDF("colA.B")))

This works for columns whose names do not contain a dot ("."). However, the dot in the column name "colA.B" seems to break the code, which throws the error:

org.apache.spark.sql.AnalysisException: Cannot resolve column name "colA.B" among (colA.B, col1, col2);

I suppose a workaround would be to rename the column (similar to this), but I'd prefer not to do that.

Answer Source

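You can wrap the column name in backticks, as in the example below (source). Without them, Spark treats the dot as access to a field nested inside a struct column, so it looks for a column "colA" with a field "B" instead of a column literally named "colA.B".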

val df = sqlContext.createDataFrame(Seq(
  ("user1", "task1"),
  ("user2", "task2")
)).toDF("user", "user.task")
df.select(df("user"), df("`user.task`")).show()

+-----+---------+
| user|user.task|
+-----+---------+
|user1|    task1|
|user2|    task2|
+-----+---------+

In your case, wrap the column name in backticks before passing it to the UDF, i.e. use oldDF("`colA.B`") instead of oldDF("colA.B").
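Applied to the code from the question, the fix might look like the sketch below. It assumes Spark 2.x or later with a local SparkSession. The object name BacktickColumnExample, the sample rows, and the app name are made up for illustration; only the column names and the UDF come from the question. It also assumes the column contains no nulls, since the converter would throw on a null input.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical wrapper object, used only for illustration.
object BacktickColumnExample {

  // Same logic as in the question: split on single spaces, join with "_".
  // Assumes non-null input.
  val converter: String => String = (arg: String) => arg.split(" ").mkString("_")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: run locally
      .appName("backtick-example")   // hypothetical app name
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with the column names from the error message.
    val oldDF = Seq(("A B", 1, 2), ("C D E", 3, 4)).toDF("colA.B", "col1", "col2")

    val myUDF = udf(converter)

    // Backticks make Spark treat "colA.B" as one literal column name
    // rather than as field "B" of a struct column "colA".
    val newDF = oldDF.withColumn("TEST", myUDF(oldDF("`colA.B`")))

    newDF.show()
    spark.stop()
  }
}
```

The original column keeps its dotted name, so no renaming is needed; any later reference to it (for example in select) needs the same backticks.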