
Applying a function (mkString) to an entire column in Spark dataframe, error if column name has "."

I'm attempting to apply a function over a column of a Spark dataframe in Scala. The column is a String type, and I'd like to concatenate each token in the string with an "_" delimiter (e.g. "A B" --> "A_B"). I'm doing this with:

import org.apache.spark.sql.functions.udf

val converter: (String => String) = (arg: String) => arg.split(" ").mkString("_")
val myUDF = udf(converter)
val newDF = oldDF
  .withColumn("TEST", myUDF(oldDF("colA.B")))

This works for columns in the dataframe whose names contain no dot ("."). However, the dot in the column name "colA.B" breaks the column resolution and throws the error:

org.apache.spark.sql.AnalysisException: Cannot resolve column name "colA.B" among (colA.B, col1, col2);

I suppose a workaround would be to rename the column (similar to this), but I'd prefer not to do that.
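
For comparison, a minimal sketch of that renaming workaround (renaming positionally with toDF, assuming oldDF has exactly the columns colA.B, col1, col2 in that order):

// Renaming workaround (the one I'd rather avoid): drop the dot from the
// column name first, then apply the UDF to the renamed column.
val renamed = oldDF.toDF("colA_B", "col1", "col2")
val newDF = renamed.withColumn("TEST", myUDF(renamed("colA_B")))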


You can escape such a column name with backticks: Spark otherwise interprets the dot as struct-field access (colA.B would be read as field B of a column named colA). For example (source):

val df = sqlContext.createDataFrame(Seq(
  ("user1", "task1"),
  ("user2", "task2")
)).toDF("user", "user.task")
df.select(df("user"), df("`user.task`")).show()

+-----+---------+
| user|user.task|
+-----+---------+
|user1|    task1|
|user2|    task2|
+-----+---------+

In your case, back-quote the column name in the same way before applying the function.
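
Applied to your snippet, that would look roughly like this (same oldDF and myUDF as in the question):

// Back-quoting makes Spark treat "colA.B" as a single column name
// instead of field B of a struct column named colA.
val newDF = oldDF
  .withColumn("TEST", myUDF(oldDF("`colA.B`")))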