
Spark 1.6: drop column in DataFrame with escaped column names

I'm trying to drop a column in a DataFrame, but I have column names with dots in them, which I escaped.

Before escaping, my schema looks like this:

root
|-- user_id: long (nullable = true)
|-- hourOfWeek: string (nullable = true)
|-- observed: string (nullable = true)
|-- raw.hourOfDay: long (nullable = true)
|-- raw.minOfDay: long (nullable = true)
|-- raw.dayOfWeek: long (nullable = true)
|-- raw.sensor2: long (nullable = true)
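
A minimal sketch that reproduces this kind of schema (hypothetical data, not my actual pipeline; Spark 1.6 API):

import org.apache.spark.sql.SQLContext

// Build a DataFrame whose column names contain literal dots.
val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
val df = sqlContext.createDataFrame(Seq(
  (1L, "mon-09", "yes", 9L, 540L, 1L, 7L)
)).toDF("user_id", "hourOfWeek", "observed",
  "raw.hourOfDay", "raw.minOfDay", "raw.dayOfWeek", "raw.sensor2")
df.printSchema()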


If I try to drop a column, I get:

df = df.drop("hourOfWeek")
org.apache.spark.sql.AnalysisException: cannot resolve 'raw.hourOfDay' given input columns raw.dayOfWeek, raw.sensor2, observed, raw.hourOfDay, hourOfWeek, raw.minOfDay, user_id;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)


Note that I'm not even trying to drop the columns with dots in their names. (It looks like drop(String) re-selects the remaining columns by name, and the analyzer then parses raw.hourOfDay as field hourOfDay of a struct column named raw, which is why the error mentions a column I never touched.)
Since I couldn't seem to do much without escaping the column names, I converted the schema to:

root
|-- user_id: long (nullable = true)
|-- hourOfWeek: string (nullable = true)
|-- observed: string (nullable = true)
|-- `raw.hourOfDay`: long (nullable = true)
|-- `raw.minOfDay`: long (nullable = true)
|-- `raw.dayOfWeek`: long (nullable = true)
|-- `raw.sensor2`: long (nullable = true)


but that doesn't seem to help. I still get the same error.
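
(For reference, the conversion was along these lines, a sketch rather than my exact code: wrap each dotted name in backticks.)

// Rename columns, wrapping any name containing a dot in backticks.
val escaped = df.toDF(df.columns.map { c =>
  if (c.contains(".")) s"`$c`" else c
}: _*)
escaped.printSchema()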

I tried escaping all column names and dropping using the escaped name, but that doesn't work either:

root
|-- `user_id`: long (nullable = true)
|-- `hourOfWeek`: string (nullable = true)
|-- `observed`: string (nullable = true)
|-- `raw.hourOfDay`: long (nullable = true)
|-- `raw.minOfDay`: long (nullable = true)
|-- `raw.dayOfWeek`: long (nullable = true)
|-- `raw.sensor2`: long (nullable = true)

df.drop("`hourOfWeek`")
org.apache.spark.sql.AnalysisException: cannot resolve 'user_id' given input columns `user_id`, `raw.dayOfWeek`, `observed`, `raw.minOfDay`, `raw.hourOfDay`, `raw.sensor2`, `hourOfWeek`;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)


Is there another way to drop a column that would not fail on this type of data?

Answer

Alright, I seem to have found the solution after all:

df.drop(df.col("hourOfWeek")) seems to work.
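
Passing a Column instead of a String presumably means the column is resolved against the DataFrame directly, rather than the remaining names being re-parsed (where a dotted name reads as a struct field access). A short sketch against the schema above (Spark 1.6; drop(Column) exists since 1.4.1):

// drop(Column) resolves against this DataFrame directly, so the dotted
// names elsewhere in the schema never go back through the string parser.
df = df.drop(df.col("hourOfWeek"))

// The same call works for the dotted columns themselves; the backticks
// keep "raw.hourOfDay" from being parsed as field hourOfDay of a struct raw.
df = df.drop(df.col("`raw.hourOfDay`"))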
