Mark Tozzi - 1 year ago
Java Question

How to check equality of spark columns after renaming

I am trying to write some tests for a Java Spark-Sql application. One operation I need to test renames a column, and I ran into some difficulty comparing the actual value of the renamed column with my expected value. After some experimentation, I was able to write the following two tests to demonstrate the problem:

First, as a sanity check, I tried this (df is a Spark SQL DataFrame, generated by reading some sample data from a JSON file I'm testing against):

public void testColumnEquality() throws Exception {
    Column val1 = df.col("col2");
    Column val2 = df.col("col2");
    Assert.assertEquals(val1, val2);
}

Which passes, as one would expect. Then I tried this:

public void testRenameColumnEquality() throws Exception {
    Column val1 = df.col("col2").as("col2");
    Column val2 = df.col("col2").as("col2");
    Assert.assertEquals(val1, val2);
}

which fails with the error
java.lang.AssertionError: expected:<col2 AS col2#4L> but was:<col2 AS col2#5L>

Digging around in the Scala code (full disclosure: I know very little Scala), it looks like this has to do with the unique id that each alias is given (the #4L and #5L suffixes in the error above).

Is there any way to sensibly check that these two columns represent the same operations with the same alias?

(I'm working in spark 1.6, and would ideally like a solution for that version line, but if this is fixed in 2.0 that would also be good information.)

Thank you.

Answer Source

I wrote a blog post about how to resolve this:

The trick is: check whether the column's Expression is an Alias:

`column.expr() instanceof Alias`

If it does, unpack the child expression and the name using the Extractor pattern:

Alias alias = (Alias) column.expr();
Option<Tuple2<Expression, String>> aliasTuple = Alias$.MODULE$.unapply(alias);
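Once the child and the name are unpacked, the comparison itself is simple: two aliases match when their names are equal and their children match, and the generated id is left out. Below is a minimal sketch of that idea that runs without Spark on the classpath. `Expr`, `ColRef`, `AliasExpr` and `sameIgnoringId` are hypothetical stand-ins invented for illustration, not Spark classes; `AliasExpr` only imitates the behavior seen in the error message, where each alias gets a fresh id. In a real test the child and name would come from the `unapply` call above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a Catalyst expression; NOT a Spark type.
interface Expr {}

// Stand-in for a plain column reference: equal when the column names match.
final class ColRef implements Expr {
    final String name;

    ColRef(String name) { this.name = name; }

    @Override public boolean equals(Object o) {
        return o instanceof ColRef && ((ColRef) o).name.equals(name);
    }

    @Override public int hashCode() { return name.hashCode(); }
}

// Stand-in for an alias. Every instance gets a fresh id, and equals() includes
// that id, so two separately built aliases are never equal (assumed to mirror
// the "#4L" vs "#5L" mismatch in the question).
final class AliasExpr implements Expr {
    private static final AtomicLong NEXT_ID = new AtomicLong();

    final Expr child;
    final String name;
    final long exprId = NEXT_ID.getAndIncrement();

    AliasExpr(Expr child, String name) {
        this.child = child;
        this.name = name;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof AliasExpr)) return false;
        AliasExpr other = (AliasExpr) o;
        return exprId == other.exprId
                && name.equals(other.name)
                && child.equals(other.child);
    }

    @Override public int hashCode() { return Long.hashCode(exprId); }
}

class AliasEqualitySketch {
    // Compare two expressions, ignoring the generated id on aliases.
    static boolean sameIgnoringId(Expr a, Expr b) {
        if (a instanceof AliasExpr && b instanceof AliasExpr) {
            AliasExpr x = (AliasExpr) a;
            AliasExpr y = (AliasExpr) b;
            // Same alias name and (recursively) the same underlying expression.
            return x.name.equals(y.name) && sameIgnoringId(x.child, y.child);
        }
        // Anything that is not a pair of aliases falls back to normal equality.
        return a.equals(b);
    }

    public static void main(String[] args) {
        Expr a = new AliasExpr(new ColRef("col2"), "col2");
        Expr b = new AliasExpr(new ColRef("col2"), "col2");
        System.out.println(a.equals(b));          // false: the ids differ
        System.out.println(sameIgnoringId(a, b)); // true: same child and name
    }
}
```

In the test from the question, this would replace `Assert.assertEquals(val1, val2)` with an `Assert.assertTrue(...)` on a helper of this shape, applied to `val1.expr()` and `val2.expr()`.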