Mark Tozzi - 2 months ago
Java Question

How to check equality of spark columns after renaming

I am trying to write some tests for a Java Spark SQL application. One operation I need to test renames a column, and I ran into some difficulty comparing the actual value of the renamed column with my expected value. After some experimentation, I wrote the following two tests to demonstrate the problem:

First, as a sanity check, I tried this (df is a Spark SQL DataFrame, generated by reading some sample data from a JSON file I'm testing against):

    @Test
    public void testColumnEquality() throws Exception {
        Column val1 = df.col("col2");
        Column val2 = df.col("col2");
        Assert.assertEquals(val1, val2);
    }


Which passes, as one would expect. Then I tried this:

    @Test
    public void testRenameColumnEquality() throws Exception {
        Column val1 = df.col("col2").as("col2");
        Column val2 = df.col("col2").as("col2");
        Assert.assertEquals(val1, val2);
    }


which fails with the error:

    java.lang.AssertionError: expected:<col2 AS col2#4L> but was:<col2 AS col2#5L>


Digging around in the Scala code (full disclosure: I know very little Scala), it looks like this has to do with the NamedExpression unique ID.

Is there any way to sensibly check that these two columns represent the same operations with the same alias?

(I'm working in Spark 1.6, and would ideally like a solution for that version line, but if this is fixed in 2.0 that would also be good information.)

Thank you.

Answer

I wrote a blog post about how to resolve this:

The trick is to check whether the column's Expression is an Alias:

`column.expr() instanceof Alias`

If it does, unpack the child expression and the name using the Extractor pattern:

    Alias alias = (Alias) column.expr();
    // unapply yields the (child, name) pair; the generated expression ID is not part of it
    Option<Tuple2<Expression, String>> aliasTuple = Alias$.MODULE$.unapply(alias);
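Once the child and the name are unpacked from both columns, the test can compare those instead of the whole expression. The following is a minimal, Spark-free sketch of that comparison, not Spark code: `Attr`, `AliasExpr` and `sameIgnoringId` are hypothetical stand-ins invented for illustration, and the assumption (based on the `#4L` vs `#5L` in the error above) is that an alias draws a fresh ID on creation and that its `equals` includes that ID. In a real test, the two values would come from `((Alias) column.expr()).child()` and `.name()`.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;

// Toy model only: none of these classes are Spark types.
class AliasEqualityDemo {
    private static final AtomicLong NEXT_ID = new AtomicLong();

    interface Expr {}

    // Stands in for a plain column reference such as df.col("col2").
    static final class Attr implements Expr {
        final String name;
        Attr(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof Attr && ((Attr) o).name.equals(name);
        }
        @Override public int hashCode() { return name.hashCode(); }
    }

    // Stands in for an alias: each instance gets a fresh id, and equals() includes it.
    static final class AliasExpr implements Expr {
        final Expr child;
        final String name;
        final long exprId = NEXT_ID.getAndIncrement();
        AliasExpr(Expr child, String name) { this.child = child; this.name = name; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof AliasExpr)) return false;
            AliasExpr other = (AliasExpr) o;
            return exprId == other.exprId
                    && name.equals(other.name)
                    && child.equals(other.child);
        }
        @Override public int hashCode() { return Objects.hash(child, name, exprId); }
    }

    // Compares two expressions while ignoring the generated id on aliases.
    static boolean sameIgnoringId(Expr a, Expr b) {
        if (a instanceof AliasExpr && b instanceof AliasExpr) {
            AliasExpr x = (AliasExpr) a;
            AliasExpr y = (AliasExpr) b;
            return x.name.equals(y.name) && sameIgnoringId(x.child, y.child);
        }
        return a.equals(b);
    }

    public static void main(String[] args) {
        Expr left = new AliasExpr(new Attr("col2"), "col2");
        Expr right = new AliasExpr(new Attr("col2"), "col2");
        System.out.println(left.equals(right));          // prints false: the ids differ
        System.out.println(sameIgnoringId(left, right)); // prints true: child and name match
    }
}
```

The same shape carries over to a test helper: if both expressions are aliases, assert on the names and the children separately; otherwise fall back to the ordinary `assertEquals`, which already works for unaliased columns as the first test shows.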