York Huang - 11 days ago

Spark Dataset Error: Both sides of this join are outside the broadcasting threshold and computing it could be prohibitively expensive

I am using Spark 2.0.2 in local mode. I have a join between two Datasets.

It is quite fast when using Spark SQL or the DataFrame API (the untyped Dataset[Row]).
But when I use the typed Dataset API, I get the error below.

Exception in thread "main" org.apache.spark.sql.AnalysisException: Both sides of this join are outside the broadcasting threshold and computing it could be prohibitively expensive. To explicitly enable it, please set spark.sql.crossJoin.enabled = true;

I increased "spark.sql.autoBroadcastJoinThreshold", but I still get the same error. I then set "spark.sql.crossJoin.enabled" to "true"; that works, but the join takes a very long time to complete.
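For reference, both settings can be applied when building the session. A minimal sketch (the app name and the 100 MB threshold value are assumptions, not values from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-test") // assumed name
  // threshold is in bytes; tables below it are broadcast automatically
  .config("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)
  // last-resort switch: lets the planner fall back to a cartesian product
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()
```

Note that enabling `spark.sql.crossJoin.enabled` does not make the join fast; it only permits the expensive cross-join plan the error message is warning about.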

I didn't do any repartitioning. The sources are two Parquet files.

Any ideas?

Answer

The autobroadcast threshold is limited to 2 GB (https://issues.apache.org/jira/browse/SPARK-6235), so if the table is larger than that, raising the threshold will not help. A workaround is to give Spark SQL a hint with the broadcast function, as follows:

largeTableDf.join(broadcast(smallTableDf), "key")
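The same hint works with the typed Dataset API from the question. A minimal sketch, assuming hypothetical case classes, join column, and Parquet paths (none of these come from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local[*]").appName("broadcast-join").getOrCreate()
import spark.implicits._

// assumed schemas for the two Parquet sources
case class Event(key: String, value: Long)
case class Dim(key: String, label: String)

// assumed paths; replace with your actual Parquet files
val large = spark.read.parquet("/data/large.parquet").as[Event]
val small = spark.read.parquet("/data/small.parquet").as[Dim]

// broadcast() forces a broadcast-hash join regardless of the planner's
// size estimate, avoiding both the cross-join fallback and a full shuffle
val joined = large.joinWith(broadcast(small), large("key") === small("key"))
```

`joinWith` keeps both sides typed (here `Dataset[(Event, Dim)]`), which is what the typed Dataset API gives you instead of the untyped `join` on DataFrames.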