I am using Spark 2.0.2 in local mode. I have a join that joins two datasets.
It is quite fast when using Spark SQL or the DataFrame API (untyped Dataset[Row]).
But when I use the typed Dataset API, I get the error below.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Both sides of this join are outside the broadcasting threshold and computing it could be prohibitively expensive. To explicitly enable it, please set spark.sql.crossJoin.enabled = true;
I increased "spark.sql.autoBroadcastJoinThreshold", but I still get the same error. I then set "spark.sql.crossJoin.enabled" to "true"; the join works, but it takes a very long time to complete.
I didn't do any repartitioning. The sources are two Parquet files.
The autobroadcast threshold is limited to 2 GB (https://issues.apache.org/jira/browse/SPARK-6235), so if the table size exceeds that value you will not be able to broadcast it. A workaround is to give Spark SQL a hint via the broadcast function, as follows:
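A minimal sketch of the broadcast hint on the typed Dataset API. The case classes, dataset names, and Parquet paths here are placeholders for illustration; substitute your own. The broadcast function itself is the standard hint from org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastHintExample {
  // Hypothetical schemas standing in for your two Parquet sources.
  case class Event(id: Long, value: String)
  case class Lookup(id: Long, label: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-hint-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Placeholder paths; point these at your actual Parquet files.
    val events: Dataset[Event]  = spark.read.parquet("/path/to/events").as[Event]
    val lookup: Dataset[Lookup] = spark.read.parquet("/path/to/lookup").as[Lookup]

    // broadcast() marks the smaller side for a broadcast hash join,
    // steering the typed joinWith away from the expensive plan that
    // triggered the AnalysisException.
    val joined: Dataset[(Event, Lookup)] =
      events.joinWith(broadcast(lookup), events("id") === lookup("id"))

    joined.show()
    spark.stop()
  }
}
```

This only helps if the hinted side actually fits in memory on each executor; if both sides are large, a broadcast hint is not a viable workaround.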