I am preparing a DataFrame with an id and a vector of my features to be used later for making predictions. I do a groupBy on my DataFrame, and in my groupBy I am merging a couple of columns as lists into a new column:
    def mergeFunction(...) // with 14 input variables

    val myudffunction = udf(mergeFunction _) // Spark doesn't support this

    df.groupBy("id").agg(
      collect_list(df(...)) as ...,
      ... // too many of these (something like 14 of them)
    ).withColumn("features", myudffunction(
      col(...),
      ...
      , col(...) ))
I am not sure how else I can fix this. Is the maximum number of UDF inputs in Spark going to get bigger, have I misunderstood them, or is there a better way?
User defined functions are defined for up to 22 parameters, but the udf helper is only defined for up to 10 arguments, which is why your 14-argument function is rejected. To handle a function with more parameters you can register it through UDFRegistration and call it by name. For example, the function
    val dummy = ((x0: Int, x1: Int, x2: Int, x3: Int, x4: Int, x5: Int,
                  x6: Int, x7: Int, x8: Int, x9: Int, x10: Int, x11: Int,
                  x12: Int, x13: Int, x14: Int, x15: Int, x16: Int, x17: Int,
                  x18: Int, x19: Int, x20: Int, x21: Int) => 1)
can be registered:
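    // assuming a Spark 2.x SparkSession named `spark`;
    // on Spark 1.x this would be sqlContext.udf.register
    spark.udf.register("dummy", dummy)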
and used with SQL or called by name:
    import org.apache.spark.sql.functions.{callUDF, lit}
    import spark.implicits._ // for toDF

    val exprs = (0 to 21).map(_ => lit(1))
    Seq(1).toDF.select(callUDF("dummy", exprs: _*).alias("dummy"))
You can also create a UserDefinedFunction object directly:

    import org.apache.spark.sql.expressions.UserDefinedFunction
    import org.apache.spark.sql.types.IntegerType

    Seq(1).toDF.select(UserDefinedFunction(dummy, IntegerType, None)(exprs: _*))
In practice, a function with 22 arguments is not very useful, and unless you want to use Scala reflection to generate the variants, they are a maintenance nightmare.
I would consider either using collections (array, map) or a struct as an input, or dividing this into multiple modules.
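For instance, here is a minimal sketch of the collection approach; the column names, the Double element type, and the sum placeholder are assumptions, not taken from your question:

    import org.apache.spark.sql.functions.{array, col, udf}

    // A single Seq parameter replaces the 14 separate arguments;
    // substitute your real merge logic for the sum placeholder.
    val mergeUdf = udf((xs: Seq[Double]) => xs.sum)

    // Pack the feature columns into one array column and pass that to the UDF.
    val featureCols = Seq("f1", "f2", "f3" /* ... up to 14 */).map(col)
    df.withColumn("features", mergeUdf(array(featureCols: _*)))

This keeps the UDF signature stable no matter how many feature columns you add, at the cost of requiring all of them to share one type; a struct input lifts that restriction.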