Spark DataFrames: when udf functions do not accept a large enough number of input variables

User defined functions are defined for up to 22 parameters, but the udf helper is only defined for at most 10 arguments. To handle functions with a larger number of parameters you can use org.apache.spark.sql.UDFRegistration.

For example:

val dummy = ((
  x0: Int, x1: Int, x2: Int, x3: Int, x4: Int, x5: Int, x6: Int, x7: Int, 
  x8: Int, x9: Int, x10: Int, x11: Int, x12: Int, x13: Int, x14: Int, 
  x15: Int, x16: Int, x17: Int, x18: Int, x19: Int, x20: Int, x21: Int) => 1)

can be registered:

import org.apache.spark.sql.expressions.UserDefinedFunction

val dummyUdf: UserDefinedFunction = spark.udf.register("dummy", dummy)

and used directly:

import org.apache.spark.sql.functions.lit

val df = spark.range(1)
val exprs = (0 to 21).map(_ => lit(1))

df.select(dummyUdf(exprs: _*))

or by name via callUDF:

import org.apache.spark.sql.functions.callUDF

df.select(
  callUDF("dummy", exprs:  _*).alias("dummy")
)

or with a SQL expression:

df.selectExpr(s"""dummy(${Seq.fill(22)(1).mkString(",")})""")

You can also create a UserDefinedFunction object directly (its constructor is public in older Spark versions):

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

Seq(1).toDF.select(UserDefinedFunction(dummy, IntegerType, None)(exprs: _*))

In practice a function with 22 arguments is not very useful, and unless you want to use Scala reflection to generate these definitions, they are a maintenance nightmare.

I would consider either using collections (array, map) or a struct as the input, or dividing the logic into multiple modules. For example:

import org.apache.spark.sql.functions.{array, lit, udf}
import spark.implicits._

val aLongArray = array((0 to 256).map(_ => lit(1)): _*)

val udfWithArray = udf((xs: Seq[Int]) => 1)

Seq(1).toDF.select(udfWithArray(aLongArray).alias("dummy"))
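A struct column works the same way, arriving in the UDF as a single Row. A minimal sketch, assuming arbitrary field names x0 to x21 chosen only for illustration:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{lit, struct, udf}
import spark.implicits._

// Pack the inputs into one struct column; the UDF receives it as a Row.
val aStruct = struct((0 to 21).map(i => lit(1).alias(s"x$i")): _*)

val udfWithStruct = udf((row: Row) => row.getInt(0))

Seq(1).toDF.select(udfWithStruct(aStruct).alias("dummy"))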
