Spark DataFrames when udf functions do not accept large enough input variables

User defined functions are defined for up to 22 parameters. Only udf helper is define for at most 10 arguments. To handle functions with larger number of parameters you can use org.apache.spark.sql.UDFRegistration. For example val dummy = (( x0: Int, x1: Int, x2: Int, x3: Int, x4: Int, x5: Int, x6: Int, x7: Int, x8: … Read more

Cake pattern with overriding abstract type don’t work with Upper Type Bounds

It’s a shortcoming of Scala’s type system. When determining the members in a mixin, Scala uses two rules: First, concrete always overrides abstract. Second, If two members are both concrete, or both abstract, then the one that comes later in linearization order wins. Furthermore, the self type of a trait trait S { this: C … Read more

How to pattern match into an uppercase variable?

The short answer is don’t. Syntax conventions make your code readable and understandable for others. Scala’s convention is that variables start with lower-case and constants and classes start with upper-case. By violating this, not only you get problems like pattern-matching issues, your code becomes less readable. (Believe me, if you ever have to read code … Read more

How to calculate the size of dataframe in bytes in Spark?

Usingspark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes we can get the size of actual Dataframe once its loaded into memory. Check the below code. scala> val df = spark.read.format(“orc”).load(“/tmp/srinivas/”) df: org.apache.spark.sql.DataFrame = [channelGrouping: string, clientId: string … 75 more fields] scala> import org.apache.commons.io.FileUtils import org.apache.commons.io.FileUtils scala> val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes bytes: BigInt = 763275709 scala> FileUtils.byteCountToDisplaySize(bytes.toLong) res5: String = 727 MB … Read more

Get a TypeTag from a Type?

It is possible: import scala.reflect.runtime.universe._ import scala.reflect.api val mirror = runtimeMirror(getClass.getClassLoader) // whatever mirror you use to obtain the `Type` def backward[T](tpe: Type): TypeTag[T] = TypeTag(mirror, new api.TypeCreator { def apply[U <: api.Universe with Singleton](m: api.Mirror[U]) = if (m eq mirror) tpe.asInstanceOf[U # Type] else throw new IllegalArgumentException(s”Type tag defined in $mirror cannot be migrated … Read more

Get TypeTag[A] from Class[A]

It is possible to create a TypeTag from a Class using Scala reflection, though I’m not sure if this implementation of TypeCreator is absolutely correct: import scala.reflect.runtime.universe._ def createOld[A](c: Class[A]): A = createNew { val mirror = runtimeMirror(c.getClassLoader) // obtain runtime mirror val sym = mirror.staticClass(c.getName) // obtain class symbol for `c` val tpe = … Read more

Scala macros: What is the difference between typed (aka typechecked) and untyped Trees

Theoretical part This is an architectural peculiarity of scalac that started leaking into the public API once we exposed internal compiler data structures in compile-time / runtime reflection in 2.10. Very roughly speaking, scalac’s frontend consists of a parser and a typer, both of which work with trees and produce trees as their result. However … Read more

How to transpose an RDD in Spark

Say you have an N×M matrix. If both N and M are so small that you can hold N×M items in memory, it doesn’t make much sense to use an RDD. But transposing it is easy: val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))) val transposed = sc.parallelize(rdd.collect.toSeq.transpose) If N or … Read more

Scala: Implicit parameter resolution precedence

I wrote my own answer in the form of a blog post revisiting implicits without import tax. Update: Furthermore, the comments from Martin Odersky in the above post revealed that the Scala 2.9.1’s behavior of LocalIntFoo winning over ImportedIntFoo is in fact a bug. See implicit parameter precedence again. 1) implicits visible to current invocation … Read more