How to query JSON data column using Spark DataFrames?

zero323’s answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():

import org.apache.spark.sql.functions.from_json

val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))

Here’s the Python equivalent:

from pyspark.sql.functions import from_json

json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))

The problem with schema_of_json(), as zero323 points … Read more
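To make that snippet concrete, here is a minimal self-contained sketch written against a recent Spark version, assuming a toy DataFrame whose jsonData column holds JSON strings (the sample records and field names are invented for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

val spark = SparkSession.builder().master("local[*]").appName("json-column").getOrCreate()
import spark.implicits._

// Toy DataFrame with a single string column containing JSON documents.
val df = Seq(
  """{"device": "sensor-1", "reading": 42.0}""",
  """{"device": "sensor-2", "reading": 17.5}"""
).toDF("jsonData")

// Infer a schema from the JSON strings themselves, then parse the column in place.
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
val parsed = df.withColumn("jsonData", from_json($"jsonData", json_schema))

// Nested fields can now be addressed with ordinary column syntax.
parsed.select($"jsonData.device", $"jsonData.reading").show()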

How to store custom objects in Dataset?

Update: this answer is still valid and informative, although things are now better since 2.2/2.3, which add built-in encoder support for Set, Seq, Map, Date, Timestamp, and BigDecimal. If you stick to making types with only case classes and the usual Scala types, you should be fine with just the implicit in SQLImplicits. Unfortunately, virtually … Read more
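For a sense of what "sticking to case classes and the usual Scala types" looks like in practice, here is a small sketch assuming Spark 2.2+ and an invented Measurement case class; importing spark.implicits._ is what brings the built-in encoders into scope:

import org.apache.spark.sql.SparkSession

// Invented example type: a product of ordinary Scala types, so the implicit
// product encoder provided by SQLImplicits can serialize it.
case class Measurement(id: Long, tags: Seq[String], value: BigDecimal)

val spark = SparkSession.builder().master("local[*]").appName("dataset-encoders").getOrCreate()
import spark.implicits._  // built-in encoders, including Seq and BigDecimal since 2.2/2.3

val ds = Seq(
  Measurement(1L, Seq("a", "b"), BigDecimal(1.5)),
  Measurement(2L, Seq("c"), BigDecimal(2.0))
).toDS()

// Typed operations work directly on the case class.
ds.filter(_.value > BigDecimal(1)).show()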

How do I get around type erasure on Scala? Or, why can’t I get the type parameter of my collections?

You can do this using TypeTags (as Daniel already mentions, but I’ll just spell it out explicitly):

import scala.reflect.runtime.universe._

def matchList[A: TypeTag](list: List[A]) = list match {
  case strlist: List[String @unchecked] if typeOf[A] =:= typeOf[String] => println("A list of strings!")
  case intlist: List[Int @unchecked] if typeOf[A] =:= typeOf[Int] => println("A list of ints!")
}

You … Read more
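A slightly expanded, self-contained version of that snippet, with an added fallback case and a couple of example calls (the fallback and the sample lists are illustrative only), shows the effect: the TypeTag context bound carries the element type to runtime, so the pattern guards can test it even though the List itself is erased:

import scala.reflect.runtime.universe._

def matchList[A: TypeTag](list: List[A]): Unit = list match {
  case strlist: List[String @unchecked] if typeOf[A] =:= typeOf[String] =>
    println("A list of strings!")
  case intlist: List[Int @unchecked] if typeOf[A] =:= typeOf[Int] =>
    println("A list of ints!")
  case _ =>
    println("A list of something else")  // added so other element types don't throw MatchError
}

matchList(List("a", "b"))  // prints: A list of strings!
matchList(List(1, 2, 3))   // prints: A list of ints!
matchList(List(1.0, 2.0))  // prints: A list of something else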

What are all the uses of an underscore in Scala?

The ones I can think of are:

Existential types
  def foo(l: List[Option[_]]) = …
Higher-kinded type parameters
  case class A[K[_], T](a: K[T])
Ignored variables
  val _ = 5
Ignored parameters
  List(1, 2, 3) foreach { _ => println("Hi") }
Ignored names of self types
  trait MySeq { _: Seq[_] => }
Wildcard patterns
  Some(5) match … Read more
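Several of those uses fit in one short compilable sketch; the names foo, A, and MySeq follow the fragments above, and the filled-in bodies are illustrative only:

// Existential type: we only care that each element is an Option of something;
// the lambda also uses placeholder syntax.
def foo(l: List[Option[_]]) = l.count(_.isDefined)

// Higher-kinded type parameter: K abstracts over a one-argument type constructor.
case class A[K[_], T](a: K[T])

// Ignored variable and ignored parameter.
val _ = 5
List(1, 2, 3).foreach { _ => println("Hi") }

// Ignored name of a self type.
trait MySeq { _: Seq[_] => }

// Wildcard pattern.
val opt: Option[Int] = Some(5)
opt match {
  case Some(_) => println("got something")
  case None    => println("got nothing")
}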
