How to query JSON data column using Spark DataFrames?
zero323’s answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():

    import org.apache.spark.sql.functions.from_json
    import spark.implicits._  // outside spark-shell, needed for .as[String] and $"..."

    // Infer the schema by reading the JSON strings themselves
    val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
    df.withColumn("jsonData", from_json($"jsonData", json_schema))

Here’s the Python equivalent:

    from pyspark.sql.functions import from_json

    # Infer the schema by reading the JSON strings themselves
    json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
    df.withColumn("jsonData", from_json("jsonData", json_schema))

The problem with schema_of_json(), as zero323 points …