Why does format("kafka") fail with "Failed to find data source: kafka." (even with uber-jar)?

The kafka data source is an external module and is not available to Spark applications by default. You have to define it as a dependency in your pom.xml (as you have done), but that is only the first step toward having it in your Spark application.

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>

With that dependency you have …
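As a rough illustration of making the module available at launch time rather than through a bundled jar, the sketch below passes the same Maven coordinate through the `spark.jars.packages` setting. This is one possible approach and not necessarily what the answer goes on to recommend; the helper names `maven_coordinate` and `build_session` and the application name are hypothetical, and the setting only takes effect if it is applied before the JVM starts.

```python
def maven_coordinate(group_id: str, artifact_id: str, version: str) -> str:
    """Join the three pom.xml fields into the group:artifact:version form."""
    return f"{group_id}:{artifact_id}:{version}"


# Same values as in the <dependency> element quoted in the answer.
KAFKA_SOURCE = maven_coordinate(
    "org.apache.spark", "spark-sql-kafka-0-10_2.11", "2.2.0"
)


def build_session(app_name: str = "kafka-demo"):
    """Create a session that asks Spark to resolve the Kafka module itself.

    Requires pyspark; the import is kept local so the helper above can be
    used without it. The app name is a hypothetical placeholder.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", KAFKA_SOURCE)
        .getOrCreate()
    )
```

The Scala suffix (`_2.11`) and the version in the coordinate are assumed to match the Spark build that runs the application.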

Spark on YARN + Secured HBase

You are not alone in the quest for Kerberos auth to HBase from Spark; cf. SPARK-12279. A little-known fact is that Spark now generates Hadoop “auth tokens” for YARN, HDFS, Hive, and HBase on startup. These tokens are then broadcast to the executors, so that they don’t have to deal with Kerberos auth, keytabs, etc. again. …

How to read records in JSON format from Kafka using Structured Streaming?

From the Spark perspective, value is just a byte sequence. Spark has no knowledge of the serialization format or content. To extract a field, you have to parse the value first. If the data is serialized as a JSON string, you have two options. You can cast value to StringType and use from_json and …
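A minimal sketch of the first option, cast to string and then parse with from_json, might look like the following. The payload fields `id` and `name` and the function names are hypothetical; `decode_record` is only a plain-Python analogue of what happens to a single record, while `parse_kafka_json` assumes a DataFrame read from the Kafka source with a binary `value` column.

```python
import json


def decode_record(value: bytes) -> dict:
    """Plain-Python analogue: bytes -> UTF-8 string -> parsed JSON object."""
    return json.loads(value.decode("utf-8"))


def parse_kafka_json(df):
    """Cast the Kafka `value` column to a string and parse it with from_json.

    Requires pyspark; the import is local so decode_record works without it.
    """
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (
        IntegerType, StringType, StructField, StructType,
    )

    # Hypothetical payload schema; replace it with the real one.
    schema = StructType([
        StructField("id", IntegerType()),
        StructField("name", StringType()),
    ])
    return (
        df.select(from_json(col("value").cast("string"), schema).alias("data"))
          .select("data.*")  # flatten the parsed struct into top-level columns
    )
```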

Spark Structured Streaming automatically converts timestamp to local time

For me it worked to use:

    spark.conf.set("spark.sql.session.timeZone", "UTC")

It tells Spark SQL to use UTC as the default time zone for timestamps. I used it in Spark SQL, for example:

    select *, cast('2017-01-01 10:10:10' as timestamp) from someTable

I know it does not work in 2.0.1, but it works in Spark 2.2. I used it in SQLTransformer …
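To see why the setting matters, the sketch below renders one fixed instant in two zones using only the standard library, then shows the configuration call wrapped in a helper. The `render` and `use_utc_session` names and the +02:00 offset are illustrative assumptions; the `spark` argument is assumed to be an existing SparkSession.

```python
from datetime import datetime, timedelta, timezone


def render(epoch_seconds: float, tz: timezone) -> str:
    """Format one absolute instant as wall-clock time in the given zone."""
    return datetime.fromtimestamp(epoch_seconds, tz).strftime("%Y-%m-%d %H:%M:%S")


# The literal from the SQL example above, interpreted as UTC.
instant = datetime(2017, 1, 1, 10, 10, 10, tzinfo=timezone.utc).timestamp()

as_utc = render(instant, timezone.utc)
# Hypothetical local zone two hours ahead of UTC.
as_local = render(instant, timezone(timedelta(hours=2)))


def use_utc_session(spark) -> None:
    """Pin the session time zone so timestamps are handled in UTC."""
    spark.conf.set("spark.sql.session.timeZone", "UTC")
```

The stored instant is the same in both cases; only the rendering differs, which is the effect the session time zone controls.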

Integrating Spark Structured Streaming with the Confluent Schema Registry

It took me a couple of months of reading source code and testing things out. In a nutshell, Spark can only handle String and Binary serialization. You must manually deserialize the data. In Spark, create the Confluent REST service object to get the schema. Convert the schema string in the response object into an Avro schema …
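One piece of the manual deserialization can be sketched without any Spark or Avro library. Messages written by the Confluent serializers carry a 5-byte header, a zero magic byte followed by a 4-byte big-endian schema ID, ahead of the Avro payload. The excerpt does not spell this out, so treat the sketch as background under that assumption; the function name `split_confluent_frame` is hypothetical, and decoding the Avro payload itself is left out.

```python
import struct

MAGIC_BYTE = 0      # first byte of a Confluent-framed message
HEADER_SIZE = 5     # 1 magic byte + 4-byte schema ID


def split_confluent_frame(value: bytes):
    """Return (schema_id, avro_payload) from a Confluent-framed Kafka value."""
    if len(value) < HEADER_SIZE:
        raise ValueError("value is too short to contain the 5-byte header")
    # ">" selects big-endian with no padding: one signed byte, one uint32.
    magic, schema_id = struct.unpack(">bI", value[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, value[HEADER_SIZE:]
```

The schema ID is what would be looked up in the registry to obtain the schema string, and the remaining bytes are what an Avro decoder would then read.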