How to make a dataframe for kafka streaming using PySpark?

As you said,

kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer3", {topic: 1})
lines = kvs.map(lambda x: x[1])

Here, lines is a dStream of rdds and not a single a rdd in itself. Hence, to get a dataframe you have to convert it into a dStream of dataframes.
Something like this,

lines.foreachRDD(lambda rdd: rdd.toDF())

Leave a Comment