How do I get a SQL row_number equivalent for a Spark RDD?
The row_number() over (partition by … order by …) functionality was added in Spark 1.4. This answer uses PySpark/DataFrames.

Create a test DataFrame:

    from pyspark.sql import Row, functions as F

    testDF = sc.parallelize(
        (Row(k="key1", v=(1, 2, 3)), Row(k="key1", v=(1, 4, 7)), Row(k="key1", v=(2, 2, 3)),
         Row(k="key2", v=(5, 5, 5)), Row(k="key2", v=(5, 5, 9)), Row(k="key2", v=(7, 5, 5)))
    ).toDF()

Add the partitioned row number:

    from pyspark.sql.window import …
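Since the answer is cut off above, here is a plain-Python sketch (not Spark code) of what row_number() over (partition by k order by v) computes on the same test data. The `row_number` helper and its `key`/`order` parameters are illustrative names, not part of any Spark or SQL API; they just make the partition-then-rank semantics concrete.

```python
from itertools import groupby

def row_number(rows, key, order):
    """Emulate SQL row_number() over (partition by key order by order).

    Sort by (partition key, ordering key), group by partition key, and
    assign 1-based ranks within each group.
    """
    numbered = []
    for _, group in groupby(sorted(rows, key=lambda r: (key(r), order(r))),
                            key=key):
        for i, row in enumerate(group, start=1):
            numbered.append((row, i))
    return numbered

# Same (k, v) pairs as the testDF above.
data = [("key1", (1, 2, 3)), ("key1", (1, 4, 7)), ("key1", (2, 2, 3)),
        ("key2", (5, 5, 5)), ("key2", (5, 5, 9)), ("key2", (7, 5, 5))]

numbered = row_number(data, key=lambda r: r[0], order=lambda r: r[1])
# Each row is paired with its rank within its k partition:
# (("key1", (1, 2, 3)), 1), (("key1", (1, 4, 7)), 2), ...
```

In Spark itself the same result comes from a window specification (partitionBy/orderBy) passed to F.row_number().over(...); the numbering restarts at 1 for each distinct value of k.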