How do I get a SQL row_number equivalent for a Spark RDD?

The row_number() over (partition by … order by …) functionality was added in Spark 1.4. This answer uses PySpark/DataFrames. Create a test DataFrame:

    from pyspark.sql import Row, functions as F

    testDF = sc.parallelize((
        Row(k="key1", v=(1, 2, 3)),
        Row(k="key1", v=(1, 4, 7)),
        Row(k="key1", v=(2, 2, 3)),
        Row(k="key2", v=(5, 5, 5)),
        Row(k="key2", v=(5, 5, 9)),
        Row(k="key2", v=(7, 5, 5)),
    )).toDF()

Add the partitioned row number: from pyspark.sql.window import …
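The excerpt cuts off at the window import. A minimal sketch of the remaining step, assuming the testDF above and a current PySpark (the function was named rowNumber in Spark 1.4 and is row_number in later releases):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Number the rows within each partition of k, ordered by the struct column v.
    w = Window.partitionBy("k").orderBy("v")

    testDF.withColumn("rowNum", F.row_number().over(w)).show()
    # Each k group is numbered 1, 2, 3 in v order, e.g. key1/(1,2,3) gets rowNum 1.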

Add some kind of row number to a mongodb aggregate command / pipeline

Not sure about the performance in big queries, but this is at least an option. You can add your results to an array by grouping/pushing and then unwind with includeArrayIndex, like this:

    [
      {$match: {author: {$ne: 1}}},
      {$limit: 10000},
      {$group: {
        _id: 1,
        book: {$push: {title: '$title', author: '$author', copies: '$copies'}}
      }},
      {$unwind: {path: '$book', …
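To keep the code examples in one language, here is the same pipeline as a runnable PyMongo sketch; the connection string and the database/collection names are assumptions, and includeArrayIndex requires MongoDB 3.2 or later:

    import pymongo

    client = pymongo.MongoClient("mongodb://localhost:27017")  # assumed URI
    books = client["test"]["books"]                            # assumed names

    pipeline = [
        {"$match": {"author": {"$ne": 1}}},
        {"$limit": 10000},
        # Collapse the matches into a single document, pushing each book
        # into an array so every element has a stable position.
        {"$group": {
            "_id": 1,
            "book": {"$push": {"title": "$title", "author": "$author",
                               "copies": "$copies"}},
        }},
        # Re-expand the array; includeArrayIndex exposes each element's
        # position as a zero-based row number (add 1 for SQL-style numbering).
        {"$unwind": {"path": "$book", "includeArrayIndex": "rowNumber"}},
    ]

    for doc in books.aggregate(pipeline):
        print(doc["rowNumber"], doc["book"])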

How do I use ROW_NUMBER()?

For the first question, why not just use SELECT COUNT(*) FROM myTable to get the count? And for the second question, the primary key of the row is what should be used to identify a particular row; don't try to use the row number for that. If you returned Row_Number() in your main query, SELECT …
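Both points in one self-contained sketch, using Python's sqlite3 for illustration (the table and data are invented for the example, and window functions require a build that bundles SQLite 3.25+):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO myTable (name) VALUES (?)",
                     [("alice",), ("bob",), ("carol",)])

    # Counting rows: plain COUNT(*), no row numbers needed.
    (total,) = conn.execute("SELECT COUNT(*) FROM myTable").fetchone()
    print("rows:", total)

    # ROW_NUMBER() is a position within one result's ordering; it can change
    # between queries, so identify rows by the primary key instead.
    for rn, row_id, name in conn.execute(
            "SELECT ROW_NUMBER() OVER (ORDER BY name) AS rn, id, name "
            "FROM myTable"):
        print(rn, row_id, name)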