pyspark: count distinct over a window

EDIT: as noleto mentions in his answer below, there is now an approx_count_distinct function, available since PySpark 2.1, that works over a window.

Original answer (exact distinct count, not an approximation): we can use a combination of size and collect_set to mimic the functionality of countDistinct over a window: from pyspark.sql import functions as F, Window … Read more
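A minimal, self-contained sketch of the size/collect_set trick described above; the column names and sample data are invented here for illustration:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("a", 1), ("b", 3)],
    ["group", "value"],
)

w = Window.partitionBy("group")

# collect_set gathers the distinct values inside each window partition;
# size then counts them, giving an exact distinct count per row.
df = df.withColumn("exact_distinct", F.size(F.collect_set("value").over(w)))

# Since PySpark 2.1, approx_count_distinct also works over a window
# (approximate, but cheaper on large partitions).
df = df.withColumn("approx_distinct", F.approx_count_distinct("value").over(w))

df.show()
```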

Given a number, produce another random number that is the same every time and distinct from all other results

This sounds like a non-repeating random number generator. There are several possible approaches to this. As described in this article, we can select a prime number p that satisfies p % 4 = 3 and is large enough (greater than the maximum value in the output range), and generate the numbers this way: … Read more
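A minimal sketch of that quadratic-residue construction, under the stated assumption that p is prime and p % 4 == 3; the particular prime below (2^32 − 5) is one common choice, not something fixed by the question:

```python
P = 4294967291  # prime, 2**32 - 5; satisfies P % 4 == 3

def unique_random(x: int, p: int = P) -> int:
    """Deterministically map each x in [0, p) to a distinct value in [0, p).

    For a prime p with p % 4 == 3, squaring mod p sends each input to a
    quadratic residue; each residue has exactly one square root in each
    half of the range, so folding the upper half (p - residue) makes the
    whole mapping a bijection on [0, p).
    """
    if not 0 <= x < p:
        raise ValueError("x must be in [0, p)")
    residue = (x * x) % p
    return residue if x <= p // 2 else p - residue

# Same input always yields the same output, and no two inputs collide:
assert unique_random(42) == unique_random(42)
assert len({unique_random(x) for x in range(10_000)}) == 10_000
```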

Spark DataFrame: count distinct values of every column

In PySpark you could do something like this, using countDistinct():

```python
from pyspark.sql.functions import col, countDistinct

df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))
```

Similarly, in Scala:

```scala
import org.apache.spark.sql.functions.{col, countDistinct}

df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)
```

If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct().
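For instance, with some made-up sample data (invented here purely for illustration), each alias carries its column's distinct count:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("x", 1, "u"), ("x", 2, "u"), ("y", 2, "v")],
    ["a", "b", "c"],
)

# One countDistinct aggregate per column, each aliased to the column name.
df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).show()
# -> a=2, b=2, c=2
```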
