distinct-values – Make Me Engineer

pyspark: count distinct over a window

June 13, 2023 by Tarik

EDIT: as noleto mentions in his answer below, approx_count_distinct has been available since PySpark 2.1 and works over a window.

Original answer – exact distinct count (not an approximation)

We can use a combination of size and collect_set to mimic the functionality of countDistinct over a window:

from pyspark.sql import functions as F, Window … Read more

Given a number, produce another random number that is the same every time and distinct from all other results

June 4, 2023 by Tarik

This sounds like a non-repeating random number generator. There are several possible approaches to this. As described in this article, we can select a prime number p that satisfies p % 4 = 3 and is large enough (greater than the maximum value in the output range), then generate the numbers this way: … Read more
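The excerpt stops before showing the construction. One well-known technique that fits the stated condition on p is a quadratic-residue permutation, sketched below on the assumption that this is the kind of approach meant; it may differ from the linked article. The function name permute_qpr and the small prime 11 are hypothetical choices for illustration.

```python
def permute_qpr(x: int, p: int) -> int:
    """Map x in [0, p) to a unique value in [0, p).

    Assumes p is a prime with p % 4 == 3 (primality is not checked here).
    """
    if p % 4 != 3:
        raise ValueError("p must satisfy p % 4 == 3")
    if not 0 <= x < p:
        raise ValueError("x must be in [0, p)")
    residue = (x * x) % p
    # Lower half of the inputs keeps the residue; upper half mirrors it,
    # which is what makes the mapping one-to-one when p % 4 == 3.
    return residue if x <= p // 2 else p - residue


p = 11  # small prime with 11 % 4 == 3, chosen only for illustration
outputs = [permute_qpr(x, p) for x in range(p)]
print(outputs)  # [0, 1, 4, 9, 5, 3, 8, 6, 2, 7, 10]
```

The same input always yields the same output, and no two inputs below p share an output. The sequence for consecutive inputs still shows visible structure, so practical uses would likely add further mixing on top of this step.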

List distinct values in a vector in R

November 7, 2022 by Tarik

Do you mean unique:

R> x = c(1,1,2,3,4,4,4)
R> x
[1] 1 1 2 3 4 4 4
R> unique(x)
[1] 1 2 3 4

How to create a HashSet with distinct elements?

October 5, 2022 by Tarik

Here is a possible comparer that compares an IEnumerable<T> by its elements. You still need to sort manually before adding. One could build the sorting into the comparer, but I don’t think that’s a wise choice. Adding a canonical form of the list seems wiser. This code will only work in .NET 4 since it … Read more
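The comparer itself is cut off in the excerpt. To illustrate only the "canonical form" suggestion, here is a rough Python analogue rather than the answer's C# code: each list is sorted and frozen into a tuple before it goes into the set, so lists that differ only in order collapse to one entry. The helper name canonical and the sample lists are hypothetical.

```python
def canonical(items):
    """Return an order-independent, hashable form of a list (a sorted tuple)."""
    return tuple(sorted(items))


seen = set()
for candidate in ([3, 1, 2], [2, 3, 1], [1, 2], [2, 1], [1, 2, 2]):
    # Lists that differ only in element order map to the same canonical tuple.
    seen.add(canonical(candidate))

print(sorted(seen))  # [(1, 2), (1, 2, 2), (1, 2, 3)]
```

Repeated elements are kept, so [1, 2, 2] stays distinct from [1, 2]; if multiplicity should not matter, a frozenset would be the canonical form instead.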

Spark DataFrame: count distinct values of every column

July 21, 2022 by Tarik

In PySpark you could do something like this, using countDistinct():

from pyspark.sql.functions import col, countDistinct

df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns))

Similarly in Scala:

import org.apache.spark.sql.functions.countDistinct
import org.apache.spark.sql.functions.col

df.select(df.columns.map(c => countDistinct(col(c)).alias(c)): _*)

If you want to speed things up at the potential loss of accuracy, you could also use approxCountDistinct() (named approx_count_distinct() since Spark 2.1).

Counting unique / distinct values by group in a data frame

April 28, 2022 by Tarik

In dplyr you may use n_distinct to “count the number of unique values”:

library(dplyr)

myvec %>%
  group_by(name) %>%
  summarise(n_distinct(order_no))

Java 8 Distinct by property

April 27, 2022 by Tarik
