Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

A much better way to do this is to use the rdd.aggregateByKey() method. Because this method is so poorly documented in the Apache Spark with Python documentation (which is why I wrote this Q&A), until recently I had been using the above code sequence. But again, it’s less efficient, so avoid doing …
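As a rough illustration of why aggregateByKey() suits per-key averages, here is a minimal plain-Python sketch of the data flow: a (sum, count) accumulator is built per key inside each partition, then the partition results are merged. The sample data, the two-partition split, and the helper name aggregate_by_key are assumptions made for this example; it imitates the pattern and is not Spark's implementation.

```python
# Hypothetical sample data: (key, value) pairs split across two "partitions".
partitions = [
    [("a", 1.0), ("b", 4.0), ("a", 3.0)],
    [("b", 6.0), ("a", 5.0)],
]

zero = (0.0, 0)  # accumulator shape: (running sum, running count)


def seq_func(acc, value):
    """Fold one value into a partition-local (sum, count) accumulator."""
    return (acc[0] + value, acc[1] + 1)


def comb_func(left, right):
    """Merge two (sum, count) accumulators that come from different partitions."""
    return (left[0] + right[0], left[1] + right[1])


def aggregate_by_key(parts, zero_value, seq, comb):
    """Hypothetical stand-in that imitates the aggregateByKey data flow."""
    merged = {}
    for part in parts:
        # Step 1: aggregate within a single partition.
        local = {}
        for key, value in part:
            local[key] = seq(local.get(key, zero_value), value)
        # Step 2: merge the partition-local results across partitions.
        for key, acc in local.items():
            merged[key] = comb(merged[key], acc) if key in merged else acc
    return merged


sums_counts = aggregate_by_key(partitions, zero, seq_func, comb_func)
averages = {key: total / count for key, (total, count) in sums_counts.items()}
print(averages)

# With PySpark, the same three pieces would be passed as:
#   rdd.aggregateByKey(zero, seq_func, comb_func) \
#      .mapValues(lambda sc: sc[0] / sc[1])
```

Only one small (sum, count) tuple per key leaves each partition, which is the likely reason this pattern shuffles less data than collecting every value per key first.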

SQL Server : SUM() of multiple rows including where clauses

This will bring back totals per property and type:

    SELECT PropertyID, TYPE, SUM(Amount)
    FROM yourTable
    GROUP BY PropertyID, TYPE

This will bring back only active values:

    SELECT PropertyID, TYPE, SUM(Amount)
    FROM yourTable
    WHERE EndDate IS NULL
    GROUP BY PropertyID, TYPE

and this will bring back totals for properties:

    SELECT PropertyID, SUM(Amount) FROM yourTable WHERE EndDate …
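To try the first two queries without a SQL Server instance, the following sketch runs them against an in-memory SQLite database from Python. The rows, the column types, and the added ORDER BY (used only to make the output order predictable) are assumptions for illustration; SQLite is a stand-in here, and the truncated third query is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable "
    "(PropertyID INTEGER, TYPE TEXT, Amount REAL, EndDate TEXT)"
)

# Hypothetical rows: a NULL EndDate marks a value that is still active.
rows = [
    (1, "Rent", 100.0, None),
    (1, "Rent", 50.0, "2012-01-01"),
    (1, "Fee", 20.0, None),
    (2, "Rent", 200.0, None),
]
conn.executemany("INSERT INTO yourTable VALUES (?, ?, ?, ?)", rows)

# Totals per property and type (ORDER BY added only for stable output).
totals = conn.execute(
    "SELECT PropertyID, TYPE, SUM(Amount) FROM yourTable "
    "GROUP BY PropertyID, TYPE ORDER BY PropertyID, TYPE"
).fetchall()

# Same grouping, restricted to active rows.
active = conn.execute(
    "SELECT PropertyID, TYPE, SUM(Amount) FROM yourTable "
    "WHERE EndDate IS NULL "
    "GROUP BY PropertyID, TYPE ORDER BY PropertyID, TYPE"
).fetchall()

print(totals)
print(active)
conn.close()
```

The ended 50.0 rent row counts toward the overall total for property 1 but drops out of the active total, which is the only difference between the two result sets.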

Pass percentiles to pandas agg function

Perhaps not super efficient, but one way would be to create a function yourself:

    def percentile(n):
        def percentile_(x):
            return np.percentile(x, n)
        percentile_.__name__ = 'percentile_%s' % n
        return percentile_

Then include this in your agg:

    In [11]: column.agg([np.sum, np.mean, np.std, np.median, np.var, np.min, np.max, percentile(50), percentile(95)])
    Out[11]:
    sum mean std median var amin amax percentile_50 percentile_95 …

aggregate() vs annotate() in Django

I would focus on the example queries rather than your quote from the documentation. Aggregate calculates values for the entire queryset. Annotate calculates summary values for each item in the queryset.

Aggregation

    >>> Book.objects.aggregate(average_price=Avg('price'))
    {'average_price': 34.35}

Returns a dictionary containing the average price of all books in the queryset.

Annotation

    >>> q = Book.objects.annotate(num_authors=Count('authors'))
    >>> …

Linq to Objects – return pairs of numbers from list of numbers

None of the default LINQ methods can do this lazily and with a single scan. Zipping the sequence with itself does two scans, and grouping is not entirely lazy. Your best bet is to implement it directly:

    public static IEnumerable<T[]> Partition<T>(this IEnumerable<T> sequence, int partitionSize)
    {
        Contract.Requires(sequence != null);
        Contract.Requires(partitionSize > 0);
        var buffer = …
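The C# above is cut off, so as a language-neutral illustration of the same idea (lazy, single scan, fixed-size groups), here is a minimal Python generator sketch. It is not a port of the original method: the name partition, the choice to yield a shorter final group when the input length is not a multiple of the size, and the ValueError are all assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def partition(sequence: Iterable[T], partition_size: int) -> Iterator[List[T]]:
    """Lazily yield consecutive groups of up to partition_size items.

    The input is consumed exactly once and only as far as the caller iterates.
    Because this is a generator, the size check runs on the first iteration.
    """
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    iterator = iter(sequence)
    while True:
        # Pull at most partition_size items from the single shared iterator.
        chunk = list(islice(iterator, partition_size))
        if not chunk:
            return
        yield chunk  # assumption: a short trailing group is still yielded


pairs = list(partition([1, 2, 3, 4, 5], 2))
print(pairs)
```

Sharing one iterator across all groups is what keeps this to a single pass; each group buffers only partition_size items at a time.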
