data-partitioning – Make Me Engineer

QuickSort and Hoare Partition

May 30, 2023 by Tarik

To answer the question “Why does Hoare partitioning work?”, let’s simplify the values in the array to just three kinds: L values (those less than the pivot value), E values (those equal to the pivot value), and G values (those greater than the pivot value). We’ll also give a special name to one location … Read more
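For readers who want something concrete to trace the L/E/G argument against, here is a minimal sketch of a Hoare-style partition inside a quicksort. The function names and the choice of the middle element as pivot are assumptions made for illustration; the full post may use a different variant.

```python
def hoare_partition(a, lo, hi):
    """Partition a[lo..hi] in place and return an index p such that
    every element of a[lo..p] is <= every element of a[p+1..hi]."""
    pivot = a[(lo + hi) // 2]  # assumption: middle element as pivot
    i, j = lo - 1, hi + 1
    while True:
        # Move i right until it rests on a value that is not an L value.
        i += 1
        while a[i] < pivot:
            i += 1
        # Move j left until it rests on a value that is not a G value.
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j
        a[i], a[j] = a[j], a[i]


def quicksort(a, lo=0, hi=None):
    """Sort the list a in place using the partition above."""
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = hoare_partition(a, lo, hi)
        quicksort(a, lo, p)       # note: p is included on the left side
        quicksort(a, p + 1, hi)


data = [5, 3, 8, 3, 9, 1, 5, 2]
quicksort(data)
print(data)
```

Note that the returned index is a split point, not the final position of the pivot, which is why the left recursive call includes `p`.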

Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?

November 21, 2022 by Tarik

[EDIT: This answer has been revised in accordance with the revision to the question.] The key to using jq to solve the problem is the -c command-line option, which produces output in JSON-Lines format (i.e., in the present case, one object per line). You can then use a tool such as awk or split to … Read more
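As an alternative to awk or split for the second step, the chunking of JSON-Lines output can be sketched in Python. This is only an illustration: the helper names `chunked` and `write_chunks`, the `part_` file prefix, and the `.jsonl` suffix are hypothetical and do not come from the answer.

```python
import itertools


def chunked(lines, size):
    """Yield successive lists of at most `size` lines from any iterable."""
    it = iter(lines)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def write_chunks(lines, size, prefix="part_"):
    """Write each batch of lines to its own numbered file.

    Assumes every line is one JSON object terminated by a newline,
    which is what `jq -c` emits. Returns the list of paths written.
    """
    paths = []
    for index, batch in enumerate(chunked(lines, size)):
        path = f"{prefix}{index:05d}.jsonl"  # hypothetical naming scheme
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(batch)
        paths.append(path)
    return paths
```

A script could call `write_chunks(sys.stdin, 1000)` with the output of `jq -c` piped into it; because the input is consumed lazily, only one batch is held in memory at a time.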

python equivalent of filter() getting two output lists (i.e. partition of a list)

July 11, 2022 by Tarik

Try this:

    def partition(pred, iterable):
        trues = []
        falses = []
        for item in iterable:
            if pred(item):
                trues.append(item)
            else:
                falses.append(item)
        return trues, falses

Usage:

    >>> trues, falses = partition(lambda x: x > 10, [1,4,12,7,42])
    >>> trues
    [12, 42]
    >>> falses
    [1, 4, 7]

There is also an implementation suggestion in itertools recipes: from itertools import … Read more

Difference between df.repartition and DataFrameWriter partitionBy?

July 2, 2022 by Tarik

Watch out: I believe the accepted answer is not quite right! I’m glad you asked this question, because the behavior of these similarly named functions differs in important and unexpected ways that are not well documented in the official Spark documentation. The first part of the accepted answer is correct: calling df.repartition(COL, numPartitions=k) will create a … Read more

Create grouping variable for consecutive sequences and split vector

May 19, 2022 by Tarik

Making heavy use of some R idioms:

    > split(v, cumsum(c(1, diff(v) != 1)))
    $`1`
    [1] 1

    $`2`
    [1] 3 4 5

    $`3`
    [1] 9 10

    $`4`
    [1] 17

    $`5`
    [1] 29 30
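The same grouping idea can be sketched in Python for comparison: start a new group whenever the difference from the previous value is not 1. The function name `split_consecutive` is hypothetical, and the input vector is inferred from the R output above rather than quoted from the post.

```python
def split_consecutive(values):
    """Split a sequence of integers into runs of consecutive values."""
    groups = []
    for value in values:
        # Extend the current run only if this value follows the last one by exactly 1.
        if groups and value == groups[-1][-1] + 1:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


v = [1, 3, 4, 5, 9, 10, 17, 29, 30]  # inferred from the R output
print(split_consecutive(v))
# → [[1], [3, 4, 5], [9, 10], [17], [29, 30]]
```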