Using jq, how can I split a very large JSON file into multiple files, each containing a specific number of objects?

[EDIT: This answer has been revised in accordance with the revision to the question.] The key to using jq to solve the problem is the -c command-line option, which produces compact output in JSON Lines format (i.e., in the present case, one object per line). You can then use a tool such as awk or split to …
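The splitting step that follows jq -c can also be sketched in Python. This is only an illustration, not the approach the answer goes on to describe: it assumes the top-level JSON value is an array, so that something like jq -c '.[]' input.json emits one object per line, and the function name split_json_lines, the part_ prefix, and the .jsonl extension are all hypothetical choices.

```python
from itertools import islice
from pathlib import Path


def split_json_lines(lines, per_file, out_dir, prefix="part_"):
    """Write at most `per_file` JSON Lines records to each output file.

    `lines` is any iterable of strings, one JSON object per line
    (for example sys.stdin fed by `jq -c`). Returns the list of paths written.
    """
    if per_file < 1:
        raise ValueError("per_file must be at least 1")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Skip blank lines so they do not count as objects.
    records = (line for line in lines if line.strip())

    paths = []
    while True:
        # Pull the next batch lazily; the whole input is never held in memory.
        chunk = list(islice(records, per_file))
        if not chunk:
            break
        path = out_dir / f"{prefix}{len(paths):05d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for line in chunk:
                fh.write(line.rstrip("\n") + "\n")
        paths.append(path)
    return paths
```

Called as split_json_lines(sys.stdin, 1000, "out") at the end of a pipeline, this would write 1000 objects per file, with the last file holding whatever remains.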

Python equivalent of filter() getting two output lists (i.e. partition of a list)

Try this:

    def partition(pred, iterable):
        trues = []
        falses = []
        for item in iterable:
            if pred(item):
                trues.append(item)
            else:
                falses.append(item)
        return trues, falses

Usage:

    >>> trues, falses = partition(lambda x: x > 10, [1,4,12,7,42])
    >>> trues
    [12, 42]
    >>> falses
    [1, 4, 7]

There is also an implementation suggestion in itertools recipes: from itertools import …
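For comparison, a lazy variant can be sketched with itertools.tee and itertools.filterfalse. This is an illustration only and not necessarily the recipe referred to above; the name lazy_partition is hypothetical, and the sketch assumes pred has no side effects, since it ends up being called once per item for each of the two outputs.

```python
from itertools import filterfalse, tee


def lazy_partition(pred, iterable):
    """Return two lazy iterators: items where pred is true, then items where it is false."""
    # tee gives two independent iterators over the same underlying input.
    for_trues, for_falses = tee(iterable)
    return filter(pred, for_trues), filterfalse(pred, for_falses)


trues, falses = lazy_partition(lambda x: x > 10, [1, 4, 12, 7, 42])
print(list(trues))   # [12, 42]
print(list(falses))  # [1, 4, 7]
```

Unlike the list-based version, nothing is evaluated until the iterators are consumed, though tee buffers items internally if one iterator runs far ahead of the other.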

Difference between df.repartition and DataFrameWriter partitionBy?

Watch out: I believe the accepted answer is not quite right! I’m glad you asked this question, because the behavior of these similarly named functions differs in important and unexpected ways that are not well documented in the official Spark documentation. The first part of the accepted answer is correct: calling df.repartition(k, COL) will create a …