How to understand the dynamic programming solution in linear partitioning?

Be aware that there’s a small mistake in the explanation of the algorithm in the book; look in the errata for the text “(*) Page 297”. About your questions: no, the items don’t need to be sorted, only contiguous (that is, you can’t rearrange them). I believe the easiest way to visualize the algorithm is …
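The dynamic program the answer refers to can be sketched in a few lines. This is a minimal Python version (the function name and structure are mine, not taken from the book): it returns the minimized maximum block sum when splitting a sequence into k contiguous blocks.

```python
# Minimal sketch of the linear-partition DP: split a sequence into k
# contiguous blocks while minimizing the largest block sum.
# (Names are illustrative, not from the book.)
def linear_partition(items, k):
    n = len(items)
    prefix = [0] * (n + 1)
    for i, x in enumerate(items):
        prefix[i + 1] = prefix[i] + x  # prefix[j] = sum of items[:j]

    # cost[i][j] = minimal max-block-sum when items[:i] is split into j blocks
    INF = float("inf")
    cost = [[INF] * (k + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            for split in range(j - 1, i):  # last block is items[split:i]
                last = prefix[i] - prefix[split]
                cost[i][j] = min(cost[i][j], max(cost[split][j - 1], last))
    return cost[n][k]
```

For example, `linear_partition([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)` yields 17, corresponding to the blocks [1..5], [6, 7], [8, 9].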

Why does sortBy transformation trigger a Spark job?

sortBy is implemented using sortByKey, which depends on a RangePartitioner (JVM) or a partitioning function (Python). When you call sortBy / sortByKey, the partitioner (partitioning function) is initialized eagerly and samples the input RDD to compute partition boundaries. The job you see corresponds to this process. The actual sorting is performed only if you execute an action on the newly …
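The eager sampling step can be illustrated in plain Python. This is only a sketch of the idea behind a range partitioner (sample the data once, then use the sampled order statistics as partition boundaries); it is not Spark's actual implementation, and all names are my own.

```python
import random

# Sketch of the idea behind a range partitioner: sample the input once
# (in Spark this sampling pass is the eagerly-executed job you see),
# then derive partition boundaries from the sorted sample.
# Illustrative only; NOT Spark's implementation.
def range_boundaries(data, num_partitions, sample_size=100, seed=0):
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    # Pick num_partitions - 1 evenly spaced cut points from the sample.
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition_of(value, boundaries):
    # A value goes to the first range whose upper boundary exceeds it.
    for idx, bound in enumerate(boundaries):
        if value < bound:
            return idx
    return len(boundaries)
```

Only after the boundaries exist can records be routed to partitions and sorted locally, which is why the sampling pass must run up front.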

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Yes, a Spark application has one and only one driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors: you can think of the worker as the machine/node of your cluster and the executor as a process (executing in a core) that runs on that worker. So …
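The worker/executor/core relationship reduces to simple arithmetic. The helper below is an illustrative sketch (the names and the partitions-per-core rule of thumb are my assumptions, not a Spark API):

```python
# Illustrative sketch of the worker/executor/core relationship.
# Names and the 2x-partitions-per-core rule of thumb are assumptions,
# not part of any Spark API.
def cluster_parallelism(num_worker_nodes, executors_per_node,
                        cores_per_executor, partitions_per_core=2):
    num_executors = num_worker_nodes * executors_per_node   # executors are processes on workers
    total_cores = num_executors * cores_per_executor        # concurrent tasks the cluster can run
    suggested_partitions = total_cores * partitions_per_core
    return num_executors, total_cores, suggested_partitions
```

For example, 3 worker nodes, each hosting 2 executors with 4 cores apiece, gives 6 executors, 24 concurrent tasks, and a suggested 48 partitions.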

Default Partitioning Scheme in Spark

You have to distinguish between two different things: partitioning as distributing data between partitions depending on the value of the key, which is limited to pairwise RDDs (RDD[(T, U)]) and creates a relationship between a partition and the set of keys that can be found on it; and partitioning as splitting input into multiple …
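The first kind (key-based partitioning) can be sketched in plain Python. This mirrors the idea behind a hash partitioner, where equal keys always land on the same partition; it is not Spark's implementation (Spark's hashing differs from Python's built-in `hash`), and the function name is mine.

```python
# Sketch of key-based partitioning for pair data: each (key, value)
# goes to a partition determined solely by its key, so equal keys
# always land together. Mirrors the idea of a hash partitioner;
# not Spark's implementation.
def hash_partition(pairs, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions
```

This key-to-partition relationship is what enables shuffle-free joins and groupBys later; the second kind of partitioning (splitting input) gives no such guarantee.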

Efficient way to divide a list into lists of n size

You’ll want to do something that makes use of List.subList(int, int) views rather than copying each sublist. To do this really easily, use Guava’s Lists.partition(List, int) method:

    List<Foo> foos = …
    for (List<Foo> partition : Lists.partition(foos, n)) {
        // do something with partition
    }

Note that this, like many things, isn’t very efficient with a …
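The same chunking interface can be sketched in Python. Note the caveat: Guava's Lists.partition returns views over the original list, while Python slices copy, so this generator mirrors the interface rather than the zero-copy behavior.

```python
# Sketch of Lists.partition-style chunking: yield successive sublists
# of size n; the final chunk may be shorter. Unlike Guava's views,
# Python slices copy their elements.
def partition(lst, n):
    if n <= 0:
        raise ValueError("partition size must be positive")
    for start in range(0, len(lst), n):
        yield lst[start:start + n]
```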

Avoid performance impact of a single partition mode in Spark window functions

In practice the performance impact will be almost the same as if you omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally, and iterated sequentially one by one. The only difference is in the total number of partitions created. Let’s illustrate that with an example using a simple dataset …
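What happens under the hood can be sketched in plain Python: rows are grouped by the partition key, each group is sorted by the ordering key, then scanned sequentially to compute the window value. With a constant key (the no-partitionBy case) there is a single group, i.e. all the work lands on one partition. Illustrative only; all names are mine, not Spark code.

```python
from itertools import groupby
from operator import itemgetter

# Sketch of a window aggregation (running sum ordered by "ts") computed
# the way Spark does it: group rows by the partition key, sort each
# group, scan sequentially. A constant key collapses everything into a
# single group, i.e. the single-partition case. Not Spark code.
def running_sum(rows, key_fn, order_key="ts", value_key="v"):
    out = []
    for _, group in groupby(sorted(rows, key=key_fn), key=key_fn):
        total = 0
        for row in sorted(group, key=itemgetter(order_key)):
            total += row[value_key]
            out.append({**row, "running": total})
    return out
```

With `key_fn=itemgetter("k")` each key's rows are processed independently (parallelizable); with `key_fn=lambda r: 0` every row flows through one sequential scan.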

How to optimize partitioning when migrating data from JDBC source?

Determine how many partitions you need given the amount of input data and your cluster resources. As a rule of thumb it is better to keep partition input under 1GB unless strictly necessary, and strictly smaller than the block size limit. You’ve previously stated that you migrate 1TB of data; values you use in different …
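The rule of thumb translates into simple arithmetic. A hedged sketch (the function name and the 1 GB default are illustrative, not a Spark API):

```python
import math

# Sketch of the rule of thumb: the number of partitions needed so that
# each partition's input stays under a target size (1 GB by default).
# Names are illustrative, not part of any Spark API.
def suggested_partitions(total_bytes, target_partition_bytes=1 << 30):
    return max(1, math.ceil(total_bytes / target_partition_bytes))
```

For 1 TB of input at a 1 GB ceiling this gives 1024 partitions; you would then round up further if the block size limit or your core count suggests more.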

LINQ Partition List into Lists of 8 members [duplicate]

Use the following extension method to break the input into subsets:

    public static class IEnumerableExtensions {
        public static IEnumerable<List<T>> InSetsOf<T>(this IEnumerable<T> source, int max) {
            List<T> toReturn = new List<T>(max);
            foreach (var item in source) {
                toReturn.Add(item);
                if (toReturn.Count == max) {
                    yield return toReturn;
                    toReturn = new List<T>(max);
                }
            }
            if (toReturn.Any()) {
                yield return …