How to understand the dynamic programming solution in linear partitioning?

Be aware that there’s a small mistake in the explanation of the algorithm in the book; look in the errata for the text “(*) Page 297”. About your questions: no, the items don’t need to be sorted, only contiguous (that is, you can’t rearrange them). I believe the easiest way to visualize the algorithm is …
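The dynamic program the answer refers to can be sketched in a few lines. This is a minimal Python version (the function name and structure are mine, not taken from the book): it returns the minimized maximum block sum when splitting a sequence into k contiguous blocks.

```python
# Minimal sketch of the linear-partition DP: split a sequence into k
# contiguous blocks while minimizing the largest block sum.
# (Names are illustrative, not from the book.)
def linear_partition(items, k):
    n = len(items)
    prefix = [0] * (n + 1)
    for i, x in enumerate(items):
        prefix[i + 1] = prefix[i] + x  # prefix[j] = sum of items[:j]

    # cost[i][j] = minimal max-block-sum when items[:i] is split into j blocks
    INF = float("inf")
    cost = [[INF] * (k + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            for split in range(j - 1, i):  # last block is items[split:i]
                last = prefix[i] - prefix[split]
                cost[i][j] = min(cost[i][j], max(cost[split][j - 1], last))
    return cost[n][k]
```

For example, `linear_partition([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)` yields 17, corresponding to the blocks [1..5], [6, 7], [8, 9].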

Why does sortBy transformation trigger a Spark job?

sortBy is implemented using sortByKey, which depends on a RangePartitioner (JVM) or a partitioning function (Python). When you call sortBy / sortByKey, the partitioner (partitioning function) is initialized eagerly and samples the input RDD to compute partition boundaries. The job you see corresponds to this process. The actual sorting is performed only if you execute an action on the newly …
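The eager sampling step can be illustrated in plain Python. This is only a sketch of the idea behind a range partitioner (sample the data once, then use the sampled order statistics as partition boundaries); it is not Spark's actual implementation, and all names are my own.

```python
import random

# Sketch of the idea behind a range partitioner: sample the input once
# (in Spark this sampling pass is the eagerly-executed job you see),
# then derive partition boundaries from the sorted sample.
# Illustrative only; NOT Spark's implementation.
def range_boundaries(data, num_partitions, sample_size=100, seed=0):
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    # Pick num_partitions - 1 evenly spaced cut points from the sample.
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition_of(value, boundaries):
    # A value goes to the first range whose upper boundary exceeds it.
    for idx, bound in enumerate(boundaries):
        if value < bound:
            return idx
    return len(boundaries)
```

Only after the boundaries exist can records be routed to partitions and sorted locally, which is why the sampling pass must run up front.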

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

Yes, a Spark application has one and only one driver. What is the relationship between numWorkerNodes and numExecutors? A worker can host multiple executors: you can think of the worker as the machine/node of your cluster and the executor as a process (executing in a core) that runs on that worker. So …
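The worker/executor/core relationship reduces to simple arithmetic. The helper below is an illustrative sketch (the names and the partitions-per-core rule of thumb are my assumptions, not a Spark API):

```python
# Illustrative sketch of the worker/executor/core relationship.
# Names and the 2x-partitions-per-core rule of thumb are assumptions,
# not part of any Spark API.
def cluster_parallelism(num_worker_nodes, executors_per_node,
                        cores_per_executor, partitions_per_core=2):
    num_executors = num_worker_nodes * executors_per_node   # executors are processes on workers
    total_cores = num_executors * cores_per_executor        # concurrent tasks the cluster can run
    suggested_partitions = total_cores * partitions_per_core
    return num_executors, total_cores, suggested_partitions
```

For example, 3 worker nodes, each hosting 2 executors with 4 cores apiece, gives 6 executors, 24 concurrent tasks, and a suggested 48 partitions.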

Default Partitioning Scheme in Spark

You have to distinguish between two different things: partitioning as distributing data between partitions depending on the value of the key, which is limited to pairwise RDDs (RDD[(T, U)]) and creates a relationship between a partition and the set of keys that can be found on it; and partitioning as splitting input into multiple …
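The first kind (key-based partitioning) can be sketched in plain Python. This mirrors the idea behind a hash partitioner, where equal keys always land on the same partition; it is not Spark's implementation (Spark's hashing differs from Python's built-in `hash`), and the function name is mine.

```python
# Sketch of key-based partitioning for pair data: each (key, value)
# goes to a partition determined solely by its key, so equal keys
# always land together. Mirrors the idea of a hash partitioner;
# not Spark's implementation.
def hash_partition(pairs, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions
```

This key-to-partition relationship is what enables shuffle-free joins and groupBys later; the second kind of partitioning (splitting input) gives no such guarantee.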

Efficient way to divide a list into lists of n size

You’ll want to do something that makes use of List.subList(int, int) views rather than copying each sublist. To do this really easily, use Guava’s Lists.partition(List, int) method:

    List<Foo> foos = …
    for (List<Foo> partition : Lists.partition(foos, n)) {
        // do something with partition
    }

Note that this, like many things, isn’t very efficient with a …
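The same chunking interface can be sketched in Python. Note the caveat: Guava's Lists.partition returns views over the original list, while Python slices copy, so this generator mirrors the interface rather than the zero-copy behavior.

```python
# Sketch of Lists.partition-style chunking: yield successive sublists
# of size n; the final chunk may be shorter. Unlike Guava's views,
# Python slices copy their elements.
def partition(lst, n):
    if n <= 0:
        raise ValueError("partition size must be positive")
    for start in range(0, len(lst), n):
        yield lst[start:start + n]
```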

Avoid performance impact of a single partition mode in Spark window functions

In practice the performance impact will be almost the same as if you omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally, and iterated sequentially one by one. The only difference is in the total number of partitions created. Let’s illustrate that with an example using a simple dataset …
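What happens under the hood can be sketched in plain Python: rows are grouped by the partition key, each group is sorted by the ordering key, then scanned sequentially to compute the window value. With a constant key (the no-partitionBy case) there is a single group, i.e. all the work lands on one partition. Illustrative only; all names are mine, not Spark code.

```python
from itertools import groupby
from operator import itemgetter

# Sketch of a window aggregation (running sum ordered by "ts") computed
# the way Spark does it: group rows by the partition key, sort each
# group, scan sequentially. A constant key collapses everything into a
# single group, i.e. the single-partition case. Not Spark code.
def running_sum(rows, key_fn, order_key="ts", value_key="v"):
    out = []
    for _, group in groupby(sorted(rows, key=key_fn), key=key_fn):
        total = 0
        for row in sorted(group, key=itemgetter(order_key)):
            total += row[value_key]
            out.append({**row, "running": total})
    return out
```

With `key_fn=itemgetter("k")` each key's rows are processed independently (parallelizable); with `key_fn=lambda r: 0` every row flows through one sequential scan.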

How to optimize partitioning when migrating data from JDBC source?

Determine how many partitions you need given the amount of input data and your cluster resources. As a rule of thumb it is better to keep partition input under 1GB unless strictly necessary, and strictly smaller than the block size limit. You’ve previously stated that you migrate 1TB of data; values you use in different …
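The rule of thumb translates into simple arithmetic. A hedged sketch (the function name and the 1 GB default are illustrative, not a Spark API):

```python
import math

# Sketch of the rule of thumb: the number of partitions needed so that
# each partition's input stays under a target size (1 GB by default).
# Names are illustrative, not part of any Spark API.
def suggested_partitions(total_bytes, target_partition_bytes=1 << 30):
    return max(1, math.ceil(total_bytes / target_partition_bytes))
```

For 1 TB of input at a 1 GB ceiling this gives 1024 partitions; you would then round up further if the block size limit or your core count suggests more.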

LINQ Partition List into Lists of 8 members [duplicate]

Use the following extension method to break the input into subsets:

    public static class IEnumerableExtensions {
        public static IEnumerable<List<T>> InSetsOf<T>(this IEnumerable<T> source, int max) {
            List<T> toReturn = new List<T>(max);
            foreach (var item in source) {
                toReturn.Add(item);
                if (toReturn.Count == max) {
                    yield return toReturn;
                    toReturn = new List<T>(max);
                }
            }
            if (toReturn.Any()) {
                yield return …