How to control partition size in Spark SQL
Spark < 2.0: You can use the Hadoop configuration options mapred.min.split.size and mapred.max.split.size, as well as the HDFS block size, to control partition size for filesystem-based formats*.

```scala
val minSplit: Int = ???
val maxSplit: Int = ???

sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)
```

Spark 2.0+: You can use the spark.sql.files.maxPartitionBytes configuration:

```scala
spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)
```

In both cases these values may not …
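For context, here is a minimal runnable sketch of the Spark 2.0+ approach. The app name, the local master, and the input path data/events.parquet are hypothetical; the point is only that lowering spark.sql.files.maxPartitionBytes before a file-based read tends to yield more, smaller input partitions:

```scala
import org.apache.spark.sql.SparkSession

object PartitionSizeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-size-demo") // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // Cap each input partition at 32 MB (the default is 128 MB).
    spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)

    // Hypothetical input path; this applies to file-based sources
    // such as Parquet, ORC, CSV, and JSON.
    val df = spark.read.parquet("data/events.parquet")

    // With a smaller cap, the same input is split into more partitions.
    println(s"Number of partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Note that the setting is read at scan time, so it must be set before the read; it has no effect on partitions produced by shuffles, which are governed by spark.sql.shuffle.partitions instead.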