As a rule of thumb, you generally want your RDD to have as many partitions as the number of executors times the number of cores per executor times 3 (or perhaps 4). Of course, this is only a heuristic; the right number really depends on your application, dataset, and cluster configuration.
In : data = sc.textFile(file)
In : total_cores = int(sc._conf.get('spark.executor.instances')) * int(sc._conf.get('spark.executor.cores'))
In : data = data.coalesce(total_cores * 3)
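One caveat worth noting: `coalesce` can only *reduce* the number of partitions; if the target is larger than the current count it is effectively a no-op, and you need `repartition` (which performs a full shuffle) to grow it. A minimal sketch of the arithmetic behind the heuristic, with hypothetical configuration values (not taken from the source):

```python
# Hypothetical cluster configuration for illustration only.
executor_instances = 4   # would come from spark.executor.instances
executor_cores = 5       # would come from spark.executor.cores
parallelism_factor = 3   # the "3 (or maybe 4)" multiplier from the rule of thumb

total_cores = executor_instances * executor_cores
target_partitions = total_cores * parallelism_factor  # 4 * 5 * 3 = 60

# With a live SparkContext you would then do something like (sketch):
#   data = data.repartition(target_partitions)  # use repartition to increase;
#                                               # coalesce only decreases
print(target_partitions)
```

Also note that `sc._conf.get(...)` returns `None` when a key is unset (e.g. under dynamic allocation, `spark.executor.instances` may not be configured), so the `int(...)` casts above would raise; guard for that in real code.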