apache-spark – Rule of Thumb about number of partitions

As a rule of thumb, you want your RDD to have roughly as many partitions as the number of executors times the number of cores per executor, times 3 (or maybe 4). Of course, that's only a heuristic: the right value depends on your application, dataset and cluster configuration.

Example:

In [1]: data = sc.textFile(file)  # `file` is the path to your input data

In [2]: total_cores = int(sc._conf.get('spark.executor.instances')) * int(sc._conf.get('spark.executor.cores'))

In [3]: data = data.coalesce(total_cores * 3)

Note that coalesce() can only reduce the number of partitions; to increase the count beyond what textFile() produced, use repartition(), which performs a full shuffle. Also, sc._conf is an internal attribute; sc.getConf().get(...) reads the same settings through the public API.
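To make the arithmetic behind the heuristic concrete, here is a minimal sketch using hypothetical cluster numbers (5 executors and 4 cores per executor are assumptions, not values from the original post):

```python
# Hypothetical cluster configuration (illustrative values only)
num_executors = 5
cores_per_executor = 4
partition_factor = 3  # the "3 (or maybe 4)" multiplier from the rule of thumb

# Total number of tasks the cluster can run concurrently
total_cores = num_executors * cores_per_executor

# Target partition count per the heuristic
target_partitions = total_cores * partition_factor

print(target_partitions)  # 60
```

With 20 concurrent task slots and a factor of 3, each core processes about three partitions per stage, which helps even out skew between tasks without creating excessive scheduling overhead.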
