Setting the default number of partitions in Spark: spark.sql.shuffle.partitions and spark.default.parallelism

When a shuffle occurs that does not inherit the partition count from the parent object, the number of partitions of the result changes. These two parameters control how many partitions the object returned by such a shuffle has.

From hands-on testing, the conclusions are:

spark.sql.shuffle.partitions applies to DataFrames: for val df2 = df1.<shuffle operator> (e.g. df1.orderBy(...)), the number of partitions of df2 is the value of this parameter.
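
A minimal sketch of the DataFrame case, assuming a spark-shell session with a SparkSession named spark (note that with adaptive query execution enabled in Spark 3.x, the post-shuffle partitions may be coalesced to fewer than the configured value):

spark.conf.set("spark.sql.shuffle.partitions", "500")
val df1 = spark.range(0, 1000000)      // input partitioning depends on the data source
val df2 = df1.orderBy("id")            // orderBy triggers a shuffle
println(df2.rdd.getNumPartitions)      // expected: 500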

spark.default.parallelism applies to RDDs: for val rdd2 = rdd1.<shuffle operator> (e.g. rdd1.reduceByKey(...)), the number of partitions of rdd2 is the value of this parameter.
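
A corresponding sketch for the RDD case, assuming spark.default.parallelism was set to 500 when the application was submitted (the value is read when the SparkContext is created) and a hypothetical input path:

val rdd1 = sc.textFile("/path/to/input")                    // hypothetical path
val rdd2 = rdd1.map(word => (word, 1)).reduceByKey(_ + _)   // reduceByKey triggers a shuffle
println(rdd2.getNumPartitions)                              // expected: 500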

How do you check whether an operation involves a shuffle? Make good use of the RDD's toDebugString method; see the post on shuffle operators in Spark for details.

For a DataFrame, you can also call df.rdd.toDebugString to check whether a shuffle occurs, as in the sketch below.
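
A small sketch of the check; a shuffle shows up as a new indentation level (e.g. a ShuffledRDD) in the lineage printed by toDebugString:

val rdd2 = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).reduceByKey(_ + _)
println(rdd2.toDebugString)      // the ShuffledRDD line marks the shuffle boundary

val df2 = spark.range(100).groupBy("id").count()
println(df2.rdd.toDebugString)   // same check through the DataFrame's underlying RDD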

How to change these values:

Set them in code:

sqlContext.setConf("spark.sql.shuffle.partitions", "500")
sqlContext.setConf("spark.default.parallelism", "500")

Set them when submitting the job:

./bin/spark-submit --conf spark.sql.shuffle.partitions=500 --conf spark.default.parallelism=500
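
A fuller command line for context, with a hypothetical main class and application jar added only for illustration:

./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --conf spark.sql.shuffle.partitions=500 \
  --conf spark.default.parallelism=500 \
  my-app.jar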

Official description and default values:

spark.default.parallelism
Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
  • Local mode: number of cores on the local machine
  • Mesos fine-grained mode: 8
  • Others: total number of cores on all executor nodes or 2, whichever is larger
Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

spark.sql.shuffle.partitions
Default: 200
Meaning: Configures the number of partitions to use when shuffling data for joins or aggregations.
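
To see the values currently in effect, a quick check from spark-shell (using the built-in spark and sc handles):

println(spark.conf.get("spark.sql.shuffle.partitions"))   // 200 unless overridden
println(sc.defaultParallelism)                            // default parallelism used for RDD operations such as parallelize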

 

What about shuffles that do follow the parent's partition count? For example, a DataFrame join: in my tests the number of partitions returned by df1.join(df2) was determined by df1.
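
A quick way to verify this on your own data, assuming two DataFrames df1 and df2 that share a join column named id (the observed count can also depend on which join strategy Spark picks):

val joined = df1.join(df2, Seq("id"))
println(df1.rdd.getNumPartitions)      // partition count of the left side
println(joined.rdd.getNumPartitions)   // partition count of the join result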




References:

https://spark.apache.org/docs/2.1.0/configuration.html

https://spark.apache.org/docs/latest/sql-performance-tuning.html

https://www.jianshu.com/p/7442deb21ae0

https://stackoverflow.com/questions/45704156/what-is-the-difference-between-spark-sql-shuffle-partitions-and-spark-default-pa 

