Data Distribution

https://yq.aliyun.com/articles/57822
https://segmentfault.com/a/1190000022005788
http://www.dbdream.com.cn/2016/01/greenplum%E6%95%B0%E6%8D%AE%E5%BA%93%E5%88%9B%E5%BB%BA%E8%A1%A8%E5%8F%8A%E8%A1%A8%E7%9A%84%E6%95%B0%E6%8D%AE%E5%88%86%E5%B8%83/

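Because Greenplum is a distributed database, table data is spread across all segment nodes. How is that distribution controlled? This section describes Greenplum's data distribution policies.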

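Policy	Hash distribution	Random distribution	Replicated distribution
Supported versions	GP5, GP6	GP5, GP6	GP6
Clause	DISTRIBUTED BY (column, [ … ])	DISTRIBUTED RANDOMLY	DISTRIBUTED REPLICATED
Default policy	✔	✘	✘
Each row stored on	1 segment	1 segment	all N segments
Even distribution	Depends on the distribution key	✔	✔
Query performance	✔	✘	-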

Hash Distribution

Create the table with the "DISTRIBUTED BY (column, [ … ])" clause, which names one column, or a combination of columns, as the distribution key.

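The hash algorithm assigns each row to a specific segment based on its distribution key: Greenplum computes the hash of the key, and rows with the same hash value land on the same segment. Choosing a unique distribution key (for example, the primary key) gives a more even data distribution. It also benefits joins: when hash-distributed tables are joined on their distribution keys, matching rows are guaranteed to be on the same segment, so each segment can perform the join locally and send its result to the master node, which merges the partial results and returns the final result to the client.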
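The co-located join described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: NUM_SEGMENTS, segment_for, distribute and colocated_join are hypothetical names, and zlib.crc32 is a stand-in hash, not the hash function Greenplum actually uses.

```python
import zlib

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment index.

    crc32 is a stand-in for illustration; it is not Greenplum's hash.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_index, num_segments=NUM_SEGMENTS):
    """Place each row on the segment chosen by hashing its key column."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_index], num_segments)].append(row)
    return segments


def colocated_join(left_segments, right_segments, left_key, right_key):
    """Join segment by segment, then concatenate the partial results,
    mimicking the master gathering what each segment produced locally."""
    result = []
    for left_rows, right_rows in zip(left_segments, right_segments):
        for left in left_rows:
            for right in right_rows:
                if left[left_key] == right[right_key]:
                    result.append(left + right)
    return result


orders = [(1, "book"), (2, "pen"), (3, "ink"), (1, "lamp")]  # (customer_id, item)
customers = [(1, "Ann"), (2, "Bob"), (3, "Cid")]             # (id, name)

# Both tables are distributed on the join column, so no row has to move.
joined = colocated_join(distribute(orders, 0), distribute(customers, 0), 0, 0)
print(sorted(joined))
```

Because both tables hash the same key with the same function, every order sits on the same segment as its customer, so the per-segment joins together find every match that a single-node join would find.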

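If CREATE TABLE has no DISTRIBUTED clause, Greenplum uses the PRIMARY KEY (if the table has one) or the first eligible column as the distribution key. Which columns are eligible? Columns of geometric types or user-defined data types cannot be used as a Greenplum distribution key. If the table has no eligible column, it falls back to random distribution.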

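However, when the DISTRIBUTED clause is omitted, the policy Greenplum finally chooses also depends on other factors, such as the GUC gp_create_table_random_default_distribution and the optimizer in use at the time. For this reason, always specify a DISTRIBUTED BY clause in CREATE TABLE.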

When gp_create_table_random_default_distribution is off, the default distribution policy for a table is hash distribution; when it is on, the default is random distribution.

[gpadmin@server100 ~]$ gpconfig -s gp_create_table_random_default_distribution
Values on all segments are consistent
GUC          : gp_create_table_random_default_distribution
Master  value: off
Segment value: off

With hash distribution as the default policy, a table created without a distribution key uses its first column as the distribution key.

testdb=# show gp_create_table_random_default_distribution;
 gp_create_table_random_default_distribution
---------------------------------------------
 off
(1 row)

testdb=# CREATE TABLE t_hash (name varchar(10), id int);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'name' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE

testdb=# \d+ t_hash
                               Table "public.t_hash"
 Column |         Type          | Modifiers | Storage  | Stats target | Description
--------+-----------------------+-----------+----------+--------------+-------------
 name   | character varying(10) |           | extended |              |
 id     | integer               |           | plain    |              |
Distributed by: (name)

If the table has a primary key, the primary key is used as the distribution key by default.

testdb=# CREATE TABLE t_hash_1 (name varchar(10), id int primary key);
CREATE TABLE

testdb=# \d+ t_hash_1
                              Table "public.t_hash_1"
 Column |         Type          | Modifiers | Storage  | Stats target | Description
--------+-----------------------+-----------+----------+--------------+-------------
 name   | character varying(10) |           | extended |              |
 id     | integer               | not null  | plain    |              |
Indexes:
    "t_hash_1_pkey" PRIMARY KEY, btree (id)
Distributed by: (id)

If the table has a unique constraint but no primary key, the unique column is chosen as the distribution key by default.

testdb=# CREATE TABLE t_hash_2 (name varchar(10), id int unique);
CREATE TABLE

testdb=# \d+ t_hash_2
                              Table "public.t_hash_2"
 Column |         Type          | Modifiers | Storage  | Stats target | Description
--------+-----------------------+-----------+----------+--------------+-------------
 name   | character varying(10) |           | extended |              |
 id     | integer               |           | plain    |              |
Indexes:
    "t_hash_2_id_key" UNIQUE CONSTRAINT, btree (id)
Distributed by: (id)
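The default-key rules demonstrated so far can be summarized as a small decision function. This is a simplified model of what the text describes for gp_create_table_random_default_distribution = off, not Greenplum's actual implementation; default_distribution_key and GEOMETRIC_TYPES are hypothetical names, and user-defined types are not modeled.

```python
# PostgreSQL geometric types; the text says such columns cannot be distribution keys.
GEOMETRIC_TYPES = {"point", "line", "lseg", "box", "path", "polygon", "circle"}


def default_distribution_key(columns, primary_key=None, unique=None):
    """Return the default distribution key as a list of column names,
    or None when the table falls back to random distribution.

    columns: list of (name, type) pairs in declaration order.
    Assumes gp_create_table_random_default_distribution = off.
    """
    if primary_key:                       # a primary key wins
        return list(primary_key)
    if unique:                            # otherwise a unique constraint
        return list(unique)
    for name, col_type in columns:        # otherwise the first eligible column
        if col_type not in GEOMETRIC_TYPES:
            return [name]
    return None                           # no eligible column: random distribution


cols = [("name", "varchar"), ("id", "int")]
print(default_distribution_key(cols))                      # ['name'], like t_hash
print(default_distribution_key(cols, primary_key=["id"]))  # ['id'], like t_hash_1
print(default_distribution_key(cols, unique=["id"]))       # ['id'], like t_hash_2
print(default_distribution_key([("loc", "point")]))        # None
```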

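If a table defines both a primary key and a unique constraint that have no column in common, the statement fails. When a table has several PRIMARY KEY / UNIQUE constraints, they must share at least one column.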

testdb=# create table t_hash_3 (name varchar(10) unique, id int primary key);
ERROR:  UNIQUE or PRIMARY KEY definitions are incompatible with each other
HINT:  When there are multiple PRIMARY KEY / UNIQUE constraints, they must have at least one column in common.

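The distribution key named in the DISTRIBUTED BY clause must be a subset of the primary key. If the key specified at table creation is not a subset of the primary key, the statement fails.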

testdb=# create table t_hash_3 (name varchar(10), id int primary key) distributed by(name);
ERROR:  PRIMARY KEY and DISTRIBUTED BY definitions are incompatible
HINT:  When there is both a PRIMARY KEY and a DISTRIBUTED BY clause, the DISTRIBUTED BY clause must be a subset of the PRIMARY KEY.
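The two compatibility rules shown in the error messages above can be expressed as set checks. The following is a minimal sketch, assuming a hypothetical helper check_table_definition that returns the error text from the transcripts; it models only the two cases shown here and is not how Greenplum validates a table definition internally.

```python
def check_table_definition(primary_key=None, unique=None, distributed_by=None):
    """Return None if the definition passes both rules, else an error message.

    Each argument is a list of column names, or None when the clause is absent.
    """
    # Rule 1: PRIMARY KEY and UNIQUE constraints must share at least one column.
    if primary_key and unique and not set(primary_key) & set(unique):
        return "UNIQUE or PRIMARY KEY definitions are incompatible with each other"
    # Rule 2: DISTRIBUTED BY must be a subset of the PRIMARY KEY.
    if primary_key and distributed_by and not set(distributed_by) <= set(primary_key):
        return "PRIMARY KEY and DISTRIBUTED BY definitions are incompatible"
    return None


# The two failing statements from the transcripts above:
print(check_table_definition(primary_key=["id"], unique=["name"]))
print(check_table_definition(primary_key=["id"], distributed_by=["name"]))
# A composite primary key with a subset as the distribution key passes:
print(check_table_definition(primary_key=["id", "name"], distributed_by=["id"]))
```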

Random Distribution

Create the table with the "DISTRIBUTED RANDOMLY" clause.

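Random distribution sends rows to the segments in round-robin fashion, in the order they arrive. Unlike hash distribution, rows with the same value do not necessarily end up on the same segment. As a result, joining randomly distributed tables requires redistributing the data across segments before the join can run, and the extra network transfer and computation make such queries slow. Although random distribution guarantees an even spread of data, prefer hash distribution whenever possible because it performs better. Random distribution is not recommended for tables that take part in multi-table joins.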
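A short Python sketch of the round-robin placement described above shows both effects: segment sizes stay balanced, while rows with the same value are scattered. The function name distribute_round_robin and the sample rows are hypothetical, and the sketch follows this text's description rather than Greenplum's internal code.

```python
from itertools import cycle


def distribute_round_robin(rows, num_segments):
    """Send rows to segments one after another, in arrival order."""
    segments = [[] for _ in range(num_segments)]
    targets = cycle(range(num_segments))
    for row in rows:
        segments[next(targets)].append(row)
    return segments


rows = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5), ("b", 6)]
segments = distribute_round_robin(rows, 3)

# Every segment receives the same number of rows ...
sizes = [len(seg) for seg in segments]
print(sizes)  # [2, 2, 2]

# ... but rows with name "a" are spread over all three segments, so a join
# on name cannot be answered locally without moving data first.
segments_holding_a = [i for i, seg in enumerate(segments)
                      if any(name == "a" for name, _ in seg)]
print(segments_holding_a)  # [0, 1, 2]
```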

With gp_create_table_random_default_distribution set to on, creating table t_random without specifying a distribution policy gives it random distribution.

testdb=# set gp_create_table_random_default_distribution=on;
SET
testdb=# show gp_create_table_random_default_distribution;
 gp_create_table_random_default_distribution
---------------------------------------------
 on
(1 row)

testdb=# CREATE TABLE t_random (name varchar(10), id int);
NOTICE:  using default RANDOM distribution since no distribution was specified
HINT:  Consider including the 'DISTRIBUTED BY' clause to determine the distribution of rows.
CREATE TABLE
testdb=# \d t_random
          Table "public.t_random"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 name   | character varying(10) |
 id     | integer               |
Distributed randomly

Replicated Distribution

Create the table with the "DISTRIBUTED REPLICATED" clause.

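Greenplum stores every row of the table on every segment. Under this policy the table data is evenly distributed, because every segment holds the same rows. Replicated distribution is needed when user-defined functions executed on the segments must access all rows of the table. It can also improve performance when a large table is joined with a small one and the sufficiently small table is declared as replicated.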
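To illustrate why a replicated small table helps a large-to-small join, the sketch below gives every segment a full copy of the small table and joins locally. All names (replicate, join_locally, regions, sales) are hypothetical, and the way the large table is split across segments is arbitrary; this is a model of the idea, not of Greenplum's executor.

```python
def replicate(rows, num_segments):
    """Give every segment its own full copy of the table."""
    return [list(rows) for _ in range(num_segments)]


def join_locally(fact_segments, dim_segments):
    """Join each segment's large-table rows with that segment's copy of the small table."""
    result = []
    for fact_rows, dim_rows in zip(fact_segments, dim_segments):
        lookup = {dim_id: name for dim_id, name in dim_rows}
        for dim_id, amount in fact_rows:
            if dim_id in lookup:
                result.append((dim_id, amount, lookup[dim_id]))
    return result


NUM_SEGMENTS = 3                                         # hypothetical cluster size
regions = [(1, "north"), (2, "south")]                   # small table: (id, name)
sales = [(1, 10), (2, 20), (1, 30), (2, 40), (1, 50)]    # large table: (region_id, amount)

# The large table may be spread in any way; here every third row per segment.
sales_segments = [sales[i::NUM_SEGMENTS] for i in range(NUM_SEGMENTS)]

joined = join_locally(sales_segments, replicate(regions, NUM_SEGMENTS))
print(sorted(joined))
```

However the large table happens to be distributed, each of its rows finds its match in the local copy of the small table, so no rows need to be moved between segments for the join.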

Partitioning Strategy

Original article: https://blog.csdn.net/MasterLeon/article/details/105945175