hive sql系列（七）——查询前20%时间的订单信息

每天分享一个sql，帮助大家找到sql的快乐

需求

查询前20%时间的订单信息

建表语句

create table business(
    name string,
    orderdate string,
    cost int
)
row format delimited fields terminated by '\t'
;

数据

insert overwrite table business values
("jack","2017-01-01",10),
("tony","2017-01-02",15),
("jack","2017-02-03",23),
("tony","2017-01-04",29),
("jack","2017-01-05",46),
("jack","2017-04-06",42),
("tony","2017-01-07",50),
("jack","2017-01-08",55),
("mart","2017-04-08",62),
("mart","2017-04-09",68),
("neil","2017-05-10",12),
("mart","2017-04-11",75),
("neil","2017-06-12",80),
("mart","2017-04-13",94);

实现

select
    name,
    orderdate,
    cost,
    ntile(5) over(order by orderdate) n
from
    business
;

结果

Total MapReduce CPU Time Spent: 3 seconds 190 msec
OK
t1.name	t1.orderdate	t1.cost	t1.n
jack	2017-01-01	10	1
tony	2017-01-02	15	1
tony	2017-01-04	29	1
Time taken: 14.82 seconds, Fetched: 3 row(s)

分析

1、查询前20%时间的订单信息，没有特指某个用户，所以不用指定分区
2、开窗里面对时间排序
3、ntile进行分桶，按照时间顺序分成5份，因为这里说的是20%
4、先用子查询分桶，然后再进行过滤即n=1

扩展

如果不进行排序，则按读取的数据是默认有序
select
    name,
    orderdate,
    cost,
    ntile(5) over() n
from
    business
;

Total MapReduce CPU Time Spent: 2 seconds 260 msec
OK
name	orderdate	cost	n
mart	2017-04-13	94	1
neil	2017-06-12	80	1
mart	2017-04-11	75	1
neil	2017-05-10	12	2
mart	2017-04-09	68	2
mart	2017-04-08	62	2
jack	2017-01-08	55	3
tony	2017-01-07	50	3
jack	2017-04-06	42	3
jack	2017-01-05	46	4
tony	2017-01-04	29	4
jack	2017-02-03	23	4
tony	2017-01-02	15	5
jack	2017-01-01	10	5
Time taken: 14.127 seconds, Fetched: 14 row(s)

知识点

ntile：把有序的数据集合平均分配到指定的数据量个桶中，将桶号分配给每一行。如果不能平均分配，则优先分配较小编号的桶，并且各个桶中能放的行数最多相差1。

原文链接：https://blog.csdn.net/luo981695830/article/details/115384874