数仓基础和网站流量分析

数仓的设计：（写sql）

维度建模的基本概念：
**维度表：**时间的维度：昨天地点：星巴克金钱的维度：两百块 维度表看到的事情比较狭窄，仅仅从某一个方面来看，只能看得到某一块的东西
事实表：没发生的东西，一定不是事实，事实一定是建立在已经发生过的事情上面

例子：昨天我去星巴克喝了一杯咖啡，花了两百块

维度建模(dimensional modeling)是专门用于分析型数据库、数据仓库、数据集市建模的方法。数据集市可以理解为是一种"小型数据仓库"。
维度表(dimension)
**维度表示你要对数据进行分析时所用的一个量,**比如你要分析产品销售情况, 你可以选择按类别来进行分析,或按区域来分析。这样的按…分析就构成一个维度。再比如"昨天下午我在星巴克花费200元喝了一杯卡布奇诺"。那么以消费为主题进行分析，可从这段信息中提取三个维度：时间维度(昨天下午)，地点维度(星巴克), 商品维度(卡布奇诺)。通常来说维度表信息比较固定，且数据量小。
事实表(fact table)
**表示对分析主题的度量。**事实表包含了与各维度表相关联的外键，并通过JOIN方式与维度表关联。事实表的度量通常是数值类型，且记录数会不断增加，表规模迅速增长。比如上面的消费例子，它的消费事实表结构示例如下：
消费事实表：Prod_id(引用商品维度表), TimeKey(引用时间维度表), Place_id(引用地点维度表), Unit(销售量)。
总的说来，在数据仓库中不需要严格遵守规范化设计原则。因为数据仓库的主导功能就是面向分析，以查询为主，不涉及数据更新操作。事实表的设计是以能够正确记录历史信息为准则，维度表的设计是以能够以合适的角度来聚合主题内容为准则。

维度建模三种模式

星型模式
星形模式(Star Schema)是最常用的维度建模方式。星型模式是以事实表为中心，所有的维度表直接连接在事实表上，像星星一样。
星形模式的维度建模由一个事实表和一组维表成，且具有以下特点：
a. 维表只和事实表关联，维表之间没有关联；
b. 每个维表主键为单列，且该主键放置在事实表中，作为两边连接的外键；
c. 以事实表为核心，维表围绕核心呈星形分布；
在这里插入图片描述
雪花模式
雪花模式(Snowflake Schema)是对星形模式的扩展。雪花模式的维度表可以拥有其他维度表的，虽然这种模型相比星型更规范一些，但是由于这种模型不太容易理解，维护成本比较高，而且性能方面需要关联多层维表，性能也比星型模型要低。所以一般不是很常用。
在这里插入图片描述
星座模式
星座模式是星型模式延伸而来，星型模式是基于一张事实表的，而星座模式是基于多张事实表的，而且共享维度信息。
前面介绍的两种维度建模方法都是多维表对应单事实表，但在很多时候维度空间内的事实表不止一个，而一个维表也可能被多个事实表用到。在业务发展后期，绝大部分维度建模都采用的是星座模式。
在这里插入图片描述

本项目中数据仓库的设计（采用星型模型）

事实表设计
原始数据表: ods_weblog_origin =>对应mr清洗完之后的数据
valid string 是否有效
remote_addr string 访客ip
remote_user string 访客用户信息
time_local string 请求时间
request string 请求url
status string 响应码
body_bytes_sent string 响应字节数
http_referer string 来源url
http_user_agent string 访客终端信息

求统计 15:00:00 16:00:00访问了多少个页面
select count(1) from ods_weblog_origin where time_local >= 15:00:00 and time_local <= 16:00:00
union all
select count(1) from ods_weblog_origin where time_local >= 16:00:00 and time_local <= 17:00:00

第一步：按照小时进行分组 15 16 17
第二步：分组之后，统计每组里面有多少天记录
select count(1) from ods_weblog_origin group by hour

为了方便我们的统计，将我们的日期字段给拆成这样的几个字段，修改之后的表：
访问日志明细宽表：dw_weblog_detail
valid string 是否有效
remote_addr string 访客ip
remote_user string 访客用户信息
time_local string 请求完整时间
daystr string 访问日期
timestr string 访问时间
month string 访问月
day string 访问日
hour string 访问时
request string 请求url整串
status string 响应码
body_bytes_sent string 响应字节数
http_referer string 来源url
ref_host string 来源的host
ref_path string 来源的路径
ref_query string 来源参数query
ref_query_id string 来源参数query值
http_user_agent string 客户终端标识

http_referer http://www.baidu.com/hello.action?username=zhangsan http://www.google.com?address=北京 http://www.sougou.com?money=50
ref_host www.baidu.com
ref_path /hello.action
ref_query username
ref_query_id zhangsan

数据仓库允许冗余

维度表设计（可建可不建）
在这里插入图片描述

模块开发----ETL

**ETL工作的实质就是从各个数据源提取数据，对数据进行转换，并最终加载填充数据到数据仓库维度建模后的表中。**只有当这些维度/事实表被填充好，ETL工作才算完成。
本项目的数据分析过程在hadoop集群上实现，主要应用hive数据仓库工具，因此，采集并经过预处理后的数据，需要加载到hive数据仓库中，以进行后续的分析过程。

原始数据表：对应mr清洗完之后的数据，而不是原始日志数据
drop table if exists ods_weblog_origin;
create table ods_weblog_origin(
valid string,
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
http_referer string,
http_user_agent string)
partitioned by (datestr string)
row format delimited
fields terminated by ‘\001’;

拆开我们的httprefer
later view的使用
create table t_ods_tmp_referurl as
SELECT a.,b.
FROM ods_weblog_origin a
LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, “”", “”), ‘HOST’, ‘PATH’,‘QUERY’, ‘QUERY:id’)
b as host, path, query, query_id;

拆开我们的时间字符串
2013-09-18 06:49:18
create table t_ods_tmp_detail as
select b.*,substring(time_local,0,10) as daystr, 2013-09-18
substring(time_local,12) as tmstr, 06:49:18
substring(time_local,6,2) as month, 09
substring(time_local,9,2) as day, 18
substring(time_local,11,3) as hour 06
From t_ods_tmp_referurl b;

统计分析指标

流量分析

1、统计每小时的pvs pageView的访问量
select month,day,hour,count(1) from t_ods_weblog_detail group by month,day,hour;

插入数据
create table if not exists dw_pvs_everyhour_oneday(month string,day string,hour string,pvs bigint) partitioned by(datestr string);

insert into table dw_pvs_everyhour_oneday partition(datestr=‘20130918’)
select a.month as month,a.day as day,a.hour as hour,count(*) as pvs from ods_weblog_detail a where a.datestr=‘20130918’ group by a.month,a.day,a.hour;

2.统计每小时各来访url产生的pv量，查询结果存入：( “dw_pvs_referer_everyhour” )
drop table if exists dw_pvs_referer_everyhour;
create table if not exists dw_pvs_referer_everyhour
(referer_url string,referer_host string,month string,day string,
hour string,pv_referer_cnt bigint) partitioned by(datestr string);

insert into table dw_pvs_referer_everyhour partition(datestr=‘20130918’)
select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt
from ods_weblog_detail
group by http_referer,ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,pv_referer_cnt desc;

3.统计每小时各来访host的产生的pv数并排序
create table dw_pvs_refererhost_everyhour(ref_host string,month string,day string,hour string,ref_host_cnts bigint) partitioned by(datestr string);

insert into table dw_pvs_refererhost_everyhour partition(datestr=‘20130918’)
select ref_host,month,day,hour,count(1) as ref_host_cnts
from ods_weblog_detail
group by ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,ref_host_cnts desc;

4.统计pv总量最大的来源TOPN
https://www.cnblogs.com/wulin/p/6051768.html
hive当中的窗口函数，适用于解决分组求topN的情况
（使用hive的开窗函数可以解决标号问题：rank over dese_rank over row_number over）
在这里插入图片描述
partition by后面只能跟一个字段

在这里插入图片描述
rank over 如果有相同的数据，标号一样标号根据数据的条数，往上涨
dense_rank over 如果有相同的数，标号一样，标号顺序网上涨
row_number over 如果有相同的数据，标号顺序往上涨

–需求：按照时间维度，统计一天内各小时产生最多pvs的来源topN
分组求topN，先分组，再求每组内的topN
–row_number函数
select ref_host,ref_host_cnts,concat(month,day,hour),
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
from dw_pvs_refererhost_everyhour;

（concat是字符串拼接函数，substring（9,0）是字符串截取函数）

create table dw_pvs_refhost_topn_everyhour(
hour string,
toporder string,
ref_host string,
ref_host_cnts string
)partitioned by(datestr string);

– 每一个子查询语句，都是可以独立运行的
insert into table dw_pvs_refhost_topn_everyhour partition(datestr=‘20130918’)
select t.hour,t.od,t.ref_host,t.ref_host_cnts from
(select ref_host,ref_host_cnts,concat(month,day,hour) as hour,
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
from dw_pvs_refererhost_everyhour) t where od<=3;

5.人均浏览页数
计算平均一个人看了多少个页面
第一步：求多少个人用ip来代表我们每一个人人数需要去重 1027
select count(distinct(remote_addr)) from ods_weblog_detail

优化：
select count(1) from (
select count(1) from ods_weblog_detail group by remote_addr
) temp_table

浏览的页数：ods_weblog_detail 这个表有一条算一条 13770

select ‘20130918’,sum(b.pvs)/count(b.remote_addr)
from
(
select remote_addr,count(1) as pvs from ods_weblog_detail where datestr=‘20130918’
group by remote_addr
) b;

受访分析

网站受到了访问，侧重研究是我们网站的访问的情况
1、各页面PV
每个页面，受到了多少次访问
使用request字段来代表我们访问的页面

select
request,count(1)
from ods_weblog_detail
group by request having request is not null ;

select request as request,count(1) as request_counts from
ods_weblog_detail group by request having request is not null order by request_counts desc limit 20;

2.热门页面统计
统计20130918这个分区里面的受访页面的top10

select request, count(1) as total_result
from ods_weblog_detail where datestr=‘20130918’ group by request order by total_result desc limit 10;

insert into table dw_hotpages_everyday
select ‘20130918’,a.request,a.request_counts from
(
select request as request,count(request) as request_counts
from ods_weblog_detail
where datestr=‘20130918’
group by request
having request is not null
) a
order by a.request_counts desc limit 10;

3.统计每日最热门页面的top10
注意如果有很多天的数据，那么就涉及到分组求topN
20130918

select month,day,request,count(1) as total_result
from ods_weblog_detail where datestr = ‘20130918’ group by month,day,request
order by total_result desc limit 10;

select a.month,a.day,a.request ,concat(a.month,a.day),a.total_request
from (
select month,day, request,count(1) as total_request
from ods_weblog_detail
where datestr = ‘20130918’
group by request ,month ,day
having request is not null
order by total_request desc limit 10
) a

4.访客分析：对我们的用户进行分析

（1）
– 独立访客
–需求：按照时间维度来统计独立访客及其产生的pv量
select month,day,hour,remote_addr,
count(1)
from ods_weblog_detail
group by month,day,hour ,remote_addr;

按照天的维度，统计独立访客及其产生的PV量
select month,day,hour,remote_addr,
count(1)
from ods_weblog_detail
group by month,day ,remote_addr;

insert into table dw_user_dstc_ip_h
select remote_addr,count(1) as pvs,concat(month,day,hour) as hour
from ods_weblog_detail
Where datestr=‘20130918’
group by concat(month,day,hour),remote_addr;

（2）新老访客问题
select a.remote_addr ,a.day
from (
select remote_addr,‘20130918’ as day
from ods_weblog_detail newIp
where datestr =‘20130918’
group by remote_addr
) a
left join dw_user_dsct_history hist
on a.remote_addr = hist.ip
where hist.ip is null;

每日新用户插入新访客表
insert into table dw_user_new_d partition(datestr=‘20130918’)
select tmp.day as day,tmp.today_addr as new_ip
from
(
select today.day as day,today.remote_addr as today_addr,old.ip as old_addr
from
(
select distinct remote_addr as remote_addr,“20130918” as day
from ods_weblog_detail where datestr=“20130918”
) today
left outer join
dw_user_dsct_history old
on today.remote_addr=old.ip
) tmp
where tmp.old_addr is null;

–每日新用户追加到历史累计表
insert into table dw_user_dsct_history partition(datestr=‘20130918’)
select day,ip from dw_user_new_d where datestr=‘20130918’;

在这里插入图片描述

使用not in很容易产生笛卡尔积

访客visit分析

分析的是我们一天之内访问了多少次session
1、-- 回头/单次访客统计
单次访客的统计
如果ip访问的次数，大于1 说明在多个session里面都出现了这个ip，说明是回头访客

select remote_addr, count(1) as total_request
from ods_click_stream_visit
group by remote_addr having total_request = 1

查询今日所有回头访客及其访问次数。
insert overwrite table dw_user_returning partition(datestr=‘20130918’)
select tmp.day,tmp.remote_addr,tmp.acc_cnt
from
(select ‘20130918’ as day,remote_addr,count(session) as acc_cnt from ods_click_stream_visit
group by remote_addr) tmp
where tmp.acc_cnt>1;

– 人均访问频次,

平均一个人，访问了多少次
求人数
select count(distinct(remote_addr)) from ods_click_steam_visit
求访问的次数
select count(1) from ods_click_stream_visit/ select count(distinct(remote_addr)) from ods_click_stream_visit
select count(session)/count(distinct remote_addr) from ods_click_stream_visit where datestr=‘20130918’;

– 人均页面浏览量，所有的页面点击次数累加除以所有的独立去重IP总和即可
select sum(pagevisits)/count(distinct (remote_addr))
from ods_click_stream_visit;

原文链接：https://blog.csdn.net/Carina_____/article/details/111600501