3.5、准备工作
3.5.1、创建表
创建表:youtube_ori,youtube_user_ori,
创建表:youtube_orc,youtube_user_orc
youtube_ori:
create table youtube_ori( videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string>) row format delimited fields terminated by "\t" collection items terminated by "&" stored as textfile; |
youtube_user_ori:
create table youtube_user_ori( uploader string, videos int, friends int) clustered by (uploader) into 24 buckets row format delimited fields terminated by "\t" stored as textfile; |
然后把原始数据插入到orc表中
youtube_orc:
create table youtube_orc( videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string>) clustered by (uploader) into 8 buckets row format delimited fields terminated by "\t" collection items terminated by "&" stored as orc; |
youtube_user_orc:
create table youtube_user_orc( uploader string, videos int, friends int) clustered by (uploader) into 24 buckets row format delimited fields terminated by "\t" stored as orc; |
3.5.2、导入ETL后的数据
youtube_ori:
load da |
youtube_user_ori:
load da |
3.5.3、向ORC表插入数据
youtube_orc:
insert into table youtube_orc select * from youtube_ori; |
youtube_user_orc:
insert into table youtube_user_orc select * from youtube_user_ori; |
3.6、业务分析
3.6.1、统计视频观看数Top10
思路:
1) 使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。
最终代码:
select videoId, uploader, age, category, length, views, rate, ratings, comments from youtube_orc order by views desc limit 10; |
3.6.2、统计视频类别热度Top10
思路:
1) 即统计每个类别有多少个视频,显示出包含视频最多的前10个类别。
2) 我们需要按照类别group by聚合,然后count组内的videoId个数即可。
3) 因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行列转行(展开),然后再进行count即可。
4) 最后按照热度排序,显示前10条。
最终代码:
select category_name as category, count(t1.videoId) as hot from ( select videoId, category_name from youtube_orc lateral view explode(category) t_catetory as category_name) t1 group by t1.category_name order by hot desc limit 10; |
3.6.3、统计出视频观看数最高的20个视频的所属类别以及类别包含这Top20视频的个数
思路:
1) 先找到观看数最高的20个视频所属条目的所有信息,降序排列
2) 把这20条信息中的category分裂出来(列转行)
3) 最后查询视频分类名称和该分类下有多少个Top20的视频
最终代码:
select category_name as category, count(t2.videoId) as hot_with_views from ( select videoId, category_name from ( select * from youtube_orc order by views desc limit 20) t1 lateral view explode(category) t_catetory as category_name) t2 group by category_name order by hot_with_views desc; |
3.6.4、统计视频观看数Top50所关联视频的所属类别的热度排名
思路:
1) 查询出观看数最多的前50个视频的所有信息(当然包含了每个视频对应的关联视频),记为临时表t1
t1:观看数前50的视频 select * from youtube_orc order by views desc limit 50; |
2) 将找到的50条视频信息的相关视频relatedId列转行,记为临时表t2
t2:将相关视频的id进行列转行操作 select explode(relatedId) as videoId from t1; |
3) 将相关视频的id和youtube_orc表进行inner join操作
t5:得到两列数据,一列是category,一列是之前查询出来的相关视频id (select distinct(t2.videoId), t3.category from t2 inner join youtube_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name; |
4) 按照视频类别进行分组,统计每组视频个数,然后排行
最终代码:
select category_name as category, count(t5.videoId) as hot from ( select videoId, category_name from ( select distinct(t2.videoId), t3.category from ( select explode(relatedId) as videoId from ( select * from youtube_orc order by views desc limit 50) t1) t2 inner join youtube_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name) t5 group by category_name order by hot desc; |
3.6.5、统计每个类别中的视频热度Top10,以Music为例
思路:
1) 要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开,所以可以创建一张表用于存放categoryId展开的数据。
2) 向category展开的表中插入数据。
3) 统计对应类别(Music)中的视频热度。
最终代码:
创建表类别表:
create table youtube_category( videoId string, uploader string, age int, categoryId string, length int, views int, rate float, ratings int, comments int, relatedId array<string>) row format delimited fields terminated by "\t" collection items terminated by "&" stored as orc; |
向类别表中插入数据:
insert into table youtube_category select videoId, uploader, age, categoryId, length, views, rate, ratings, comments, relatedId from youtube_orc lateral view explode(category) catetory as categoryId; |
统计Music类别的Top10(也可以统计其他)
select videoId, views from youtube_category where categoryId = "Music" order by views desc limit 10; |
3.6.6、统计每个类别中视频流量Top10,以Music为例
思路:
1) 创建视频类别展开表(categoryId列转行后的表)
2) 按照ratings排序即可
最终代码:
select videoId, views, ratings from youtube_category where categoryId = "Music" order by ratings desc limit 10; |
思路:
1) 先找到上传视频最多的10个用户的用户信息
select * from youtube_user_orc order by videos desc limit 10; |
2) 通过uploader字段与youtube_orc表进行join,得到的信息按照views观看次数进行排序即可。
最终代码:
select t2.videoId, t2.views, t2.ratings, t1.videos, t1.friends from ( select * from youtube_user_orc order by videos desc limit 10) t1 join youtube_orc t2 on t1.uploader = t2.uploader order by views desc limit 20; |
3.6.8、统计每个类别视频观看数Top10
思路:
1) 先得到categoryId展开的表数据
2) 子查询按照categoryId进行分区,然后分区内排序,并生成递增数字,该递增数字这一列起名为rank列
3) 通过子查询产生的临时表,查询rank值小于等于10的数据行即可。
最终代码:
select t1.* from ( select videoId, categoryId, views, row_number() over(partition by categoryId order by views desc) rank from youtube_category) t1 where rank <= 10; |
四、可能出现的问题
4.1、JVM堆内存溢出
描述:java.lang.OutOfMemoryError: Java heap space
解决:在yarn-site.xml中加入如下代码
<property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>2048</value> </property> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>2048</value> </property> <property> <name>yarn.nodemanager.vmem-pmem-ratio</name> <value>2.1</value> </property> <property> <name>mapred.child.java.opts</name> <value>-Xmx1024m</value> </property> |