大数据之Hive之扩展项目Youtube案例<三>

3.5、准备工作

3.5.1、创建表

创建表:youtube_oriyoutube_user_ori

创建表:youtube_orcyoutube_user_orc

youtube_ori

create table youtube_ori(

    videoId string,

    uploader string,

    age int,

    category array<string>,

    length int,

    views int,

    rate float,

    ratings int,

    comments int,

    relatedId array<string>)

row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as textfile;

youtube_user_ori

create table youtube_user_ori(

    uploader string,

    videos int,

    friends int)

clustered by (uploader) into 24 buckets

row format delimited

fields terminated by "\t"

stored as textfile;

 

然后把原始数据插入到orc表中

youtube_orc

create table youtube_orc(

    videoId string,

    uploader string,

    age int,

    category array<string>,

    length int,

    views int,

    rate float,

    ratings int,

    comments int,

    relatedId array<string>)

clustered by (uploader) into 8 buckets

row format delimited fields terminated by "\t"

collection items terminated by "&"

stored as orc;

youtube_user_orc

create table youtube_user_orc(

    uploader string,

    videos int,

    friends int)

clustered by (uploader) into 24 buckets

row format delimited

fields terminated by "\t"

stored as orc;

3.5.2、导入ETL后的数据

youtube_ori

load data inpath "/youtube/output/video/2008/0222" into table youtube_ori;

youtube_user_ori

load data inpath "/youtube/user/2008/0903" into table youtube_user_ori;

3.5.3、向ORC表插入数据

youtube_orc

insert into table youtube_orc select * from youtube_ori;

youtube_user_orc

insert into table youtube_user_orc select * from youtube_user_ori;

3.6、业务分析

3.6.1、统计视频观看数Top10

思路:

1) 使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。

最终代码:

select

    videoId,

    uploader,

    age,

    category,

    length,

    views,

    rate,

    ratings,

    comments

from

    youtube_orc

order by

    views

desc limit

    10;

3.6.2、统计视频类别热度Top10

思路:

1) 即统计每个类别有多少个视频,显示出包含视频最多的前10个类别。

2) 我们需要按照类别group by聚合,然后count组内的videoId个数即可。

3) 因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行列转行(展开),然后再进行count即可。

4) 最后按照热度排序,显示前10条。

最终代码:

select

    category_name as category,

    count(t1.videoId) as hot

from (

    select

        videoId,

        category_name

    from

        youtube_orc lateral view explode(category) t_catetory as category_name) t1

group by

    t1.category_name

order by

    hot

desc limit

    10;

3.6.3、统计出视频观看数最高的20个视频的所属类别以及类别包含这Top20视频的个数

思路:

1) 先找到观看数最高的20个视频所属条目的所有信息,降序排列

2) 把这20条信息中的category分裂出来(列转行)

3) 最后查询视频分类名称和该分类下有多少个Top20的视频

最终代码:

select

    category_name as category,

    count(t2.videoId) as hot_with_views

from (

    select

        videoId,

        category_name

    from (

        select

            *

        from

            youtube_orc

        order by

            views

        desc limit

            20) t1 lateral view explode(category) t_catetory as category_name) t2

group by

    category_name

order by

    hot_with_views

desc;

3.6.4、统计视频观看数Top50所关联视频的所属类别的热度排名

思路:

1) 查询出观看数最多的前50个视频的所有信息(当然包含了每个视频对应的关联视频),记为临时表t1

t1:观看数前50的视频

select

    *

from

    youtube_orc

order by

    views

desc limit

    50;

2) 将找到的50条视频信息的相关视频relatedId列转行,记为临时表t2

t2:将相关视频的id进行列转行操作

select

    explode(relatedId) as videoId

from

       t1;

3) 将相关视频的idyoutube_orc表进行inner join操作

t5:得到两列数据,一列是category,一列是之前查询出来的相关视频id

(select

    distinct(t2.videoId),

    t3.category

from

    t2

inner join

    youtube_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name;

4) 按照视频类别进行分组,统计每组视频个数,然后排行

最终代码:

select

    category_name as category,

    count(t5.videoId) as hot

from (

    select

        videoId,

        category_name

    from (

        select

            distinct(t2.videoId),

            t3.category

        from (

            select

                explode(relatedId) as videoId

            from (

                select

                    *

                from

                    youtube_orc

                order by

                    views

                desc limit

                    50) t1) t2

        inner join

            youtube_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name) t5

group by

    category_name

order by

    hot

desc;

3.6.5、统计每个类别中的视频热度Top10,以Music为例

思路:

1) 要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开,所以可以创建一张表用于存放categoryId展开的数据。

2) category展开的表中插入数据。

3) 统计对应类别(Music)中的视频热度。

最终代码:

创建表类别表:

create table youtube_category(

    videoId string,

    uploader string,

    age int,

    categoryId string,

    length int,

    views int,

    rate float,

    ratings int,

    comments int,

    relatedId array<string>)

row format delimited

fields terminated by "\t"

collection items terminated by "&"

stored as orc;

向类别表中插入数据:

insert into table youtube_category 

    select

        videoId,

        uploader,

        age,

        categoryId,

        length,

        views,

        rate,

        ratings,

        comments,

        relatedId

    from

        youtube_orc lateral view explode(category) catetory as categoryId;

统计Music类别的Top10(也可以统计其他)

select

    videoId,

    views

from

    youtube_category

where

    categoryId = "Music"

order by

    views

desc limit

    10;

3.6.6、统计每个类别中视频流量Top10,以Music为例

思路:

1) 创建视频类别展开表(categoryId列转行后的表)

2) 按照ratings排序即可

最终代码:

select

    videoId,

    views,

    ratings

from

    youtube_category

where

    categoryId = "Music"

order by

    ratings

desc limit

    10;

3.6.7 、统计上传视频最多的用户 Top10 以及他们上传的观看次数在 20 的视频

思路:

1) 先找到上传视频最多的10个用户的用户信息

select

    *

from

    youtube_user_orc

order by

    videos

desc limit

    10;

2) 通过uploader字段与youtube_orc表进行join,得到的信息按照views观看次数进行排序即可。

最终代码:

select

    t2.videoId,

    t2.views,

    t2.ratings,

    t1.videos,

    t1.friends

from (

    select

        *

    from

        youtube_user_orc

    order by

        videos desc

    limit

        10) t1

join

    youtube_orc t2

on

    t1.uploader = t2.uploader

order by

    views desc

limit

    20;

3.6.8、统计每个类别视频观看数Top10

思路:

1) 先得到categoryId展开的表数据

2) 子查询按照categoryId进行分区,然后分区内排序,并生成递增数字,该递增数字这一列起名为rank

3) 通过子查询产生的临时表,查询rank值小于等于10的数据行即可。

最终代码

select

    t1.*

from (

    select

        videoId,

        categoryId,

        views,

        row_number() over(partition by categoryId order by views desc) rank from youtube_category) t1

where

    rank <= 10;

四、可能出现的问题

4.1JVM堆内存溢出

描述:java.lang.OutOfMemoryError: Java heap space

解决:在yarn-site.xml中加入如下代码

<property>

       <name>yarn.scheduler.maximum-allocation-mb</name>

       <value>2048</value>

</property>

<property>

      <name>yarn.scheduler.minimum-allocation-mb</name>

      <value>2048</value>

</property>

<property>

       <name>yarn.nodemanager.vmem-pmem-ratio</name>

       <value>2.1</value>

</property>

<property>

       <name>mapred.child.java.opts</name>

       <value>-Xmx1024m</value>

</property>


版权声明:本文为morexyoung原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接和本声明。