hive基础语法（2）

10.1.6、表生成函数
a. 行转列函数：explode()
1）、准备数据：假如有以下数据
1，zhangsan，化学：物理：数学：语文
2，lisi，数学：生物：生理：卫生
3，wangwu，化学：语文：英语：体育：生物
2）、映射成一张表：
create table t_stu_stubject(id int,name string,subjects arrays)
row format delimited fields terminated by “,”
collection items terminated by “:”;
3）、加载本地数据到数据表中
load data local inpath “/root/hivetest/stu_sub.data” into table t_stu_subject;
4）、使用explode()对数组字段“炸裂”
如果不使用explode()函数:
select * from t_stu_subject; hive基础语法（2）
使用了explode()函数:
select explode(subjects) from t_stu_subject;
然后我们利用这个explode()的结果，来求去重的课程；
select distinct tmp.sub
from
(select explode(subjects) as sub from t_stu_subject) as tmp;
b. 表生成函数lateral view
1）、引用上张表t_stu_subject
select id,name,tmp.sub
from
t_stu_subject lateral view explode(subjects) tmp as sub;
hive基础语法（2）
原理：lateral view相当于两个表在join
左表是原表，右表是explode(某个集合字段)之后产生的表
而且:这个join只在同一行的数据间进行
那样，可以方便做更多的查询：
比如，查询选修了生物的同学
select a.id,a.name,a.sub
from
(select id,name,tmp.sub as sub from t_stu_subject lateral view explode(subjects) tmp as sub) as a
where sub=‘生物’； hive基础语法（2）
案例：有以下文本文件：
hello tom hello jim
hello rose hello tom
tom love rose rose love jim
jim love tom love is what
what is love
需要用hive做wordcount
第一步：建表映射
create table t_wc(sentence string);
第二步：加载本地数据
load data local inpath ‘/root/hivetest/hive_wc.dat’ into table t_wc;
hive基础语法（2）
select split(sentence,’ ‘) from t_wc;

select explode(split(sentence,’ ‘)) as word from t_wc;

第三步：做wordcount
hql答案：
select words.word,count(1)
from
(select explode(split(setence,’ ‘)) as word from t_wc) words
group by
words.word;
按照出现的次数进行排序
select words.word,count(1) as cnts
from
(select explode(split(sentence,’ ')) as word from t_wc) as words
group by
words.word order by cnts asc;
hive基础语法（2）
10.1.7、集合函数
a. array_contains(Array,value) 返回boolean值 --> 判断一个数组字段中是否含有某个值
eg:select movie _name,array_contains(actors,‘吴刚’) from t_movie;
select array_contains(array(‘a’,‘b’,‘c’),‘c’);
select array_contains(array(‘a’,‘b’,‘c’),‘d’);
hive基础语法（2）
b. sort_array(array) 返回排序后的数组
eg: select sort_array(array(‘c’,‘b’,‘a’));
select ‘haha’,sort_array(array(‘c’,‘b’,‘a’)) as xx from (select 0) as tmp;

c. size(Array) 返回一个int值
eg: select name,size(actors) as actor_number from t_movie_detail;
hive基础语法（2）
更多：size（Map<K,V>）返回一个int值
map_keys(Map<K,V>) 返回一个数组
map_values(Map<K,V>) 返回一个数组
10.1.7、条件控制函数
10.1.7.1、 case when
语法：
CASE [ expression ]
WHEN condition1 THEN result1
WHEN condition2 THEN result2
…
WHEN conditionn THEN resultn
ELSE result
END
eg：select id,name,
case
when info.age<28 then ‘youngth’
when info.age>27 and info.age<40 then ‘middle’
else ‘old’
end
from t_user;
hive基础语法（2）
10.1.7.2 IF
eg：select id,if(age>25,‘working’,‘worked’)
from
t_user;

select name,if(array_contains(actors,‘吴刚’),‘good’,‘terriable’)
from
t_movie_detail;

10.1.8、json解析函数：表生成函数
a. json_tuple函数
eg：
create table t_rate2
as
select json_tuple(json,‘movie’,‘rate’,‘timeStamp’,‘uid’) as (movie,rate,ts,uid) from t_ratting;
产生结果：
hive基础语法（2）
10.1.9、窗口分析函数：row_number() over()函数：分组TOPN
10.1.9.1、需求
有如下数据
1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female
需要查询出每种性别中年龄最大的2条数据
10.1.9.2、实现
使用row_number函数，对表中的数据按照性别分组，按照年龄倒序排序并进行标记
准备数据：
hive基础语法（2）
hql代码：
select id,age,name,sex,row_number() over(partition by sex order by age desc) as rank
from t_rownumber
结果：

再利用上面的结果，查询出rank<=2的即为最终需求
select id,age,name,sex
from
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber) tmp
where rank<=2;
hive基础语法（2）
窗口分析函数应用场景--------级联报表查询：
1）、有如下数据：
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
C,2015-01,10
C,2015-01,20
A,2015-02,4
A,2015-02,6
C,2015-02,30
C,2015-02,10
B,2015-02,10
B,2015-02,5
A,2015-03,14
A,2015-03,6
B,2015-03,20
B,2015-03,25
C,2015-03,10
C,2015-03,20
2）、建表映射
create table t_access_times(username string,month string,counts int)
row format delimited fields terminated by ‘,’;
3）、加载本地数据到表中
load data local inpath ‘/root/hivetest/access_time.dat’ into table t_access_times; hive基础语法（2）
4）、需求：需要开发出sql脚本，来统计出如下累计报表：
方案一：传统思想
第一步：先得出如下数据报表：
select username,month,sum(counts) as all_counts
from
t_access_times
group by username,month;

第二步：create table t_access_amounts
as
select username,month,sum(counts) as amounts
from
t_access_times
group by username,month; hive基础语法（2）
第三步：join连接，将总额表自连接
连接条件是username相同
create table t_access_amount_step2
as
(select a.,b.
from
t_access_amount as a
inner join
t_access_amount as b
on
a.username=b.username);
第四步：再次分组聚合
select a.,b.
from
t_access_amount as a inner join t_access_amount as b
on a.username=b.username
where
a.month <= b.month;
hive基础语法（2）
第五步：先建立一个临时表
create table t_access_tmp
as
select b.*,a.amount as a_amount
from
t_access_amount as a inner join t_access_amount as b
on a.username=b.username
where
a.month <= b.month;

第六步：再次分组聚合，根据第五步表中的username和month
select username,month,amount,sum(a_amount) as total
from
t_access_tmp
group by
username,month,amount;
hive基础语法（2）
方案二：简单一点的办法—窗口分析函数
第一步：先得到每个月的总和
select * from t_access_amount;

第二步：----窗口分析函数—按照username分区分成不同的窗口，累加当前行和前面所有行
select uid,month,amount,
sum(amount) over(partition by username order by month rows between unbounded preceding and current row)
as
accumulate
from t_access_amount;
hive基础语法（2）

hive基础语法（2）

相关推荐