hive基础语法(2)
10.1.6、表生成函数
a. 行转列函数:explode()
1)、准备数据:假如有以下数据
1,zhangsan,化学:物理:数学:语文
2,lisi,数学:生物:生理:卫生
3,wangwu,化学:语文:英语:体育:生物
2)、映射成一张表:
create table t_stu_stubject(id int,name string,subjects arrays)
row format delimited fields terminated by “,”
collection items terminated by “:”;
3)、加载本地数据到数据表中
load data local inpath “/root/hivetest/stu_sub.data” into table t_stu_subject;
4)、使用explode()对数组字段“炸裂”
如果不使用explode()函数:
select * from t_stu_subject;
使用了explode()函数:
select explode(subjects) from t_stu_subject;
然后我们利用这个explode()的结果,来求去重的课程;
select distinct tmp.sub
from
(select explode(subjects) as sub from t_stu_subject) as tmp;
b. 表生成函数lateral view
1)、引用上张表t_stu_subject
select id,name,tmp.sub
from
t_stu_subject lateral view explode(subjects) tmp as sub;
原理:lateral view相当于两个表在join
左表是原表,右表是explode(某个集合字段)之后产生的表
而且:这个join只在同一行的数据间进行
那样,可以方便做更多的查询:
比如,查询选修了生物的同学
select a.id,a.name,a.sub
from
(select id,name,tmp.sub as sub from t_stu_subject lateral view explode(subjects) tmp as sub) as a
where sub=‘生物’;
案例:有以下文本文件:
hello tom hello jim
hello rose hello tom
tom love rose rose love jim
jim love tom love is what
what is love
需要用hive做wordcount
第一步:建表映射
create table t_wc(sentence string);
第二步:加载本地数据
load data local inpath ‘/root/hivetest/hive_wc.dat’ into table t_wc;
select split(sentence,’ ‘) from t_wc;
select explode(split(sentence,’ ‘)) as word from t_wc;
第三步:做wordcount
hql答案:
select words.word,count(1)
from
(select explode(split(setence,’ ‘)) as word from t_wc) words
group by
words.word;
按照出现的次数进行排序
select words.word,count(1) as cnts
from
(select explode(split(sentence,’ ')) as word from t_wc) as words
group by
words.word order by cnts asc;
10.1.7、集合函数
a. array_contains(Array,value) 返回boolean值 --> 判断一个数组字段中是否含有某个值
eg:select movie _name,array_contains(actors,‘吴刚’) from t_movie;
select array_contains(array(‘a’,‘b’,‘c’),‘c’);
select array_contains(array(‘a’,‘b’,‘c’),‘d’);
b. sort_array(array) 返回排序后的数组
eg: select sort_array(array(‘c’,‘b’,‘a’));
select ‘haha’,sort_array(array(‘c’,‘b’,‘a’)) as xx from (select 0) as tmp;
c. size(Array) 返回一个int值
eg: select name,size(actors) as actor_number from t_movie_detail;
更多:size(Map<K,V>)返回一个int值
map_keys(Map<K,V>) 返回一个数组
map_values(Map<K,V>) 返回一个数组
10.1.7、条件控制函数
10.1.7.1、 case when
语法:
CASE [ expression ]
WHEN condition1 THEN result1
WHEN condition2 THEN result2
…
WHEN conditionn THEN resultn
ELSE result
END
eg:select id,name,
case
when info.age<28 then ‘youngth’
when info.age>27 and info.age<40 then ‘middle’
else ‘old’
end
from t_user;
10.1.7.2 IF
eg:select id,if(age>25,‘working’,‘worked’)
from
t_user;
select name,if(array_contains(actors,‘吴刚’),‘good’,‘terriable’)
from
t_movie_detail;
10.1.8、json解析函数:表生成函数
a. json_tuple函数
eg:
create table t_rate2
as
select json_tuple(json,‘movie’,‘rate’,‘timeStamp’,‘uid’) as (movie,rate,ts,uid) from t_ratting;
产生结果:
10.1.9、窗口分析函数:row_number() over()函数:分组TOPN
10.1.9.1、需求
有如下数据
1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female
需要查询出每种性别中年龄最大的2条数据
10.1.9.2、实现
使用row_number函数,对表中的数据按照性别分组,按照年龄倒序排序并进行标记
准备数据:
hql代码:
select id,age,name,sex,row_number() over(partition by sex order by age desc) as rank
from t_rownumber
结果:
再利用上面的结果,查询出rank<=2的即为最终需求
select id,age,name,sex
from
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber) tmp
where rank<=2;
窗口分析函数应用场景--------级联报表查询:
1)、有如下数据:
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
C,2015-01,10
C,2015-01,20
A,2015-02,4
A,2015-02,6
C,2015-02,30
C,2015-02,10
B,2015-02,10
B,2015-02,5
A,2015-03,14
A,2015-03,6
B,2015-03,20
B,2015-03,25
C,2015-03,10
C,2015-03,20
2)、建表映射
create table t_access_times(username string,month string,counts int)
row format delimited fields terminated by ‘,’;
3)、加载本地数据到表中
load data local inpath ‘/root/hivetest/access_time.dat’ into table t_access_times;
4)、需求:需要开发出sql脚本,来统计出如下累计报表:
方案一:传统思想
第一步:先得出如下数据报表:
select username,month,sum(counts) as all_counts
from
t_access_times
group by username,month;
第二步:create table t_access_amounts
as
select username,month,sum(counts) as amounts
from
t_access_times
group by username,month;
第三步:join连接,将总额表自连接
连接条件是username相同
create table t_access_amount_step2
as
(select a.,b.
from
t_access_amount as a
inner join
t_access_amount as b
on
a.username=b.username); 第四步:再次分组聚合
select a.,b.
from
t_access_amount as a inner join t_access_amount as b
on a.username=b.username
where
a.month <= b.month;
第五步:先建立一个临时表
create table t_access_tmp
as
select b.*,a.amount as a_amount
from
t_access_amount as a inner join t_access_amount as b
on a.username=b.username
where
a.month <= b.month;
第六步:再次分组聚合,根据第五步表中的username和month
select username,month,amount,sum(a_amount) as total
from
t_access_tmp
group by
username,month,amount;
方案二:简单一点的办法—窗口分析函数
第一步:先得到每个月的总和
select * from t_access_amount;
第二步:----窗口分析函数—按照username分区分成不同的窗口,累加当前行和前面所有行
select uid,month,amount,
sum(amount) over(partition by username order by month rows between unbounded preceding and current row)
as
accumulate
from t_access_amount;