hive sql练习

用来作练习的两张表的结构及部分数据如下:可以看到,1个用户可以有很多个订单,每个订单里拥有很多个商品,
hive sql练习题目一:每个用户有多少个订单

思路:在orders表中按照user_id分组,组内统计订单个数(去重)

select user_id, count(distinct order_id) as ord_cnt
from orders
group by user_id
order by ord_cnt desc
limit 20;

题目二:每个用户平均每个订单有多少商品

思路:先根据order_products_prior表算出每个订单有多少商品,按照order_id分组统计商品数,

selct  order_id, count(product_id) as prod_cnt
from order_products_prior
group by order_id

再根据orders表和上述结果作join,算出每个用户(用户是重复的)的每个订单有多少商品,然后按照user_id分组,组内计算每个用户的商品数和订单数,求平均

select t1.user_id, sum(t2.prod_cnt)/count(t1.order_id) 
from orders t1 
join(
   selct  order_id, count(product_id) as prod_cnt
   from order_products_prior
   group by order_id
) t2 on t1.order_id = t2.order_id
group by t1.user_id
limit 20;

或者在作join后使用avg函数去求平均值,需要注意的是order_id要没有重复才可用avg

select t1.user_id, avg(t2.prod_cnt) 
from orders t1 
join(
   selct  order_id, count(product_id) as prod_cnt
   from order_products_prior
   group by order_id
) t2 on t1.order_id = t2.order_id
group by t1.user_id
limit 20;

题目三:每个用户在一周中购买订单数量的分布,形如user_id,dow_0,dow_1,…,dow_6

思路:对orders表,先按user_id分组,然后使用case when语句增加dow_0, dow_, …, dow_6这7列字段,对每个新增字段比如dow_0,遍历每行记录作判断,如果该行的order_dow值为0,增对应的dow_0字段的值为1,否则为0;最后组内统计每个新增字段的加和值
hive sql练习

select user_id,
sum(case order_dow when '0' then 1 else 0 end) as dow_0,
sum(case order_dow when '1' then 1 else 0 end) as dow_1,
sum(case order_dow when '2' then 1 else 0 end) as dow_2,
sum(case order_dow when '3' then 1 else 0 end) as dow_3,
sum(case order_dow when '4' then 1 else 0 end) as dow_4,
sum(case order_dow when '5' then 1 else 0 end) as dow_5,
sum(case order_dow when '6' then 1 else 0 end) as dow_6
from orders
group by user_id
limit 20;

或者使用count,这时不满足判定条件的不赋值

select user_id,
count(case order_dow when '0' then 1 end) as dow_0,
count(case order_dow when '1' then 1 end) as dow_1,
count(case order_dow when '2' then 1 end) as dow_2,
count(case order_dow when '3' then 1 end) as dow_3,
count(case order_dow when '4' then 1 end) as dow_4,
count(case order_dow when '5' then 1 end) as dow_5,
count(case order_dow when '6' then 1 end) as dow_6
from orders
group by user_id
limit 20;

这里再设想一下,如果orders表中同一天内有重复的order_id,那么then就应该改为then order_id,最后的聚合函数改为count(distinct case…),不过这种情况不太可能会发生,一个用户每次购买的订单,都会有唯一的订单id

题目四:(把days_since_prior_order当作date,会有重复)计算每个用户平均一天(购买那几天)购买商品数量

day1: 2个订单: 订单1有 10个product,订单2有15个product,则共有25个
day2:一个订单 ,12个product
day3:没有购买

user_id平均一天购买商品数为 (25+12)/2

思路:因为orders表中days_since_prior_order字段有的是空字符’’,但是该条记录是一次购买行为,所以要先处理空字符’’,为计算每个用户(user_id有重复的)的购买天数作准备

select user_id, if(days_since_prior_order='','0.0',days_since_prior_order) as dt
from orders 

然后求出每个用户(去重)的购买天数

select t.user_id, count(distinct t.dt) as day_cnt
from (
   select user_id, if(days_since_prior_order='','0.0',days_since_prior_order) as dt
   from orders 
) t
group by  t.user_id
order by cast(t.user_id as int)
limit 20;

接着根据order_products_prior表计算出每个订单有多少商品

select order_id, count(product_id) as prod_cnt
from order_products_prior
group by order_id

最后将处理空字符后的结果与上述结果根据order_id作join,得到user_id(重复),order_id, dt, prod_cnt,再根据user_id分组,组内计算每个用户购买的商品总数和购买天数

select t1.user_id, sum(t2.prod_cnt)/count(distinct t1.dt) as avg_day_prod_cnt
from (
      select user_id, order_id, if(days_since_prior_order='','0.0',days_since_prior_order) as dt
      from orders 
) t1 join(
      select order_id, count(product_id) as prod_cnt
      from order_products_prior
      group by order_id
) t2 on t1.order_id=t2.order_id
group by t1.user_id
limit 20;

或者使用avg函数处理去重后的天数问题,在上述join后按user_id,dt分组,组内求出商品总数,最后按user_id分组,对prod_cnt使用avg函数

select t3.user_id, avg(t3.prod_cnt) as avg_day_prod_cnt
from(
       select t1.user_id, t1.dt, sum(t2.prod_cnt) as prod_cnt
       from (
             select user_id, order_id, if(days_since_prior_order='','0.0',days_since_prior_order) as dt
             from orders 
       ) t1 join(
             select order_id, count(product_id) as prod_cnt
             from order_products_prior
             group by order_id
       ) t2 on t1.order_id=t2.order_id
      group by t1.user_id, t1.dt
)t3
group by t3.user_id
limit 20;

题目五:每个用户最喜爱购买的三个product是什么,最终表结构可以是3个列,或者是一个字符串
user_id | top1 | top2 | top3
123 | prod_23 | prod_12 | prod_35

先根据orders表和order_products_prior表根据order_id作join,得出每个用户(user_id重复)每个订单每个商品,再根据user_id,product_id分组,计算每组的个数即用户购买该商品的次数(user_id, product_id, prod_buy_cnt)

select t1.user_id, t2.product_id, count(1) as prod_buy_cnt
from orders t1
join order_products_prior t2
on t1.order_id=t2.order_id
group by t1.user_id,t2.product_id

接下来涉及到组内排序,这里先介绍row_number() over()排序功能


row_number() over()

在使用 row_number() over()函数时候,over()里头的分组以及排序的执行晚于 where group by order by 的执行。

partition by 用于给结果集分组,如果没有指定那么它把整个结果集作为一个分组,它和聚合函数不同的地方在于它能够返回一个分组中的多条记录,而聚合函数一般只有一个反映统计值的记录。

例如:employee,根据部门分组排序
hive sql练习
对查询结果进行排序:(无分组)
hive sql练习
row_number() over()返回的是排序后的顺序编号(从1开始)

rank() over()

rank() over() 是跳跃排序,有两个第二名时接下来就是第四名(同样是在各个分组内)
hive sql练习
dense_rank() over()
dense_rank() over()是连续排序,有两个第二名时仍然跟着第三名。相比之下row_number是没有重复值的
hive sql练习


再回到我们的题目上,接下来使用row_number对一个用户的不同商品,按照购买次数进行降序排列

select t3.user_id, t3.product_id, t3.prod_buy_cnt,
row_number() over (partirion by t3.user_id order by t3.prod_buy_cnt desc) as row_num
from(
        select t1.user_id, t2.product_id, count(1) as prod_buy_cnt
        from orders t1
        join order_products_prior t2
        on t1.order_id=t2.order_id
        group by t1.user_id,t2.product_id
) t3

最后对上述结果取row_num<4,并按user_id分组后,使用collect_list函数将每个user_id购买数量最多的前三个product_id及其对应的row_num用分隔符’_'拼接后放到一个数组里

select t.user_id, collect_list(concat_ws('_', t.product_id, cast(t.row_num as string))) as top_3_prods
from
(
		select t3.user_id, t3.product_id, t3.prod_buy_cnt,
		row_number() over (partirion by t3.user_id order by t3.prod_buy_cnt desc) as row_num
		from(
		        select t1.user_id, t2.product_id, count(1) as prod_buy_cnt
		        from orders t1
		        join order_products_prior t2
		        on t1.order_id=t2.order_id
		        group by t1.user_id,t2.product_id
		) t3
)t
where t.row_num<4
group by t.user_id
limit 20;