Optimizing DISTINCT and GROUP BY in Hive

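1. Avoid COUNT(DISTINCT ...), which easily causes performance problems
select count(distinct user_id) from a;
Because the values must be deduplicated globally, Hive sends all of the map-stage output to a single reduce task, which easily becomes a bottleneck. The query can be optimized by running GROUP BY first and counting the groups afterwards.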
Optimized:
select count(*)
from (
  select user_id from a group by user_id
) tmp;
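
One caveat with this rewrite: count(distinct user_id) ignores NULL values, whereas group by user_id produces a separate group for NULL, so the grouped form can return a result that is one higher when user_id contains NULLs. A minimal sketch that keeps the two results equal, assuming the same table a and column user_id used in this section:

```sql
-- The inner GROUP BY deduplicates in parallel across many reduce tasks;
-- the outer query only has to count the already-distinct rows.
select count(*)
from (
  select user_id
  from a
  where user_id is not null   -- match COUNT(DISTINCT), which skips NULLs
  group by user_id
) tmp;
```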

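2. Skew caused by GROUP BY
Suppose we count orders per supplier from a sales detail table. A few large suppliers have a very large number of orders, while most suppliers have only a modest number. Because GROUP BY distributes rows to reduce tasks by supplier ID, the reduce tasks that receive the large suppliers are handed far more orders than the others, which results in data skew.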
Mitigation:
set hive.map.aggr = true;
set hive.groupby.skewindata = true;
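
hive.map.aggr enables partial aggregation on the map side. With hive.groupby.skewindata enabled, Hive plans the aggregation as two jobs: the first spreads rows randomly across reduce tasks and computes partial aggregates, and the second merges those partial results by the real GROUP BY key. The sketch below writes roughly the same idea by hand to show what the setting does. The table sales_detail, the column supplier_id, and the bucket count of 16 are hypothetical and chosen only for illustration.

```sql
-- Hand-written approximation of the two-stage plan (hypothetical table and columns).
select supplier_id, sum(partial_cnt) as order_cnt
from (
  -- Stage 1: a random salt splits each supplier's rows into up to 16 groups,
  -- so the orders of a large supplier are counted by several reduce tasks.
  select supplier_id, salt, count(*) as partial_cnt
  from (
    select supplier_id, cast(floor(rand() * 16) as int) as salt
    from sales_detail
  ) salted
  group by supplier_id, salt
) stage1
-- Stage 2: merge the partial counts for each supplier.
group by supplier_id;
```

In practice the two settings are usually enough, and the manual form is only worth considering when they do not relieve the skew.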