[Hive07] Hive Functions, Joins, and Execution Plans

Built-in functions

1. show functions; lists all available functions

2. hive> desc function substr; shows a short description of the function substr

substr(str, pos[, len]) - returns the substring of str that starts at pos and is of length len orsubstr(bin, pos[, len]) - returns the slice of byte array that starts at pos and is of length len

3. hive> desc function extended substr; shows the detailed description of substr, including synonyms and examples

substr(str, pos[, len]) - returns the substring of str that starts at pos and is of length len orsubstr(bin, pos[, len]) - returns the slice of byte array that starts at pos and is of length len
Synonyms: substring
pos is a 1-based index. If pos<0 the starting position is determined by counting backwards from the end of str.
Example:
> SELECT substr('Facebook', 5) FROM src LIMIT 1;
'book'
> SELECT substr('Facebook', -5) FROM src LIMIT 1;
'ebook'
> SELECT substr('Facebook', 5, 1) FROM src LIMIT 1;

'b'
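
The position rule in the help text (1-based index, negative pos counted from the end) is easy to get wrong, so here is a small Java sketch of it. It is not Hive's implementation: the class name HiveSubstrSketch is hypothetical, and returning an empty string for an out-of-range position is an assumption of this sketch.

```java
// Sketch of the 1-based / negative-position rule from the substr help text.
// HiveSubstrSketch is a hypothetical name; this is not Hive's own code.
class HiveSubstrSketch {

    // Substring from 1-based position pos to the end of str.
    static String substr(String str, int pos) {
        return substr(str, pos, Integer.MAX_VALUE);
    }

    // At most len characters starting at 1-based position pos.
    // A negative pos counts backwards from the end of str.
    static String substr(String str, int pos, int len) {
        if (str == null) {
            return null;
        }
        if (len <= 0) {
            return "";
        }
        int n = str.length();
        // Convert pos to a 0-based start index.
        int start = pos > 0 ? pos - 1 : (pos < 0 ? n + pos : 0);
        if (start < 0 || start >= n) {
            return ""; // assumption of this sketch: out-of-range gives ""
        }
        // Use long arithmetic so start + len cannot overflow.
        int end = (int) Math.min((long) start + len, n);
        return str.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(substr("Facebook", 5));    // book
        System.out.println(substr("Facebook", -5));   // ebook
        System.out.println(substr("Facebook", 5, 1)); // b
    }
}
```

The three calls in main reproduce the three examples printed by desc function extended substr.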

4. hive> desc function extended concat; shows the detailed description of the function concat
OK
concat(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN or concat(bin1, bin2, ... binN) - returns the concatenation of bytes in binary data bin1, bin2, ... binN
Returns NULL if any argument is NULL.
Example:
> SELECT concat('abc', 'def') FROM src LIMIT 1;
'abcdef'

5. hive> desc function extended lower; shows the detailed description of the function lower
OK
lower(str) - Returns str with all characters changed to lowercase
Synonyms: lcase
Example:
> SELECT lower('Facebook') FROM src LIMIT 1;
'facebook'

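6. The cast function

cast(expr as type) converts a value to another type. A binary value can be cast to string:

binary ==> string

binary cannot be cast to int directly, but it can be converted indirectly:

binary ==> string ==> int

select ename, edept, sal, cast(sal as int) from emp;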
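
The two-step conversion can be pictured with a small Java sketch. In HiveQL the chain is written as nested casts, cast(cast(b as string) as int), where b stands for a hypothetical binary column. The class below only imitates the idea: it assumes the bytes hold UTF-8 text, and it returns null for text that is not an integer, in the spirit of Hive returning NULL for a failed cast.

```java
import java.nio.charset.StandardCharsets;

// Sketch of binary ==> string ==> int. BinaryToIntSketch is a hypothetical
// name, and the UTF-8 encoding is an assumption of this sketch.
class BinaryToIntSketch {

    // Step 1: binary ==> string
    static String binaryToString(byte[] bin) {
        return bin == null ? null : new String(bin, StandardCharsets.UTF_8);
    }

    // Step 2: string ==> int; null when the text is not a valid integer
    static Integer stringToInt(String s) {
        if (s == null) {
            return null;
        }
        try {
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // The whole chain: there is no direct binary ==> int step.
    static Integer binaryToInt(byte[] bin) {
        return stringToInt(binaryToString(bin));
    }

    public static void main(String[] args) {
        byte[] raw = "1500".getBytes(StandardCharsets.UTF_8);
        System.out.println(binaryToInt(raw));
    }
}
```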

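User-defined functions (UDF)

1. Kinds:

UDF: one-to-one (one row in, one value out), e.g. concat/lower; the kind used most often in production
UDAF: many-to-one (aggregation), e.g. count/max
UDTF: one-to-many, e.g. explode used with lateral view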

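2. Developing a UDF

1) Extend the UDF class (org.apache.hadoop.hive.ql.exec.UDF)
2) Implement the evaluate method

Recommendations:
1) evaluate should return a value
2) Use Hadoop types for the parameters (e.g. Text, IntWritable)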
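
To show the shape of an evaluate method, here is a sketch that compiles with the JDK alone. It deliberately leaves out the two Hive-specific parts: a real UDF would extend org.apache.hadoop.hive.ql.exec.UDF and would take and return Hadoop types such as org.apache.hadoop.io.Text instead of String. The class name HelloUdfSketch and the "Hello: " prefix are made up for illustration.

```java
// Dependency-free sketch of the evaluate() shape described above.
// In a real Hive build this class would extend
// org.apache.hadoop.hive.ql.exec.UDF and use org.apache.hadoop.io.Text;
// both are omitted so the sketch runs without Hive on the classpath.
class HelloUdfSketch {

    // One input value in, one value out (one-to-one), with a return value.
    // The "Hello: " prefix is hypothetical behavior chosen for illustration.
    public String evaluate(String name) {
        if (name == null) {
            return null; // pass a NULL input through as NULL
        }
        return "Hello: " + name;
    }

    public static void main(String[] args) {
        System.out.println(new HelloUdfSketch().evaluate("hive"));
    }
}
```

Returning null for a null input keeps the function safe to call on columns that contain NULL values.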

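3. Steps to develop and register a UDF

1) Write the code and package it as a jar.

2) Upload the jar to a directory such as $HADOOP_HOME/lib; this directory can serve as a dedicated place for jars.

In SecureCRT, run rz -be from inside $HADOOP_HOME/lib to upload the file.

3) hive> add jar <absolute path of the jar>; adds the jar to Hive's classpath.

Run list jars; to see which jars have been added.

4) hive> create [temporary] function say_hello as 'com.ruozedata.udf.HelloUDF';
The quoted string is the fully qualified class name.

A temporary function exists only in the current session; it is no longer available once a new session is opened.

At this point show functions; lists the custom function.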

5. JOIN

Table A:

id name
1 lily
2 candy
3 jack

Table B:

id age
1 21
2 22
4 19

1) (inner) join

select A.*,B.age from A join B on A.id=B.id;

1 lily 21
2 candy 22

2) left join

select A.*,B.age from A left join B on A.id=B.id;

1 lily 21
2 candy 22
3 jack NULL

3) right join

select A.*,B.age from A right join B on A.id=B.id;

1 lily 21
2 candy 22
NULL NULL 19

B.id is not selected, so the row with B.id = 4 shows NULL in the id column (A.id has no match).

select A.*,B.* from A right join B on A.id=B.id;

1 lily 1 21
2 candy 2 22
NULL NULL 4 19

4) full join: keeps every row from both tables

select A.*,B.age from A full join B on A.id=B.id;

1 lily 21
2 candy 22
3 jack NULL
NULL NULL 19

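5) cross join: Cartesian product

With a join condition, only the matching rows are returned:

select A.*,B.age from A join B on A.id=B.id;

1 lily 21
2 candy 22

Without a join condition, every row of A is paired with every row of B (3 x 3 = 9 rows):

select A.*,B.age from A cross join B;

1 lily 21
1 lily 22
1 lily 19
2 candy 21
2 candy 22
2 candy 19
3 jack 21
3 jack 22
3 jack 19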
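
The join types on tables A and B can be replayed in a few lines of Java. This is only a sketch of the semantics, not of how Hive executes joins: the class JoinTypesSketch and its method names are hypothetical, it relies on id being unique in both tables, and the row order it prints is not something Hive guarantees.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of join semantics on the sample tables A(id, name) and B(id, age).
class JoinTypesSketch {
    static final Map<Integer, String> A = new LinkedHashMap<>();
    static final Map<Integer, Integer> B = new LinkedHashMap<>();
    static {
        A.put(1, "lily"); A.put(2, "candy"); A.put(3, "jack");
        B.put(1, 21); B.put(2, 22); B.put(4, 19);
    }

    // Rows of "A.id A.name B.age" for a join on A.id = B.id.
    // keepLeft keeps unmatched rows of A, keepRight keeps unmatched rows of B:
    // inner = (false, false), left = (true, false),
    // right = (false, true), full = (true, true).
    static List<String> join(boolean keepLeft, boolean keepRight) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Integer, String> a : A.entrySet()) {
            Integer age = B.get(a.getKey());
            if (age != null || keepLeft) {
                String ageText = age == null ? "NULL" : String.valueOf(age);
                rows.add(a.getKey() + " " + a.getValue() + " " + ageText);
            }
        }
        if (keepRight) {
            for (Map.Entry<Integer, Integer> b : B.entrySet()) {
                if (!A.containsKey(b.getKey())) {
                    rows.add("NULL NULL " + b.getValue());
                }
            }
        }
        return rows;
    }

    // Cartesian product: every row of A paired with every row of B.
    static List<String> crossJoin() {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Integer, String> a : A.entrySet()) {
            for (Integer age : B.values()) {
                rows.add(a.getKey() + " " + a.getValue() + " " + age);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println("inner: " + join(false, false));
        System.out.println("left:  " + join(true, false));
        System.out.println("right: " + join(false, true));
        System.out.println("full:  " + join(true, true));
        System.out.println("cross: " + crossJoin().size() + " rows");
    }
}
```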

6. Joins in Hive:

select e.empno, e.ename, e.deptno, d.dname from emp e join ruoze_dept d on e.deptno = d.deptno ;

shuffle: rows with the same deptno are sent to the same reducer
emp: <deptno, (e.empno, e.ename)>
ruoze_dept: <deptno, (d.dname)>

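1) common join / reduce join / shuffle join

(Diagram: common join data flow. Worth memorizing.)

In a reduce join, the mappers read the rows of each table and hand them to the shuffle.
The shuffle groups the rows by key,
and each group is then passed to a reducer, which performs the join.

Two tables: 2 mappers.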
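
The three phases can be imitated in plain Java to make the data flow concrete. This is a single-process sketch, not Hive or MapReduce code: the class ReduceJoinSketch, the "E"/"D" tags, and the sample rows are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of a reduce-side (common) join.
// emp rows: {empno, ename, deptno}; dept rows: {deptno, dname}.
class ReduceJoinSketch {

    static List<String> join(String[][] emp, String[][] dept) {
        // Map phase: emit <deptno, tagged value>. The tag records which
        // table a value came from so the reducer can tell them apart.
        List<String[]> mapOutput = new ArrayList<>();
        for (String[] e : emp) {
            mapOutput.add(new String[] {e[2], "E", e[0] + " " + e[1]});
        }
        for (String[] d : dept) {
            mapOutput.add(new String[] {d[0], "D", d[1]});
        }

        // Shuffle phase: group all values that share the same key.
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String[] kv : mapOutput) {
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv);
        }

        // Reduce phase: inside one key group, pair emp values with dept values.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> g : groups.entrySet()) {
            List<String> emps = new ArrayList<>();
            List<String> depts = new ArrayList<>();
            for (String[] kv : g.getValue()) {
                if ("E".equals(kv[1])) {
                    emps.add(kv[2]);
                } else {
                    depts.add(kv[2]);
                }
            }
            for (String e : emps) {
                for (String d : depts) {
                    result.add(e + " " + g.getKey() + " " + d);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the original tables.
        String[][] emp = {{"7369", "SMITH", "20"}, {"7499", "ALLEN", "30"}};
        String[][] dept = {{"20", "RESEARCH"}, {"30", "SALES"}, {"40", "OPERATIONS"}};
        for (String row : join(emp, dept)) {
            System.out.println(row);
        }
    }
}
```

Department 40 has no employees, so its group contains only a "D" value and produces no output row, which is the inner-join behavior of the query above.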

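2) map join
The parameter hive.auto.convert.join defaults to true in Hive; when it is on, Hive converts a common join into a map join if one of the tables is small enough to be held in memory.

It can be switched off per session to force a common join:

hive> set hive.auto.convert.join=false;

(Diagram: map join data flow. Worth memorizing.)
The join is completed in memory with mappers only and no reducers, so it performs better.
Two tables: 1 mapper.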
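
The in-memory idea behind a map join is a hash lookup: load the small table into a hash table, then stream the large table past it. The Java sketch below shows only that idea. The class MapJoinSketch and the sample rows are hypothetical, and it assumes deptno is unique in the small table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the hash-lookup idea behind a map join.
// emp rows: {empno, ename, deptno}; dept rows: {deptno, dname}.
class MapJoinSketch {

    static List<String> join(String[][] emp, String[][] dept) {
        // Build step: hold the small table in memory, keyed by deptno.
        Map<String, String> dnameByDeptno = new HashMap<>();
        for (String[] d : dept) {
            dnameByDeptno.put(d[0], d[1]);
        }

        // Probe step: stream the large table and look each row up.
        // There is no shuffle and no reduce phase.
        List<String> result = new ArrayList<>();
        for (String[] e : emp) {
            String dname = dnameByDeptno.get(e[2]);
            if (dname != null) {
                result.add(e[0] + " " + e[1] + " " + e[2] + " " + dname);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the original tables.
        String[][] emp = {{"7369", "SMITH", "20"}, {"7499", "ALLEN", "30"}};
        String[][] dept = {{"20", "RESEARCH"}, {"30", "SALES"}, {"40", "OPERATIONS"}};
        for (String row : join(emp, dept)) {
            System.out.println(row);
        }
    }
}
```

Because each large-table row is resolved with one lookup, the rows never have to be regrouped by key, which is why the reducer disappears.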

8. Execution plans in Hive

1)

hive> set hive.auto.convert.join=false;

explain [extended] <query>;

2)

hive> set hive.auto.convert.join=true;

explain [extended] <query>;

The plans produced by 1) and 2) differ substantially: 1) matches the common join diagram and 2) matches the map join diagram.


[From @若泽大数据]
