Big Data Lesson 3: Using Hive

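Before learning Hive, it helps to know what it is. Hive is a data warehouse built on Hadoop. It maps structured data files to tables and provides SQL-like queries; under the hood, Hive translates SQL statements into MapReduce jobs. Compared with writing MapReduce jobs in Java, Hive has clear advantages: fast development, lower staffing cost, scalability (the cluster can be resized freely), and extensibility (support for user-defined functions).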

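Hive architecture:

Hive provides three kinds of user interface: the CLI, HWI, and clients. A client uses the JDBC driver to operate Hive remotely over Thrift. HWI provides a web interface for remote access to Hive. The most common way to use Hive is still the CLI, that is, operating Hive from a Linux terminal.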

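Hive can be installed in three modes:

1. Embedded mode (metadata is stored in the embedded Derby database; only one session may connect at a time, and opening a second session fails with an error, so it is not suitable for a development environment)

2. Local mode (a locally installed MySQL replaces Derby as the metadata store)

3. Remote mode (a remotely installed MySQL replaces Derby as the metadata store)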

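Installing Hive (local mode):

Hive is installed on top of a correctly installed Hadoop cluster, and the cluster must be running.

Before installing Hive, install MySQL first.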

Check whether MySQL is already installed: rpm -qa | grep mysql

Check which packages are available: yum list 'mysql*'

Install the MySQL client: yum install -y mysql

Install the MySQL server: yum install -y mysql-server

yum install -y mysql-devel

Start the database: service mysqld start, or /etc/init.d/mysqld start

Create the hadoop user and grant it privileges:

mysql>grant all on *.* to 'hadoop'@'%' identified by 'hadoop';

mysql>grant all on *.* to 'hadoop'@'localhost' identified by 'hadoop';

mysql>grant all on *.* to 'hadoop'@'master' identified by 'hadoop';

mysql>flush privileges;

Then download the version you need from the Hive website: hive.apache.org or archive.apache.org

Extract it: tar -zxvf apache-hive-1.2.1-bin.tar.gz

Configure it: cd /apache-hive-1.2.1-bin/conf/, then vim hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.local</name>
    <!-- the value is missing in the original; true is assumed here for local mode -->
    <value>true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hive?characterEncoding=UTF-8</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hadoop</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hadoop</value>
  </property>
</configuration>

Copy the JDBC driver dependency: cp mysql-connector-java-5.1.43-bin.jar apache-hive-1.2.1-bin/lib/

Configure the environment variables:

export HIVE_HOME=$PWD/apache-hive-1.2.1-bin

export PATH=$PATH:$HIVE_HOME/bin

Start Hive: hive

Shell commands can be run from inside Hive by prefixing them with !, for example: !ls;

Hadoop commands can also be run from inside Hive with dfs, for example: dfs -ls /;

Hive data types:

Primitive types: TINYINT SMALLINT INT BIGINT FLOAT DOUBLE BOOLEAN STRING

Complex types: STRUCT MAP ARRAY

Using Hive:

Table creation statements:

DDL:

Create a managed (internal) table:

create table mytable(

id int,

name string)

row format delimited fields terminated by '\t' stored as textfile;
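To make the "data file mapped to a table" idea concrete, here is a minimal sketch in plain Java of how one tab-delimited line could be read against the schema (id int, name string). The class DelimitedRowSketch and its parseRow method are hypothetical names invented for this illustration and are not part of Hive; the null-on-bad-value behaviour is meant to mirror what Hive's default text format does, under that assumption.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Hive: mimics reading one line of a
// tab-delimited text file against the schema (id int, name string).
class DelimitedRowSketch {

    // Splits a line on the field delimiter and returns {id, name}.
    // An id that is not a valid integer becomes null, and a missing
    // name field becomes null as well.
    static Object[] parseRow(String line) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        Integer id = null;
        try {
            id = Integer.valueOf(fields[0]);
        } catch (NumberFormatException e) {
            // unparsable value: leave id as null
        }
        String name = fields.length > 1 ? fields[1] : null;
        return new Object[] { id, name };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseRow("1\tzhangsan"))); // [1, zhangsan]
        System.out.println(Arrays.toString(parseRow("abc\tlisi")));   // [null, lisi]
    }
}
```

The point of the sketch is that the schema is applied when the file is read, so a malformed line yields null columns rather than a load error.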

Create an external table (keyword: external):

create external table mytable2(
id int,
name string)
row format delimited fields terminated by '\t' location '/user/hive/warehouse/mytable2';

Create a partitioned table: the partition columns are declared in partitioned by (...)

create table mytable3(
id int,
name string)
partitioned by (sex string) row format delimited fields terminated by '\t' stored as textfile;

Load data into a static partition:

load data local inpath '/root/hivedata/boy.txt' overwrite into table mytable3 partition(sex='boy');

Add a partition:

alter table mytable3 add partition (sex='unknown') location '/user/hive/warehouse/mytable3/sex=unknown';

Drop a partition: alter table mytable3 drop if exists partition(sex='unknown');

Partitioned tables use static partitioning by default; dynamic partitioning can be enabled as follows:

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

Load data into the partitioned table:

insert into table mytable3 partition (sex) select id,name,'boy' from student_mdf;
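As a rough illustration of what the dynamic-partition insert does, the sketch below routes rows to partition directories based on the last selected column, using the same sex=value directory naming that appears in the add partition example. DynamicPartitionSketch and route are hypothetical names, the sample rows are invented, and the real work is of course done by Hive, not by code like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Hive code: shows how rows of the form
// {id, name, sex} end up under one directory per distinct partition value.
class DynamicPartitionSketch {

    // Returns a map from partition directory to the rows stored there.
    // The partition value is taken from the last column of each row and is
    // encoded in the directory name, so only {id, name} is kept as row data.
    static Map<String, List<String[]>> route(String tableDir, List<String[]> rows) {
        Map<String, List<String[]>> partitions = new LinkedHashMap<>();
        for (String[] row : rows) {
            String sex = row[row.length - 1];
            String dir = tableDir + "/sex=" + sex;
            partitions.computeIfAbsent(dir, k -> new ArrayList<>())
                      .add(new String[] { row[0], row[1] });
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] { "1", "zhangsan", "boy" },
                new String[] { "2", "lili", "girl" },
                new String[] { "3", "lisi", "boy" });
        Map<String, List<String[]>> out = route("/user/hive/warehouse/mytable3", rows);
        for (Map.Entry<String, List<String[]>> e : out.entrySet()) {
            System.out.println(e.getKey() + " holds " + e.getValue().size() + " row(s)");
        }
    }
}
```

In the insert statement above every row carries the constant 'boy', so all rows would land in the single sex=boy partition; selecting a real column instead would spread them across several partitions.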

List the table's partitions: show partitions mytable3;

Query the partitioned table: select * from mytable3;

Show the table schema: desc mytable3;

DML:

Rename a table: alter table student rename to student_mdf;

Add a column: alter table student_mdf add columns (sex string);

Rename a column: alter table student_mdf change sex gender string;

Replace the column definitions: alter table student_mdf replace columns (id string, name string);

Load data: (local data) load data local inpath '/home/lym/zs.txt' overwrite into table student_mdf;

(HDFS data) load data inpath '/zs.txt' into table student_mdf;

Insert a single row: insert into table student_mdf values('1','zhangsan');

Create a table from a query result: create table mytable5 as select id, name from mytable3;

Export data: (to the local filesystem) insert overwrite local directory '/root/hivedata/mytable5.txt' select * from mytable5;

(to HDFS)

insert overwrite directory 'hdfs://master:9000/user/hive/warehouse/mytable5_load' select * from mytable5;

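Querying data:

Query the whole table: select * from mytable3;

Select the student ID and name columns from the student table: select uid,uname from student;

Count how many times each name occurs in the student table: select uname,count(*) from student group by uname;

Other commonly used clauses include having, order by, sort by, distribute by, and cluster by.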
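The difference between these sorting clauses is easier to see in code. In Hive, distribute by decides which reducer each row goes to, sort by orders rows within each reducer only, and cluster by on a column is shorthand for distributing and sorting by that same column. The sketch below imitates cluster by in plain Java; DistributeSortSketch is a hypothetical name, and String.hashCode stands in for Hive's own hash function, which is an assumption made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration, not Hive code: imitates "cluster by key"
// as "distribute by key" followed by "sort by key".
class DistributeSortSketch {

    static List<List<String>> clusterBy(List<String> keys, int reducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<>());
        }
        // distribute by: equal keys always reach the same reducer
        for (String key : keys) {
            int reducer = (key.hashCode() & Integer.MAX_VALUE) % reducers;
            buckets.get(reducer).add(key);
        }
        // sort by: each reducer sorts only its own rows (no global order)
        for (List<String> bucket : buckets) {
            Collections.sort(bucket);
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(clusterBy(Arrays.asList("b", "a", "c", "a"), 2));
    }
}
```

By contrast, order by produces one globally sorted result, which is why it is usually the most expensive of the four on large tables.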

Join queries include:

Inner join: returns the rows that satisfy the join condition on both sides

select * from t_a a inner join t_b b on a.id=b.id;

Left outer join: every row of the left table is kept; where the right table has no match, its columns are shown as null

select * from t_a a left join t_b b on a.id=b.id;

Right outer join: the mirror image of the left outer join

select * from t_a a right join t_b b on a.id=b.id;

Left semi join: returns rows from the left table, provided that they satisfy the on condition against the right table.

select * from t_a a left semi join t_b b on a.id=b.id;

Full outer join:

select * from t_a a full join t_b b on a.id=b.id;

in/exists keywords (subqueries in the where clause are supported since Hive 0.13): the effect is the same as left semi join

select * from t_a a where a.id in (select id from t_b);

select * from t_a a where exists (select * from t_b b where a.id = b.id);
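The join types above differ mainly in what happens to unmatched rows and to duplicate matches. The following sketch works that out on two small in-memory id lists; JoinSketch and its methods are hypothetical names for illustration only and the sample ids are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Hive code: join semantics on the id column
// of two tiny tables represented as int arrays.
class JoinSketch {

    // inner join: one output row per matching (a, b) pair
    static List<String> innerJoin(int[] aIds, int[] bIds) {
        List<String> out = new ArrayList<>();
        for (int a : aIds) {
            for (int b : bIds) {
                if (a == b) out.add(a + "," + b);
            }
        }
        return out;
    }

    // left outer join: unmatched left rows are kept, padded with null
    static List<String> leftJoin(int[] aIds, int[] bIds) {
        List<String> out = new ArrayList<>();
        for (int a : aIds) {
            boolean matched = false;
            for (int b : bIds) {
                if (a == b) {
                    out.add(a + "," + b);
                    matched = true;
                }
            }
            if (!matched) out.add(a + ",null");
        }
        return out;
    }

    // left semi join: each left row appears at most once, left columns only
    static List<Integer> leftSemiJoin(int[] aIds, int[] bIds) {
        List<Integer> out = new ArrayList<>();
        for (int a : aIds) {
            for (int b : bIds) {
                if (a == b) {
                    out.add(a);
                    break; // stop at the first match
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = { 1, 2, 3 };
        int[] b = { 2, 2, 4 };
        System.out.println(innerJoin(a, b));    // [2,2, 2,2]
        System.out.println(leftJoin(a, b));     // [1,null, 2,2, 2,2, 3,null]
        System.out.println(leftSemiJoin(a, b)); // [2]
    }
}
```

Note how the duplicate id 2 in the right table doubles the inner join output but not the semi join output, which is the same behaviour an in or exists filter gives.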

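Running Hive from the shell:

-e: execute the given HQL from the command line, for example: hive -e 'show tables;'

-f: execute an HQL script file

-v: echo the executed HQL statements to the console

Built-in functions

List the built-in functions: show functions;

Show the details of a function: DESC FUNCTION abs;

Important, commonly used built-in functions: sum(): sum; count(): number of rows; avg(): average;

distinct: deduplication (a keyword rather than a function); min(): minimum; max(): maximum

User-defined functions:

1. Write a simple Java class that extends org.apache.hadoop.hive.ql.exec.UDF and defines one or more overloaded evaluate methods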

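package org.day0914; // must match the class name used in CREATE TEMPORARY FUNCTION

import org.apache.hadoop.hive.ql.exec.UDF;

public final class AddUdf extends UDF {

    // Adds two integers; returns null if either argument is null.
    public Integer evaluate(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    // Adds two doubles; returns null if either argument is null.
    public Double evaluate(Double a, Double b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }
}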

2. Package it as a jar and upload it to the server

3. Add the jar to Hive: add jar /home/lan/jar/addudf.jar;

4. Create a temporary function and bind it to the compiled class:

CREATE TEMPORARY FUNCTION add_example AS 'org.day0914.AddUdf';

5. Use the custom function: SELECT add_example(scores.math, scores.art) FROM scores;

Drop the temporary function: DROP TEMPORARY FUNCTION add_example;
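Before packaging the jar, it can be handy to check the evaluate logic locally without a Hive classpath. The sketch below copies the null handling of the two evaluate overloads into a plain class; AddLogicSketch and add are hypothetical names, and the class deliberately does not extend UDF so that it compiles on its own.

```java
// Hypothetical stand-alone copy of the AddUdf logic, without the Hive
// dependency, for checking the overloads and the null handling locally.
class AddLogicSketch {

    // Mirrors evaluate(Integer, Integer): null in, null out.
    static Integer add(Integer a, Integer b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    // Mirrors evaluate(Double, Double): null in, null out.
    static Double add(Double a, Double b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(add(1, 2));                // 3
        System.out.println(add(1.5, 2.0));            // 3.5
        System.out.println(add((Integer) null, 2));   // null
    }
}
```

Returning null for null input matters because a column such as scores.math may contain NULL, and unboxing a null Integer would otherwise throw a NullPointerException.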

Tools related to Hive: Sqoop, Azkaban, Flume
