# Installing and Configuring Hive
## Preface

The previous chapters mainly covered the Hadoop ecosystem. In this chapter, we introduce how to install and use Hive.
## What Is Hive?

In my view, Hive is a tool framework that simplifies Hadoop Map/Reduce jobs, making it convenient to aggregate and query table-like files stored in Hadoop. Its successor, Spark SQL, is very similar to it.
## Installation

Installing Hive involves the following steps:
- Download the release archive from http://mirror.bit.edu.cn/apache/hive/hive-2.3.4/ and unpack it locally. For the differences between versions, see the article "Hive 各版本关键新特性(Key New Feature)介绍".
- Configure `hive-env.sh`, chiefly the `HADOOP_HOME` parameter:

```shell
HADOOP_HOME=/Users/Sean/Software/hadoop/hadoop-2.7.5
```
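For reference, `hive-env.sh` is just a shell fragment sourced by the launch script; besides `HADOOP_HOME` it can also point Hive at a non-default configuration directory. A minimal sketch — the paths are the ones used in this walkthrough, and the `HIVE_CONF_DIR` line is optional:

```shell
# hive-env.sh -- sourced by the hive launch script at startup.
# Hadoop installation that Hive should submit jobs to:
HADOOP_HOME=/Users/Sean/Software/hadoop/hadoop-2.7.5
# Optional: directory holding hive-site.xml (defaults to $HIVE_HOME/conf):
export HIVE_CONF_DIR=/Users/Sean/Software/Hive/apache-hive-2.3.4-bin/conf
```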
- Configure `hive-site.xml` with where Hive stores its table metadata. (Hive ships with an embedded Derby metastore, but it does not work well in a cluster environment, and its data is lost once you change directories, so in practice MySQL is the usual choice.)
```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>admin</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>
```
- Copy the MySQL JDBC driver jar (mysql-connector-5.1.8.jar) into the `lib` directory. The jar can be downloaded from mvnrepository.
- Start Hive and verify. Enter the `/bin` directory and run a command such as `show tables`.
```
localhost:bin Sean$ ./hive -version
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/Sean/Software/Hive/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/Sean/Software/hadoop/hadoop-2.7.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/Users/Sean/Software/Hive/apache-hive-2.3.4-bin/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
show databases
OK
default
Time taken: 3.882 seconds, Fetched: 1 row(s)
```
## Trying It Out

- Create a database and a table.
```
# create a database
hive> create database flow;
create database flow
OK
Time taken: 0.234 seconds
# switch to the database
hive> use flow;
use flow
OK
Time taken: 0.04 seconds
# create a table
hive> create table flowcount(phone int ,upFlow int,downFlow int) row format delimited fields terminated by ' ';
create table flowcount(phone int ,upFlow int,downFlow int) row format delimited fields terminated by ' '
OK
Time taken: 0.582 seconds
hive> show tables
    > ;
show tables
OK
flowcount
Time taken: 0.067 seconds, Fetched: 1 row(s)
```
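Since the table is declared with `row format delimited fields terminated by ' '`, every line of the data file is split on single spaces into the three declared columns. Just to illustrate that split outside Hive, here is a quick sketch using `cut` (which, like the declared delimiter, splits on each single space):

```shell
# Simulate Hive's "fields terminated by ' '" row format on one sample line.
line="1836294111 10 20"
phone=$(echo "$line" | cut -d' ' -f1)
up=$(echo "$line" | cut -d' ' -f2)
down=$(echo "$line" | cut -d' ' -f3)
echo "phone=$phone upFlow=$up downFlow=$down"
# -> phone=1836294111 upFlow=10 downFlow=20
```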
Upload the data file to HDFS, into Hive's storage directory (``):

```shell
hadoop fs -put flowdata /user/hive/warehouse/flow.db/flowcount
```
We can use the `load` command to append data to the table's files, or upload and overwrite files directly with `hadoop fs -put`.
When no database is selected, tables are stored under /user/hive/warehouse/; inside a database, a table's storage path is /user/hive/warehouse/<databasename>.db/<tablename> (for example, /user/hive/warehouse/flow.db/flowcount/flowdata).
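The warehouse path described above can be composed mechanically; a small sketch, assuming the default warehouse root (the default value of Hive's `hive.metastore.warehouse.dir` property):

```shell
# Compose the HDFS location for a table that lives inside a database.
warehouse=/user/hive/warehouse   # default hive.metastore.warehouse.dir
db=flow
table=flowcount
echo "$warehouse/$db.db/$table"
# -> /user/hive/warehouse/flow.db/flowcount
```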
The data file used here:

```
1836294111 10 20
1836294112 1 2
1836294113 2 3
1836294114 3 4
1836294115 2 1
1836294116 1 2
```
- Query

```
# query all rows
hive> select * from flowcount;
select * from flowcount
OK
1836294111 10 20
1836294112 1 NULL
1836294113 2 3
1836294114 3 4
1836294115 2 1
1836294116 1 2
Time taken: 0.143 seconds, Fetched: 6 row(s)
# conditional query (this also does not launch a map/reduce job on YARN)
hive> select * from flowcount where upFlow>3;
select * from flowcount where upFlow>3
OK
1836294111 10 20
Time taken: 0.447 seconds, Fetched: 1 row(s)
```
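Note the NULL in the second row of the `select *` output: when a line does not split cleanly on the declared delimiter, Hive fills the unparsable column with NULL instead of failing. A plausible cause here (an assumption about this particular data file) is a doubled space, which produces an empty field; a sketch with `cut`:

```shell
# With "fields terminated by ' '", a doubled space yields an empty field:
# field 3 becomes "" instead of "2", which Hive surfaces as NULL.
bad_line="1836294112 1  2"   # note the two spaces before the 2
echo "$bad_line" | cut -d' ' -f3
# -> (empty line)
```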
- Sort. Unlike the simple queries above, an `order by` does launch a map/reduce job:

```
hive> select phone,upFlow from flowcount order by upFlow;
select phone,upFlow from flowcount order by upFlow
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = Sean_20190404165309_cad5f7f9-e8a6-4cef-ac3a-4596051eb5ab
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1553933297569_0004, Tracking URL = http://localhost:8088/proxy/application_1553933297569_0004/
Kill Command = /Users/Sean/Software/hadoop/hadoop-2.7.5/bin/hadoop job -kill job_1553933297569_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-04-04 16:53:19,149 Stage-1 map = 0%, reduce = 0%
2019-04-04 16:53:24,662 Stage-1 map = 100%, reduce = 0%
2019-04-04 16:53:32,047 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1553933297569_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 7209 HDFS Write: 238 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1836294116 1
1836294112 1
1836294115 2
1836294113 2
1836294114 3
1836294111 10
Time taken: 23.283 seconds, Fetched: 6 row(s)
```
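Outside Hive, the same global ordering can be sketched with a plain numeric sort on the second column, which is effectively what the single reducer computes (tie order among equal keys may differ from Hive's output):

```shell
# Numeric sort on the upFlow column, mimicking "order by upFlow".
printf '1836294111 10 20\n1836294112 1 2\n1836294113 2 3\n1836294114 3 4\n1836294115 2 1\n1836294116 1 2\n' \
  | sort -t' ' -k2,2n \
  | awk '{print $1, $2}'
# upFlow comes out ascending: 1, 1, 2, 2, 3, 10
```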
In essence, this sort is the same as the WritableComparable we implemented by hand earlier; Hive's execution engine generates that code for us automatically.
## Tips

Exception 1:

```
hive> show databases;
show databases
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
```

Fix: initialize Hive's metastore, using MySQL as the metadata database:

```shell
schematool -dbType mysql -initSchema
```
Comparison with a traditional database:

| | Hive | RDBMS |
| --- | --- | --- |
| Query language | HQL | SQL |
| Storage | HDFS | raw device or local FS |
| Execution | MapReduce | Executor |
| Latency | high | low |
| Data scale | large | small |
| Indexing | bitmap indexes since 0.8 | complex indexes |