# Installing and Configuring Hive
## Preface

The previous chapters mainly covered the Hadoop ecosystem. In this chapter, we introduce how to install and use Hive.
## What Is Hive?

In my view, Hive is a tool framework that simplifies Hadoop Map/Reduce jobs, making it convenient to aggregate and query table-like files stored in Hadoop. Its successor, Spark SQL, is very similar to it.
## Installation

Installing Hive involves the following steps:
- Download the release archive from http://mirror.bit.edu.cn/apache/hive/hive-2.3.4/ and unpack it locally. For the differences between versions, see the article "Hive 各版本关键新特性(Key New Feature)介绍".
- Configure `hive-env.sh`, chiefly the `HADOOP_HOME` parameter:

```shell
HADOOP_HOME=/Users/Sean/Software/hadoop/hadoop-2.7.5
```
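For reference, `hive-env.sh` is just a shell fragment sourced by the launch script; besides `HADOOP_HOME` it can also point Hive at a non-default configuration directory. A minimal sketch — the paths are the ones used in this walkthrough, and the `HIVE_CONF_DIR` line is optional:

```shell
# hive-env.sh -- sourced by the hive launch script at startup.
# Hadoop installation that Hive should submit jobs to:
HADOOP_HOME=/Users/Sean/Software/hadoop/hadoop-2.7.5
# Optional: directory holding hive-site.xml (defaults to $HIVE_HOME/conf):
export HIVE_CONF_DIR=/Users/Sean/Software/Hive/apache-hive-2.3.4-bin/conf
```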
- Configure `hive-site.xml` with where Hive stores its table metadata. (Hive ships with an embedded Derby metastore, but it does not work well in a cluster environment, and its data is lost once you change directories, so in practice MySQL is the usual choice.)
```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>admin</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>
```
- Copy the MySQL JDBC driver jar (mysql-connector-5.1.8.jar) into the `lib` directory. The jar can be downloaded from mvnrepository.
- Start Hive and verify. Enter the `/bin` directory and run a command such as `show tables`.
```
localhost:bin Sean$ ./hive -version
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/Sean/Software/Hive/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/Sean/Software/hadoop/hadoop-2.7.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/Users/Sean/Software/Hive/apache-hive-2.3.4-bin/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
show databases
OK
default
Time taken: 3.882 seconds, Fetched: 1 row(s)
```
## Trying It Out

- Create a database and a table.
```
# create a database
hive> create database flow;
create database flow
OK
Time taken: 0.234 seconds
# switch to the database
hive> use flow;
use flow
OK
Time taken: 0.04 seconds
# create a table
hive> create table flowcount(phone int ,upFlow int,downFlow int) row format delimited fields terminated by ' ';
create table flowcount(phone int ,upFlow int,downFlow int) row format delimited fields terminated by ' '
OK
Time taken: 0.582 seconds
hive> show tables
    > ;
show tables
OK
flowcount
Time taken: 0.067 seconds, Fetched: 1 row(s)
```
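Since the table is declared with `row format delimited fields terminated by ' '`, every line of the data file is split on single spaces into the three declared columns. Just to illustrate that split outside Hive, here is a quick sketch using `cut` (which, like the declared delimiter, splits on each single space):

```shell
# Simulate Hive's "fields terminated by ' '" row format on one sample line.
line="1836294111 10 20"
phone=$(echo "$line" | cut -d' ' -f1)
up=$(echo "$line" | cut -d' ' -f2)
down=$(echo "$line" | cut -d' ' -f3)
echo "phone=$phone upFlow=$up downFlow=$down"
# -> phone=1836294111 upFlow=10 downFlow=20
```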
Upload the data file to HDFS, into Hive's storage directory (``):

```shell
hadoop fs -put flowdata /user/hive/warehouse/flow.db/flowcount
```
We can use the `load` command to append data to the table's files, or upload and overwrite files directly with `hadoop fs -put`.
When no database is selected, tables are stored under /user/hive/warehouse/; inside a database, a table's storage path is /user/hive/warehouse/<databasename>.db/<tablename> (for example, /user/hive/warehouse/flow.db/flowcount/flowdata).
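The warehouse path described above can be composed mechanically; a small sketch, assuming the default warehouse root (the default value of Hive's `hive.metastore.warehouse.dir` property):

```shell
# Compose the HDFS location for a table that lives inside a database.
warehouse=/user/hive/warehouse   # default hive.metastore.warehouse.dir
db=flow
table=flowcount
echo "$warehouse/$db.db/$table"
# -> /user/hive/warehouse/flow.db/flowcount
```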
The data file used here:

```
1836294111 10 20
1836294112 1 2
1836294113 2 3
1836294114 3 4
1836294115 2 1
1836294116 1 2
```
- Query

```
# query all rows
hive> select * from flowcount;
select * from flowcount
OK
1836294111 10 20
1836294112 1 NULL
1836294113 2 3
1836294114 3 4
1836294115 2 1
1836294116 1 2
Time taken: 0.143 seconds, Fetched: 6 row(s)
# conditional query (this also does not launch a map/reduce job on YARN)
hive> select * from flowcount where upFlow>3;
select * from flowcount where upFlow>3
OK
1836294111 10 20
Time taken: 0.447 seconds, Fetched: 1 row(s)
```
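Note the NULL in the second row of the `select *` output: when a line does not split cleanly on the declared delimiter, Hive fills the unparsable column with NULL instead of failing. A plausible cause here (an assumption about this particular data file) is a doubled space, which produces an empty field; a sketch with `cut`:

```shell
# With "fields terminated by ' '", a doubled space yields an empty field:
# field 3 becomes "" instead of "2", which Hive surfaces as NULL.
bad_line="1836294112 1  2"   # note the two spaces before the 2
echo "$bad_line" | cut -d' ' -f3
# -> (empty line)
```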
- Sort. Unlike the simple queries above, an `order by` does launch a map/reduce job:

```
hive> select phone,upFlow from flowcount order by upFlow;
select phone,upFlow from flowcount order by upFlow
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = Sean_20190404165309_cad5f7f9-e8a6-4cef-ac3a-4596051eb5ab
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1553933297569_0004, Tracking URL = http://localhost:8088/proxy/application_1553933297569_0004/
Kill Command = /Users/Sean/Software/hadoop/hadoop-2.7.5/bin/hadoop job -kill job_1553933297569_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-04-04 16:53:19,149 Stage-1 map = 0%, reduce = 0%
2019-04-04 16:53:24,662 Stage-1 map = 100%, reduce = 0%
2019-04-04 16:53:32,047 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1553933297569_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 7209 HDFS Write: 238 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1836294116 1
1836294112 1
1836294115 2
1836294113 2
1836294114 3
1836294111 10
Time taken: 23.283 seconds, Fetched: 6 row(s)
```
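Outside Hive, the same global ordering can be sketched with a plain numeric sort on the second column, which is effectively what the single reducer computes (tie order among equal keys may differ from Hive's output):

```shell
# Numeric sort on the upFlow column, mimicking "order by upFlow".
printf '1836294111 10 20\n1836294112 1 2\n1836294113 2 3\n1836294114 3 4\n1836294115 2 1\n1836294116 1 2\n' \
  | sort -t' ' -k2,2n \
  | awk '{print $1, $2}'
# upFlow comes out ascending: 1, 1, 2, 2, 3, 10
```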
In essence, this sort is the same as the WritableComparable we implemented by hand earlier; Hive's execution engine generates that code for us automatically.
## Tips

Exception 1:

```
hive> show databases;
show databases
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
```

Fix: initialize Hive's metastore, using MySQL as the metadata database:

```shell
schematool -dbType mysql -initSchema
```
Comparison with a traditional database:

| | Hive | RDBMS |
| --- | --- | --- |
| Query language | HQL | SQL |
| Storage | HDFS | raw device or local FS |
| Execution | MapReduce | Executor |
| Latency | high | low |
| Data scale | large | small |
| Indexing | bitmap indexes since 0.8 | complex indexes |