Deploying a Big Data Development Environment on CentOS 6: hadoop2.7, hive1.2, kettle6.0, spark2.3.4 (Part 1)
Textbook 1:
《Hadoop构建数据仓库实践》(Building a Data Warehouse with Hadoop in Practice)
Author: 王雪迎
Price: 89.00 CNY
Publisher: Tsinghua University Press
Publication date: July 1, 2017
Pages: 434
Binding: paperback
ISBN: 9787302469803
Textbook 2:
《Spark大数据分析与实战》(Spark Big Data Analysis in Action, Big Data Technology and Application series)
Price: 49.00 CNY
Publisher: Tsinghua University Press
Edition: 1-1
Publication date: August 26, 2019
Format: 16mo
Author: 黑马程序员 (itheima)
Pages: 228
ISBN: 9787302534327
This cluster's existing setup:
three 32-bit CentOS 6.10 nodes with 32-bit jdk1.8.0_211;
hadoop002 additionally has eclipse, idea, mysql5.6, hive-1.1.0-cdh5.7.0 and scala-2.11.11;
a zookeeper-3.4.5-cdh5.7.0 ensemble;
hadoop-2.6.0-cdh5.7.0 and hbase-1.2.0-cdh5.7.0 deployed in HA mode.
Unless stated otherwise, all steps below are performed after logging in to the hadoop002 node as root.
Part 1: Rebuilding as an ordinary fully distributed Hadoop cluster
Reason for the rebuild: the exercises in 《Hadoop构建数据仓库实践》 require a fully distributed (non-HA) apache hadoop2.7.2 cluster.
Delete the hadoop-2.6.0-cdh5.7.0 installation directory (and only that directory) on every node;
unpack Hadoop2.7.2 to /root/app/hadoop-2.7.2;
change the owner and group of the hadoop-2.7.2 directory to root with chown and chgrp, as shown below.
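For example (recursive form; the exact flags are a sketch, since the original names only the two commands):
# chown -R root /root/app/hadoop-2.7.2
# chgrp -R root /root/app/hadoop-2.7.2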
Create the working directories:
# mkdir /root/app/hadoop-2.7.2/tmp
# mkdir /root/app/hadoop-2.7.2/dataDir
# mkdir /root/app/hadoop-2.7.2/nameDir
Edit hadoop-env.sh:
# gedit /root/app/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/root/app/jdk1.8.0_211
Edit core-site.xml:
# gedit /root/app/hadoop-2.7.2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop002:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/app/hadoop-2.7.2/tmp</value>
  </property>
</configuration>
Edit hdfs-site.xml:
# gedit /root/app/hadoop-2.7.2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/root/app/hadoop-2.7.2/dataDir</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/root/app/hadoop-2.7.2/nameDir</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>hadoop002:50070</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop002:50090</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
Edit mapred-site.xml (Hadoop2.7.2 ships only the template, so create the file from it first):
# cd /root/app/hadoop-2.7.2/etc/hadoop
# cp mapred-site.xml.template mapred-site.xml
# gedit mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Edit yarn-site.xml:
# gedit /root/app/hadoop-2.7.2/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop002</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
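One step left implicit above: for start-all.sh to launch DataNodes and NodeManagers on the worker nodes (as the jps listing below shows it does), etc/hadoop/slaves must name them. A minimal sketch, assuming all three hosts serve as workers, which matches the jps output:
# gedit /root/app/hadoop-2.7.2/etc/hadoop/slaves
hadoop001
hadoop002
hadoop003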
Copy the installation to the other nodes:
# cd /root/app
# scp -r hadoop-2.7.2 hadoop001:/root/app
# scp -r hadoop-2.7.2 hadoop003:/root/app
Format HDFS:
# hdfs namenode -format
The output includes:
…omitted…
20/05/07 16:29:14 INFO common.Storage: Storage directory /root/app/hadoop-2.7.2/nameDir has been successfully formatted.
…omitted…
No exceptions were reported, so the format succeeded.
Start the cluster, then list all JVM processes on every node:
# start-all.sh
# alljps.sh
hadoop001:
1989 DataNode
2214 Jps
2090 NodeManager
hadoop002:
3492 ResourceManager
3046 NameNode
3174 DataNode
3912 Jps
3595 NodeManager
3341 SecondaryNameNode
hadoop003:
1943 DataNode
2168 Jps
2044 NodeManager
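alljps.sh is a custom helper, not a Hadoop script. A minimal sketch of what it might contain, assuming passwordless SSH (already in place from the old cluster) and the jdk path used above:
#!/bin/bash
# alljps.sh: print the jps output of every node, labeled by host
for host in hadoop001 hadoop002 hadoop003; do
  echo "${host}:"
  ssh "$host" '/root/app/jdk1.8.0_211/bin/jps'
done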
Upload a file to the root of hdfs (hadoop dfs is deprecated in Hadoop 2.x; hdfs dfs is the preferred spelling, but both work):
# hadoop dfs -put necklace.txt /
# hadoop dfs -ls /
…omitted…
-rw-r--r-- 2 root supergroup 16761 2020-05-07 16:56 /necklace.txt
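To confirm that dfs.replication=2 took effect for the uploaded file, fsck reports its block and replication status:
# hdfs fsck /necklace.txt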
Part 2: Driving the hadoop and zookeeper clusters from Kettle 6.0
1. Copy and unzip kettle6.0 to /root/app/kettle60:
# unzip pdi-ce-6.0.1.0-386.zip -d /root/app/kettle60
Copy the hadoop cluster's hdfs-site.xml and core-site.xml into Kettle's shim directory:
# cd /root/app/kettle60/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh54
# cp /root/app/hadoop-2.7.2/etc/hadoop/hdfs-site.xml .
# cp /root/app/hadoop-2.7.2/etc/hadoop/core-site.xml .
Answer yes to the overwrite prompts.
2. On hdfs, create the home directory for the user that runs kettle:
# hadoop dfs -mkdir /user
# hadoop dfs -mkdir /user/root
3. Edit config.properties (still in the cdh54 shim directory):
# gedit config.properties
Append one line at the end:
authentication.superuser.provider=NO_AUTH
4. Modify kettle6.0's startup script spoon.sh:
# cd /root/app/kettle60/data-integration
# gedit spoon.sh
1) Insert rm -rf /root/app/kettle60/data-integration/system/karaf/data1 as line 2 of the file
(what the deleted directory is for is left for later study).
2) Find "OPT=" (around line 201) and add the -Dhttps.protocols=TLSv1,TLSv1.1,TLSv1.2 part, shown highlighted in the original post (no line break inside the double quotes, and a space before the hyphen):
OPT="$OPT $PENTAHO_DI_JAVA_OPTIONS -Dhttps.protocols=TLSv1,TLSv1.1,TLSv1.2 -Djava.library.path=$LIBPATH -DKETTLE_HOME=$KETTLE_HOME -DKETTLE_REPOSITORY=$KETTLE_REPOSITORY -DKETTLE_USER=$KETTLE_USER -DKETTLE_PASSWORD=$KETTLE_PASSWORD -DKETTLE_PLUGIN_PACKAGES=$KETTLE_PLUGIN_PACKAGES -DKETTLE_LOG_SIZE_LIMIT=$KETTLE_LOG_SIZE_LIMIT -DKETTLE_JNDI_ROOT=$KETTLE_JNDI_ROOT -Dorg.eclipse.swt.browser.DefaultType=mozilla"
5. Start Kettle:
(you may add Kettle6.0's installation directory to $PATH)
# spoon.sh
A web page pops up; clicking "Tutorials & Videos" at the bottom of the page leads to the learning guides.
Minimize the page to reach the Kettle window.
Click the "主对象树" panel (the "View" tree in English builds):
6. Connecting to apache-hive-1.2.1:
See the Hive1.2.1 deployment notes in Part 5 of this document.
7. Connecting to the fully distributed zookeeper cluster:
1) Start the zookeeper cluster (startzk.sh is another custom helper script):
# startzk.sh
2) In kettle, set the ZooKeeper section's Hostname to hadoop002 and Port to 2181; the connection then succeeds.
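If the connection fails, first confirm the ensemble is actually up; each node can report its role with the standard ZooKeeper script (assuming its bin directory is on $PATH):
# zkServer.sh status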
Part 3: Installing redis2.8.6
Reference: 《NoSQL数据库技术实战》, p. 195.
Install gcc:
# yum install -y gcc
Download and unpack the redis source code:
# cd /root/app
# wget http://download.redis.io/releases/redis-2.8.6.tar.gz
# tar -zxvf redis-2.8.6.tar.gz
# cd redis-2.8.6
Compile redis:
# make
Install the redis binaries into $PATH (make install copies them to /usr/local/bin):
# make install
Edit the configuration file:
# gedit /etc/redis286.conf
daemonize yes
pidfile /root/app/redis-2.8.6/redis286.pid
port 6379
timeout 300
loglevel debug
logfile /root/app/redis-2.8.6/redis286.log
databases 16
save 900 1
save 300 10
save 60 10000
rdbcompression yes
dbfilename redis286dump.rdb
dir /root/app/redis-2.8.6
appendonly no
appendfsync always
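Two notes on the settings above: each save line is an RDB snapshot trigger (save 900 1 means: snapshot if at least 1 key changed within 900 seconds), and because appendonly is no, the appendfsync always line stays dormant unless AOF persistence is enabled later.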
Start redis as a background process:
# redis-server /etc/redis286.conf
Check the redis server process:
# ps -ef |grep redis|grep -v grep
root 6015 1 0 11:54 ? 00:00:00 redis-server *:6379
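A quick smoke test with the bundled client (redis-cli is installed by make install alongside redis-server; the key name here is arbitrary):
# redis-cli ping
PONG
# redis-cli set greeting hello
OK
# redis-cli get greeting
"hello"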
Part 4: Fully distributed deployment of apache-spark2.3.4
Installer used: spark-2.3.4-bin-hadoop2.7.tgz
Deployment steps: see 《Spark大数据分析与实战》, p. 39.
Deployed successfully in the early hours of Thursday, May 14, 2020.
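The book's procedure is not reproduced here; for reference, a standalone deployment on this cluster would typically run along these lines (a sketch following the path conventions above, not the book's exact steps):
# tar -zxvf spark-2.3.4-bin-hadoop2.7.tgz -C /root/app/
# cd /root/app/spark-2.3.4-bin-hadoop2.7/conf
# cp spark-env.sh.template spark-env.sh
# gedit spark-env.sh
export JAVA_HOME=/root/app/jdk1.8.0_211
export SPARK_MASTER_HOST=hadoop002
# cp slaves.template slaves
# gedit slaves
hadoop001
hadoop002
hadoop003
# scp -r /root/app/spark-2.3.4-bin-hadoop2.7 hadoop001:/root/app
# scp -r /root/app/spark-2.3.4-bin-hadoop2.7 hadoop003:/root/app
# /root/app/spark-2.3.4-bin-hadoop2.7/sbin/start-all.sh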
Part 5: Deploying apache-Hive-1.2.1
1. Installing MySql5.6
hadoop002 already has MySQL 5.6.45 Community edition (with mysql-connector-java-5.1.45; the root login password is 123456), so Hive can be installed directly.
(Outline of the original installation:
# wget http://repo.mysql.com/mysql-community-release-el6-5.noarch.rpm
# rpm -ivh mysql-community-release-el6-5.noarch.rpm
# yum install mysql-community-server -y
Its configuration file is /etc/my.cnf:
# ll /etc/my.cnf
-rw-r--r-- 1 root root 1066 6月 10 2019 /etc/my.cnf
Original contents:
# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.6/en/server-configuration-defaults.html
[mysqld]
#
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M
#
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
#
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Recommended in standard MySQL setup
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
)
The hive-1.1.0-cdh5.7.0 previously deployed on hadoop002 used this mysql for its metadata and already created a database named hive, so drop the old database:
# mysql -uroot -p
mysql> drop database hive;
2. Installing Hive1.2
# tar -zxvf apache-hive-1.2.1-bin.tar.gz -C ~/app/
# cd /root/app/apache-hive-1.2.1-bin
# mkdir iotmp
# cd conf
# cp hive-default.xml.template hive-site.xml
# gedit hive-site.xml
…rest unchanged…
Around line 362:
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>password to use against metastore database</description>
</property>
Around line 378 (the & in the URL must be escaped as &amp; to keep the XML well formed):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
…rest unchanged…
Around line 772:
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
…rest unchanged…
Around line 797:
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>Username to use against metastore database</description>
</property>
…rest unchanged…
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
  <description>Whether to include the current database in the Hive prompt.</description>
</property>
…rest unchanged…
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/root/app/apache-hive-1.2.1-bin/iotmp</value>
  <description>Local scratch space for Hive jobs</description>
</property>
…omitted…
<property>
  <name>hive.querylog.location</name>
  <value>/root/app/apache-hive-1.2.1-bin/iotmp</value>
  <description>Location of Hive run time structured log file</description>
</property>
…omitted…
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/root/app/apache-hive-1.2.1-bin/iotmp</value>
  <description>Temporary local directory for added resources in the remote file system.</description>
</property>
…omitted…
The hadoop cluster must be running before hive is started.
Start hive:
# hive
At this point hive has only the default database; on this first start, the local mysql immediately creates the hive database and a number of metastore tables in it.
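This can be checked from the mysql side; the metastore tables (DBS, TBLS, and so on) now exist in the freshly created hive database:
mysql> use hive;
mysql> show tables;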
Driving this hive from kettle6.0:
Additionally configure the Hive Server2 property in hive-site.xml:
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>hadoop002</value>
  <description>Bind host on which to run the HiveServer2 Thrift service.</description>
</property>
Leave the remaining properties at their defaults (the default port is 10000).
Edit plugin.properties:
Edit <kettle60 install dir>/plugins/pentaho-big-data-plugin/plugin.properties
and change only one entry: active.hadoop.configuration=hdp22
Reached this point at 02:18:38 on May 14, 2020, then shut down.
Appendix 1: starting hive as background services:
# hive --service metastore &
# hive --service hiveserver2 &
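With hiveserver2 up, remote clients can connect over JDBC, for instance with the beeline shell that ships with hive (using the bind host and default port configured above):
# beeline -u jdbc:hive2://hadoop002:10000 -n root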
Appendix 2: matching the kettle version to the hadoop platform:
In <kettle install dir>/plugins/pentaho-big-data-plugin/plugin.properties,
set active.hadoop.configuration as follows:
kettle 6.0 targets Hortonworks 2.2: write hdp22
kettle 6.1 targets Hortonworks 2.3: write hdp23
kettle 7.0 targets Hortonworks 2.4: write hdp24
kettle 7.1 targets Hortonworks 2.5: write hdp25