Deploying a Big Data Development Environment on CentOS 6 - hadoop2.7 / hive1.2 / kettle6.0 / spark2.3.4 (Part 1)

Textbook 1:

《Hadoop构建数据仓库实践》

Author: Wang Xueying

List price: ¥89

Publisher: Tsinghua University Press

Publication date: July 1, 2017

Pages: 434

Binding: paperback

ISBN:9787302469803

Textbook 2:

《Spark大数据分析与实战(大数据技术与应用丛书)》

List price: ¥49.00

Publisher: Tsinghua University Press

Edition: 1-1

Publication date: August 26, 2019

Format: 16开

Author: 黑马程序员

Pages: 228

ISBN: 9787302534327

 

Original configuration of this cluster:

Three 32-bit CentOS 6.10 machines with 32-bit jdk1.8.0_211;

hadoop002 also has: Eclipse, IDEA, MySQL 5.6, hive-1.1.0-cdh5.7.0, scala-2.11.11;

a zookeeper-3.4.5-cdh5.7.0 cluster;

hadoop-2.6.0-cdh5.7.0 and hbase-1.2.0-cdh5.7.0 deployed in HA mode.

Unless otherwise noted, every step below is performed on node hadoop002 after logging in as root.

I. Converting to an ordinary fully distributed Hadoop cluster:

Reason: the exercises in 《Hadoop构建数据仓库实践》 require a fully distributed (non-HA) Apache Hadoop 2.7.2 cluster.

Delete, and only delete, the hadoop-2.6.0-cdh5.7.0 installation directory on every node;

Extract Hadoop 2.7.2 to /root/app/hadoop-2.7.2;

Change the owner and group of the hadoop-2.7.2 directory to root with the chown and chgrp commands;
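For example, a single recursive chown sets both at once (equivalent to running chown and chgrp separately):

# chown -R root:root /root/app/hadoop-2.7.2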

Create directories:

# mkdir  /root/app/hadoop-2.7.2/tmp

# mkdir /root/app/hadoop-2.7.2/dataDir

# mkdir /root/app/hadoop-2.7.2/nameDir

Edit hadoop-env.sh:

#gedit /root/app/hadoop-2.7.2/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/root/app/jdk1.8.0_211

Edit core-site.xml:

#gedit /root/app/hadoop-2.7.2/etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

        <name>fs.defaultFS</name>

        <value>hdfs://hadoop002:9000</value>

  </property>

  <property>

        <name>hadoop.tmp.dir</name>

        <value>/root/app/hadoop-2.7.2/tmp</value>

  </property>

</configuration>

 

Edit hdfs-site.xml:

#gedit /root/app/hadoop-2.7.2/etc/hadoop/hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

    <name>dfs.datanode.data.dir</name>

    <value>/root/app/hadoop-2.7.2/dataDir</value>

  </property>

  <property>

    <name>dfs.namenode.name.dir</name>

    <value>/root/app/hadoop-2.7.2/nameDir</value>

  </property>

  <property>

    <name>dfs.namenode.http-address</name>

    <value>hadoop002:50070</value>

  </property>

  <property>

    <name>dfs.namenode.secondary.http-address</name>

    <value>hadoop002:50090</value>

  </property>

  <property>

    <name>dfs.replication</name>

    <value>2</value>

  </property>

</configuration>

 

Edit mapred-site.xml:
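In a stock Hadoop 2.7.2 distribution this file does not exist yet; it is normally created from its template first, for example:

# cd /root/app/hadoop-2.7.2/etc/hadoop
# cp mapred-site.xml.template mapred-site.xml
# gedit mapred-site.xml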

<?xml version="1.0"?>

<configuration>

  <property>

    <name>mapreduce.framework.name</name> 

    <value>yarn</value>

  </property> 

</configuration>

 

Edit yarn-site.xml:

<?xml version="1.0"?>

<configuration>

  <property>

    <name>yarn.resourcemanager.hostname</name> 

    <value>hadoop002</value>

  </property> 

  <property>

    <name>yarn.nodemanager.aux-services</name> 

    <value>mapreduce_shuffle</value>

  </property> 

  <property>

    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name> 

    <value>org.apache.hadoop.mapred.ShuffleHandler</value>

  </property>

</configuration>
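start-all.sh (used below) starts DataNodes and NodeManagers on all three hosts, which requires them to be listed in the slaves file; if that file has not been edited yet, a minimal sketch:

# gedit /root/app/hadoop-2.7.2/etc/hadoop/slaves
hadoop001
hadoop002
hadoop003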

Copy to the other nodes:

# cd  /root/app

# scp -r hadoop-2.7.2 hadoop001:/root/app

# scp -r hadoop-2.7.2 hadoop003:/root/app

Format HDFS:

# hdfs namenode -format

Output:

…omitted…

20/05/07 16:29:14 INFO common.Storage: Storage directory /root/app/hadoop-2.7.2/nameDir has been successfully formatted.

…omitted…

No exceptions appear in the output, so the format succeeded.

Start the cluster, then list all JVM processes on every node:

# start-all.sh

# alljps.sh

hadoop001:

1989 DataNode

2214 Jps

2090 NodeManager

hadoop002:

3492 ResourceManager

3046 NameNode

3174 DataNode

3912 Jps

3595 NodeManager

3341 SecondaryNameNode

hadoop003:

1943 DataNode

2168 Jps

2044 NodeManager
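alljps.sh is a custom helper script rather than part of Hadoop; a minimal sketch of what it presumably contains (its location on disk is not given in this document):

#!/bin/bash
# Run jps on every node of the cluster, labelling the output with the host name.
for host in hadoop001 hadoop002 hadoop003; do
  echo "${host}:"
  ssh "$host" jps
done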

Upload a file to the HDFS root directory:

# hadoop dfs -put necklace.txt /

# hadoop dfs -ls /

…omitted…

-rw-r--r--   2 root supergroup      16761 2020-05-07 16:56 /necklace.txt
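The upload can be spot-checked by reading the file back (hdfs dfs is the current, non-deprecated form of hadoop dfs):

# hdfs dfs -cat /necklace.txt | head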

II. Using Kettle 6.0 with the Hadoop cluster and the ZooKeeper cluster:

1. Copy and extract Kettle 6.0 to /root/app/kettle60:

# unzip pdi-ce-6.0.1.0-386.zip -d /root/app/kettle60

Copy the Hadoop cluster's hdfs-site.xml and core-site.xml into Kettle's Hadoop-configuration (shim) directory:

#cd /root/app/kettle60/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh54

# cp /root/app/hadoop-2.7.2/etc/hadoop/hdfs-site.xml .

# cp  /root/app/hadoop-2.7.2/etc/hadoop/core-site.xml .

Answer yes when prompted to overwrite the existing files.

2. On HDFS, create the home directory for the user currently running Kettle:

# hadoop dfs -mkdir /user

# hadoop dfs -mkdir /user/root

3. Edit config.properties (in the cdh54 directory used above):

# gedit config.properties

Append one line at the end:

authentication.superuser.provider=NO_AUTH

 

4. Modify spoon.sh, Kettle 6.0's startup script:

# cd /root/app/kettle60/data-integration

# gedit spoon.sh

1) Insert the line rm -rf /root/app/kettle60/data-integration/system/karaf/data1 as the second line of the file.

(The purpose of the directory being deleted still needs to be investigated later.)

2) Find the "OPT=" line (around line 201) and add the extra options (shown highlighted in red in the original post); do not break the line inside the double quotes, and keep a space before each hyphen:

OPT="$OPT $PENTAHO_DI_JAVA_OPTIONS -Dhttps.protocols=TLSv1,TLSv1.1,TLSv1.2 -Djava.library.path=$LIBPATH -DKETTLE_HOME=$KETTLE_HOME -DKETTLE_REPOSITORY=$KETTLE_REPOSITORY -DKETTLE_USER=$KETTLE_USER -DKETTLE_PASSWORD=$KETTLE_PASSWORD -DKETTLE_PLUGIN_PACKAGES=$KETTLE_PLUGIN_PACKAGES -DKETTLE_LOG_SIZE_LIMIT=$KETTLE_LOG_SIZE_LIMIT -DKETTLE_JNDI_ROOT=$KETTLE_JNDI_ROOT -Dorg.eclipse.swt.browser.DefaultType=mozilla"

5. Start Kettle:

(The Kettle 6.0 installation directory can be added to $PATH first.)

# spoon.sh

A web page pops up; clicking "Tutorials & Videos" at the bottom of the page shows the learning guides.

Minimize the web page and the Kettle window can be used.

Click "主对象树" (Main object tree):

(screenshot omitted)

6. Connecting to apache-hive-1.2.1

See the Hive 1.2.1 deployment section of this document.

7. Connecting to the fully distributed ZooKeeper cluster

1) Start the ZooKeeper cluster:

#startzk.sh
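startzk.sh is a custom helper script rather than part of ZooKeeper; a minimal sketch of what it presumably does (the ZooKeeper installation path is an assumption based on this cluster's layout):

#!/bin/bash
# Start the ZooKeeper server on every node of the ensemble.
for host in hadoop001 hadoop002 hadoop003; do
  ssh "$host" "/root/app/zookeeper-3.4.5-cdh5.7.0/bin/zkServer.sh start"
done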

2) In Kettle, set Hostname to hadoop002 and Port to 2181 in the ZooKeeper section; the connection then succeeds.

 

 

III. Installing Redis 2.8.6

Reference: 《NoSQL数据库技术实战》, p. 195.

Install gcc:

# yum install -y gcc

Download and extract the Redis source code:

# cd /root/app

# wget http://download.redis.io/releases/redis-2.8.6.tar.gz

# tar -zxvf redis-2.8.6.tar.gz

# cd redis-2.8.6

Compile Redis:

# make

Copy the Redis executables into a directory on $PATH:

# make install

Edit the configuration file:

# gedit /etc/redis286.conf

daemonize yes

pidfile /root/app/redis-2.8.6/redis286.pid

port 6379

timeout 300

loglevel debug

logfile /root/app/redis-2.8.6/redis286.log

databases 16

save 900 1

save 300 10

save 60 10000

rdbcompression yes

dbfilename redis286dump.rdb

dir /root/app/redis-2.8.6

appendonly no

appendfsync always

Start Redis as a background (daemon) process:

# redis-server /etc/redis286.conf

Check the Redis server process:

# ps -ef |grep redis|grep -v grep

root      6015     1  0 11:54 ?        00:00:00 redis-server *:6379
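A quick way to confirm the server is answering (redis-cli was installed by make install; the key name is only for illustration):

# redis-cli ping
PONG
# redis-cli set demo_key hello
OK
# redis-cli get demo_key
"hello"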

 

IV. Fully distributed deployment of Apache Spark 2.3.4

Installation package used: spark-2.3.4-bin-hadoop2.7.tgz

Deployment steps: see 《Spark大数据分析与实战(大数据技术与应用丛书)》, p. 39.
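The book's steps are not reproduced here; a minimal sketch of a typical standalone deployment on these three hosts, assuming paths that follow this document's conventions:

# tar -zxvf spark-2.3.4-bin-hadoop2.7.tgz -C /root/app
# cd /root/app/spark-2.3.4-bin-hadoop2.7/conf
# cp spark-env.sh.template spark-env.sh
(in spark-env.sh set JAVA_HOME=/root/app/jdk1.8.0_211 and SPARK_MASTER_HOST=hadoop002)
# cp slaves.template slaves
(in slaves list hadoop001, hadoop002 and hadoop003, one per line)
# scp -r /root/app/spark-2.3.4-bin-hadoop2.7 hadoop001:/root/app
# scp -r /root/app/spark-2.3.4-bin-hadoop2.7 hadoop003:/root/app
# /root/app/spark-2.3.4-bin-hadoop2.7/sbin/start-all.sh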

Deployed successfully in the early hours of Thursday, May 14, 2020.

(screenshot omitted)

V. Deploying apache-hive-1.2.1

1. Installing MySQL 5.6

MySQL 5.6.45 Community edition is already installed on hadoop002 (together with mysql-connector-java-5.1.45; the root login password is 123456), so Hive can be installed directly.

(Outline of the original MySQL installation:

# wget http://repo.mysql.com/mysql-community-release-el6-5.noarch.rpm

# rpm -ivh mysql-community-release-el6-5.noarch.rpm

# yum install mysql-community-server -y

The MySQL configuration file is /etc/my.cnf.

# ll /etc/my.cnf

-rw-r--r-- 1 root root 1066 Jun 10  2019 /etc/my.cnf

Original contents:

# For advice on how to change settings please see

# http://dev.mysql.com/doc/refman/5.6/en/server-configuration-defaults.html

 

[mysqld]

#

# Remove leading # and set to the amount of RAM for the most important data

# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.

# innodb_buffer_pool_size = 128M

#

# Remove leading # to turn on a very important data integrity option: logging

# changes to the binary log between backups.

# log_bin

#

# Remove leading # to set options mainly useful for reporting servers.

# The server defaults are faster for transactions and fast SELECTs.

# Adjust sizes as needed, experiment to find the optimal values.

# join_buffer_size = 128M

# sort_buffer_size = 2M

# read_rnd_buffer_size = 2M

datadir=/var/lib/mysql

socket=/var/lib/mysql/mysql.sock

 

# Disabling symbolic-links is recommended to prevent assorted security risks

symbolic-links=0

 

# Recommended in standard MySQL setup

sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

 

[mysqld_safe]

log-error=/var/log/mysqld.log

pid-file=/var/run/mysqld/mysqld.pid

)

Because the hive-1.1.0-cdh5.7.0 previously deployed on hadoop002 stores its metadata in this MySQL instance, a database named hive already exists; drop the old hive database:

# mysql -uroot -p

mysql> drop database hive;

2. Installing Hive 1.2

# tar -zxvf hive-1.2.1-bin.tar.gz -C ~/app/

#cd  /root/app/apache-hive-1.2.1-bin

# mkdir iotmp

# cd conf

# cp hive-default.xml.template hive-site.xml

# gedit hive-site.xml

…everything else unchanged…

Around line 362:

<property>

    <name>javax.jdo.option.ConnectionPassword</name>

    <value>123456</value>

    <description>password to use against metastore database</description>

  </property>

Around line 378:

<property>

    <name>javax.jdo.option.ConnectionURL</name>

    <value>jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>

    <description>JDBC connect string for a JDBC metastore</description>

  </property>

…everything else unchanged…

Around line 772:

<property>

    <name>javax.jdo.option.ConnectionDriverName</name>

    <value>com.mysql.jdbc.Driver</value>

    <description>Driver class name for a JDBC metastore</description>

  </property>

…everything else unchanged…

Around line 797:

<property>

    <name>javax.jdo.option.ConnectionUserName</name>

    <value>root</value>

    <description>Username to use against metastore database</description>

  </property>

…everything else unchanged…

<property>

    <name>hive.cli.print.current.db</name>

    <value>true</value>

    <description>Whether to include the current database in the Hive prompt.</description>

</property>

…everything else unchanged…

<property>

  <name>hive.exec.local.scratchdir</name>

    <value>/root/app/apache-hive-1.2.1-bin/iotmp</value>

    <description>Local scratch space for Hive jobs</description>

</property>

…omitted…

<property>

   <name>hive.querylog.location</name>

    <value>/root/app/apache-hive-1.2.1-bin/iotmp</value>

    <description>Location of Hive run time structured log file</description>

 </property>

…omitted…

 <property>

   <name>hive.downloaded.resources.dir</name>

    <value>/root/app/apache-hive-1.2.1-bin/iotmp</value>

    <description>Temporary local directory for added resources in the remote file system.</description>

 </property>

…omitted…
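Hive also needs the MySQL JDBC driver on its classpath to reach the metastore; if the mysql-connector-java-5.1.45 jar mentioned earlier is not already in Hive's lib directory, copy it there (the source path of the jar below is only an assumption):

# cp /root/app/mysql-connector-java-5.1.45.jar /root/app/apache-hive-1.2.1-bin/lib/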

The Hadoop cluster must be running before Hive is started.

Start Hive:

#hive

At this point Hive has only the default database; the local MySQL immediately creates a hive database and a number of metastore tables in it.
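This can be spot-checked from the MySQL side (just a quick sanity check; the exact table list depends on the Hive version):

# mysql -uroot -p -e 'show tables in hive;' | head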

Using Kettle 6.0 with this Hive:

Further configure the HiveServer2 property in hive-site.xml:

<property>

    <name>hive.server2.thrift.bind.host</name>

    <value>hadoop002</value>

    <description>Bind host on which to run the HiveServer2 Thrift service.</description>

  </property>

Leave the other properties at their defaults (the default port is 10000).
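Once HiveServer2 is running (see Appendix 1 at the end of this document), the connection can be smoke-tested from the command line with beeline before trying it from Kettle:

# beeline -u jdbc:hive2://hadoop002:10000 -n root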

Edit plugin.properties:

Edit plugins/pentaho-big-data-plugin/plugin.properties under the Kettle 6.0 installation directory.

Change only one setting: active.hadoop.configuration=hdp22

(screenshot omitted)

Reached this step at 02:18:38 on May 14, 2020, then shut down the machine.

Appendix 1: Starting Hive as background services:

# hive --service metastore &

# hive --service hiveserver2 &
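To confirm that both services came up, check that the metastore port (9083 by default) and the HiveServer2 port (10000 by default) are listening:

# netstat -nltp | grep -E '9083|10000'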

 

Appendix 2: Matching Kettle versions to Hadoop platforms:

In plugins/pentaho-big-data-plugin/plugin.properties under the Kettle installation directory,

set active.hadoop.configuration as follows:

Kettle 6.0 corresponds to Hortonworks 2.2 - set it to hdp22

Kettle 6.1 corresponds to Hortonworks 2.3 - set it to hdp23

Kettle 7.0 corresponds to Hortonworks 2.4 - set it to hdp24

Kettle 7.1 corresponds to Hortonworks 2.5 - set it to hdp25