Installing Hadoop 3.0 and Studying Its Performance
Installing Hadoop 3.0:
Environment: Ubuntu 14.04, 64-bit
1. adduser advhadoop — add the user (and its group)
2. Grant the hadoop user sudo privileges:
sudo gedit /etc/sudoers
3. Install ssh:
sudo apt-get install openssh-server
After installation, start the ssh server:
sudo /etc/init.d/ssh start
Check whether the ssh service is running:
ps -e | grep ssh
If sshd appears in the output, the service started successfully.
4. Set up passwordless login: generate a private/public key pair
ssh-keygen -t rsa -P ""
Append the public key to authorized_keys, which stores the public keys of every user allowed to log in over ssh as the current user:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
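The key generation and authorization above can be done in one short script (a sketch; it regenerates ~/.ssh/id_rsa with no passphrase, so skip the rm line if you already have a key you want to keep):

```shell
# Passwordless-ssh setup sketch: regenerate an RSA key pair with an empty
# passphrase and authorize it for logins to this machine.
mkdir -p ~/.ssh
rm -f ~/.ssh/id_rsa ~/.ssh/id_rsa.pub      # assumption: no existing key worth keeping
ssh-keygen -q -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
```

The chmod matches step 11 below: sshd refuses to honor an authorized_keys file that is writable by others.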
5. Log in over ssh:
ssh localhost
Log out:
exit
6. Install JDK 1.8
Download jdk-8u131-linux-x64.tar.gz, extract it to /usr/lib/jvm/jdk1.8.0_131, and set JAVA_HOME and PATH in ~/.bashrc.
update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.8.0_131/bin/java 300
update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.8.0_131/bin/javac 300
update-alternatives --config java
update-alternatives --config javac
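The JAVA_HOME/PATH settings mentioned above would look like this in ~/.bashrc (a sketch; the paths assume the extraction directory used in this note):

```shell
# Append to ~/.bashrc, then apply with: source ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_131
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:${PATH}
```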
7. Download hadoop-3.0.0-alpha1.tar.gz and extract it to /usr/local/advhadoop.
8. chmod +x etc/hadoop/hadoop-env.sh
. etc/hadoop/hadoop-env.sh (source it; running it as ./hadoop-env.sh only sets the variables in a subshell)
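The main thing hadoop-env.sh needs is an explicit JAVA_HOME, since the Hadoop daemons do not always inherit it from the login shell. A minimal fragment (assuming the JDK location from step 6):

```shell
# etc/hadoop/hadoop-env.sh — point Hadoop at the JDK installed in step 6
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_131
```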
9. Running bin/hadoop with no arguments displays the usage documentation for the hadoop script.
10. Pseudo-distributed configuration
Use the following etc/hadoop/core-site.xml:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
11. chmod 0600 ~/.ssh/authorized_keys
12. Start Hadoop
Format a new distributed filesystem: bin/hdfs namenode -format
Start the NameNode and DataNode daemons: sbin/start-dfs.sh
The Hadoop daemons write their logs to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
Browse the NameNode web interface, which by default is at:
NameNode — http://localhost:9870/
13. Run the grep example from the MapReduce examples jar
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hduser
$ bin/hdfs dfs -mkdir /user/hduser/input
$ bin/hdfs dfs -put etc/hadoop/*.xml /user/hduser/input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar grep /user/hduser/input output 'dfs[a-z.]+'
$ bin/hdfs dfs -get output output
$ cat output/*
$ bin/hdfs dfs -cat output/*
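For intuition, what the grep example job computes can be approximated locally with plain shell tools (a tiny sample file stands in for the HDFS input here):

```shell
# Approximate the MapReduce grep example: extract every match of 'dfs[a-z.]+'
# and count occurrences per distinct match, sorted by count.
printf '<name>dfs.replication</name>\n<value>dfsadmin</value>\n' > /tmp/grep-sample.xml
grep -ohE 'dfs[a-z.]+' /tmp/grep-sample.xml | sort | uniq -c | sort -rn
```

This prints a count of 1 for each of dfsadmin and dfs.replication, mirroring the job output shown in the transcript below.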
14. Stop Hadoop: sbin/stop-dfs.sh
References:
Hadoop 3.0 installation and configuration: http://blog.****.net/sum__mer/article/details/52472420
Installing Hadoop 3.0-alpha on Ubuntu 14.04: http://www.2cto.com/kf/201703/613371.html
Terminal output:
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs namenode -format
2017-05-02 00:00:53,219 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: user = advhadoop
STARTUP_MSG: host = happy-Lenovo-IdeaPad-Y480/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.0.0-alpha1
…...
2017-05-02 00:01:09,112 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2017-05-02 00:01:09,118 INFO util.ExitUtil: Exiting with status 0
2017-05-02 00:01:09,122 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at happy-Lenovo-IdeaPad-Y480/127.0.1.1
************************************************************/
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Starting datanodes
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Starting secondary namenodes [happy-Lenovo-IdeaPad-Y480]
happy-Lenovo-IdeaPad-Y480: Warning: Permanently added 'happy-lenovo-ideapad-y480' (ECDSA) to the list of known hosts.
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs dfs -mkdir /user
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs dfs -mkdir /user/hduser
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs dfs -mkdir /user/hduser/input
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs dfs -put etc/hadoop/*.xml /user/hduser/input
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar grep /user/hduser/input output 'dfs[a-z.]+'
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hdfs dfs -get output output
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ cat output/*
cat: output/output: Is a directory
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ cat output/output
cat: output/output: Is a directory
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ cat output/output/*
1 dfsadmin
1 dfs.replication
Issues encountered:
1. JDK 1.8 must be installed; otherwise Hadoop fails to start.
2. After re-running the format, the NameNode reports errors. To recover:
stop Hadoop first;
clear the files under the path configured as hadoop.tmp.dir (default: /tmp/hadoop-${user.name});
then run bin/hdfs namenode -format;
finally restart Hadoop.
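The recovery steps above, sketched as a script (assumes the install directory /usr/local/advhadoop and the default hadoop.tmp.dir used in this note; adjust both if configured differently):

```shell
# Recover a NameNode that errors out after re-formatting
cd /usr/local/advhadoop
sbin/stop-dfs.sh                  # 1. stop Hadoop first
rm -rf /tmp/hadoop-"$(whoami)"    # 2. clear hadoop.tmp.dir (default /tmp/hadoop-${user.name})
bin/hdfs namenode -format         # 3. re-format the filesystem
sbin/start-dfs.sh                 # 4. restart Hadoop
```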
Hadoop 3.0 performance notes
New features:
1. The minimum required Java version is now Java 8; anyone on Java 7 or earlier must upgrade.
2. HDFS supports erasure coding (EC). EC still protects against data loss, but avoids the multiplied storage cost of HDFS replication. Its drawbacks: recovering data incurs extra network traffic, because the parity blocks must be read in addition to the surviving data blocks; and storing or recovering files requires encoding/decoding, which costs CPU. EC is therefore recommended for cold data: cold data is large in volume, so cutting replicas saves substantial space, and it is rarely touched, so the occasional recovery has little impact on the business.
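The space saving is easy to quantify. With an RS(6,3) policy (6 data blocks plus 3 parity blocks, tolerating 3 lost blocks; treat the policy choice as an illustrative assumption), the overhead drops from 200% to 50%:

```shell
# Storage cost of 100 GB of raw data under both schemes
data_gb=100
replicated_gb=$((data_gb * 3))       # 3-way replication keeps 3 full copies
ec_gb=$((data_gb * (6 + 3) / 6))     # RS(6,3): 9 blocks stored per 6 data blocks
echo "replication: ${replicated_gb} GB, erasure coding: ${ec_gb} GB"
```

This prints replication: 300 GB, erasure coding: 150 GB — half the footprint, while still surviving three simultaneous block failures.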
Changes in Hadoop Common:
The core has been slimmed down: deprecated APIs and implementations were removed, and hftp was dropped in favor of webhdfs.
Classpath isolation prevents conflicts between different versions of the same jar; for example, Google Guava conflicts arise when mixing Hadoop, HBase, and Spark. MapReduce has a switch to ignore the jars in the Hadoop environment in favor of user-submitted third-party jars, but that does not help Spark jobs, which previously required shading the needed third-party classes into your own jar or upgrading the whole Spark environment. With classpath isolation, users can conveniently pick their own third-party dependencies. See HADOOP-11656.
The Hadoop shell scripts were rewritten: many bugs were fixed, new features were added, and dynamic subcommands are supported.
The Hadoop NameNode now supports one active plus multiple standby nodes (in hadoop-2.x the ResourceManager already supported this).
MapReduce task-level native optimization: MapReduce gained a native implementation of the map output collector, which can improve execution efficiency by about 30% for shuffle-intensive jobs.
Automatic inference of memory parameters. In Hadoop 2.0, setting memory parameters for a MapReduce job was tedious and involved two settings: mapreduce.{map,reduce}.memory.mb and mapreduce.{map,reduce}.java.opts. Inconsistent values waste memory badly; for example, setting the former to 4096 MB while the latter is "-Xmx2g" leaves 2 GB that the Java heap can never use.
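For comparison, a consistent pair of settings under the Hadoop 2.x convention (a sketch: MyJob.jar and MyJob are hypothetical names, and the -D generic options assume the job uses ToolRunner; the usual rule of thumb is a heap around 75–80% of the container size):

```shell
# Consistent map-task memory settings: a 4096 MB container with a ~3276 MB heap
# (~80% of the container), leaving headroom for JVM and off-heap overhead.
hadoop jar MyJob.jar MyJob \
    -D mapreduce.map.memory.mb=4096 \
    -D mapreduce.map.java.opts=-Xmx3276m \
    input output
```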
YARN changes: cgroups gained memory and disk-I/O isolation; plus Timeline Service v2, YARN container resizing, and more.
Benchmarks:
1.TestDFSIO
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha1-tests.jar
An example program must be given as the first argument.
Valid program names are:
DFSCIOTest: Distributed i/o benchmark of libhdfs.
DistributedFSCheck: Distributed checkup of the file system consistency.
JHLogAnalyzer: Job History Log analyzer.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
NNdataGenerator: Generate the data to be used by NNloadGenerator
NNloadGenerator: Generate load on Namenode using NN loadgenerator run WITHOUT MR
NNloadGeneratorMR: Generate load on Namenode using NN loadgenerator run as MR job
NNstructureGenerator: Generate the structure to be used by NNdataGenerator
SliveTest: HDFS Stress Test and Live Data Verification.
TestDFSIO: Distributed i/o benchmark.
fail: a job that always fails
filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
largesorter: Large-Sort tester
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
minicluster: Single process HDFS and MR cluster.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode w/ MR.
nnbenchWithoutMR: A benchmark that stresses the namenode w/o MR.
sleep: A job that sleeps at each map and reduce task.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
timelineperformance: A job that launches mappers to test timline service performance.
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha1-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 10MB
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha1-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 10MB
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ cat TestDFSIO_results.log
----- TestDFSIO ----- : write
Date & time: Tue May 02 11:50:08 CST 2017
Number of files: 10
Total MBytes processed: 100
Throughput mb/sec: 103.84
Total Throughput mb/sec: 0.03
Average IO rate mb/sec: 112.82
IO rate std deviation: 26.36
Test exec time sec: 3.97
----- TestDFSIO ----- : read
Date & time: Tue May 02 11:52:09 CST 2017
Number of files: 10
Total MBytes processed: 100
Throughput mb/sec: 546.45
Total Throughput mb/sec: 0.04
Average IO rate mb/sec: 647.18
IO rate std deviation: 257.8
Test exec time sec: 2.82
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha1-tests.jar TestDFSIO -clean
2017-05-02 18:25:01,647 INFO fs.TestDFSIO: TestDFSIO.1.8
2017-05-02 18:25:01,652 INFO fs.TestDFSIO: nrFiles = 1
2017-05-02 18:25:01,652 INFO fs.TestDFSIO: nrBytes (MB) = 1.0
2017-05-02 18:25:01,652 INFO fs.TestDFSIO: bufferSize = 1000000
2017-05-02 18:25:01,652 INFO fs.TestDFSIO: baseDir = /benchmarks/TestDFSIO
2017-05-02 18:25:02,500 INFO fs.TestDFSIO: Cleaning up test files
2. TeraSort
A complete TeraSort test consists of three steps:
(1) run TeraGen to generate random input data;
(2) run TeraSort on that input;
(3) run TeraValidate to verify the sorted output.
The input does not need to be regenerated for every test: generate it once, and later runs can skip the first step.
TeraGen is invoked as follows:
hadoop jar hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>
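Since each TeraGen row is exactly 100 bytes, the row count determines the data size directly:

```shell
# 10,000,000 rows of 100 bytes each, as used in the TeraGen command below
rows=10000000
echo "$((rows * 100)) bytes"
```

That is 10^9 bytes, i.e. the "1 GB" of input referred to here.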
The following command runs TeraGen to generate 1 GB of input data, written to /examples/terasort-input:
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar teragen 10000000 /examples/terasort-input
The following command runs TeraSort on that data and writes the result to /examples/terasort-output:
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar terasort /examples/terasort-input /examples/terasort-output
The following command runs TeraValidate to check that the TeraSort output is sorted; any out-of-order keys are reported under /examples/terasort-validate:
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar teravalidate /examples/terasort-output /examples/terasort-validate
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop fs -count /examples/terasort-validate
1 3 1000000000 /examples/terasort-validate
3. nnbench
nnbench stress-tests the NameNode: it generates a large number of HDFS-related requests to put the NameNode under heavy load, and can simulate creating, reading, renaming, and deleting files on HDFS.
The following example uses 12 mappers and 6 reducers to create 1000 files:
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/advhadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha1-tests.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /bench/NNBench-`hostname -s`
…...
DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 1124
The AvgTime above is 1124 milliseconds, i.e. the operations completed in about 1.1 seconds on average.
References:
http://blog.****.net/flygoa/article/details/52127382
http://www.aixchina.net/Question/177983
Comparison with Hadoop 2.7.1
1. TestDFSIO
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 10MB
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 10MB
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ cat TestDFSIO_results.log
----- TestDFSIO ----- : write
Date & time: Wed May 03 00:14:30 CST 2017
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 121.50668286755771
Average IO rate mb/sec: 128.30081176757812
IO rate std deviation: 23.2361211216607
Test exec time sec: 4.6
----- TestDFSIO ----- : read
Date & time: Wed May 03 00:15:21 CST 2017
Number of files: 10
Total MBytes processed: 100.0
Throughput mb/sec: 257.0694087403599
Average IO rate mb/sec: 282.7495422363281
IO rate std deviation: 65.24276389107759
Test exec time sec: 2.828
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar TestDFSIO -clean
2.TeraSort
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar teragen 10000000 /examples/terasort-input
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar terasort /examples/terasort-input /examples/terasort-output
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar teravalidate /examples/terasort-output /examples/terasort-validate
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop fs -count /examples/terasort-validate
1 3 1000000000 /examples/terasort-validate
3.nnbench
advhadoop@happy-Lenovo-IdeaPad-Y480:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1-tests.jar nnbench -operation create_write -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 -replicationFactorPerFile 3 -readFileAfterOpen true -baseDir /bench/NNBench-`hostname -s`
…...
17/05/03 00:32:40 INFO hdfs.NNBench: -------------- NNBench -------------- :
17/05/03 00:32:40 INFO hdfs.NNBench: Version: NameNode Benchmark 0.4
17/05/03 00:32:40 INFO hdfs.NNBench: Date & time: 2017-05-03 00:32:40,41
17/05/03 00:32:40 INFO hdfs.NNBench:
17/05/03 00:32:40 INFO hdfs.NNBench: Test Operation: create_write
17/05/03 00:32:40 INFO hdfs.NNBench: Start time: 2017-05-03 00:32:34,942
17/05/03 00:32:40 INFO hdfs.NNBench: Maps to run: 12
17/05/03 00:32:40 INFO hdfs.NNBench: Reduces to run: 6
17/05/03 00:32:40 INFO hdfs.NNBench: Block Size (bytes): 1
17/05/03 00:32:40 INFO hdfs.NNBench: Bytes to write: 0
17/05/03 00:32:40 INFO hdfs.NNBench: Bytes per checksum: 1
17/05/03 00:32:40 INFO hdfs.NNBench: Number of files: 1000
The result above shows an average job completion time of 14 seconds.
Comparing the numbers: in TestDFSIO, Hadoop 3.0's read throughput roughly doubles that of Hadoop 2.7.1 (546 vs 257 MB/s), while write throughput is comparable (104 vs 122 MB/s); the TeraSort results show no difference; and the nnbench NameNode load test also completes noticeably faster.
Installing Hadoop 3.0 shows that the process is essentially the same as for Hadoop 2.7.1. On the performance side, these benchmarks give a first look at Hadoop's benchmarking tools and suggest that Hadoop 3.0 brings a fair overall improvement. Note also that the two versions use different default NameNode web ports (9870 in 3.0 versus 50070 in 2.x).