Using hadoop-streaming to call a Python script that parses User-Agent (UA) strings
1. Locate the hadoop-streaming-2.3.0-mr1-cdh5.1.2.jar package in the environment
$ cd $HADOOP_HOME && find ./ -name "*streaming*"
./share/doc/hadoop-streaming
./share/doc/hadoop-mapreduce1/streaming.pdf
./share/doc/hadoop-mapreduce1/streaming.html
./share/doc/hadoop-mapreduce1/api/org/apache/hadoop/streaming
./share/doc/api/org/apache/hadoop/streaming
./share/hadoop/mapreduce1/contrib/streaming
./share/hadoop/mapreduce1/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.2.jar
./share/hadoop/tools/lib/hadoop-streaming-2.3.0-cdh5.1.2.jar
./share/hadoop/tools/sources/hadoop-streaming-2.3.0-cdh5.1.2-sources.jar
./share/hadoop/tools/sources/hadoop-streaming-2.3.0-cdh5.1.2-test-sources.jar
./cloudera/patches/0051-MR1-CLOUDERA-BUILD.-hadoop-streaming-has-wrong-versi.patch
./cloudera/patches/0092-MR1-CLOUDERA-BUILD.-Publish-hadoop-streaming-jar.patch
./src/hadoop-mapreduce1-project/ivy/hadoop-streaming-pom-template.xml
./src/hadoop-mapreduce1-project/cloudera/maven-packaging/hadoop-streaming
./src/hadoop-mapreduce1-project/src/contrib/streaming
./src/hadoop-mapreduce1-project/src/contrib/streaming/src/test/org/apache/hadoop/streaming
./src/hadoop-mapreduce1-project/src/contrib/streaming/src/java/org/apache/hadoop/streaming
./src/hadoop-mapreduce1-project/src/docs/src/documentation/content/xdocs/streaming.xml
./src/hadoop-tools/hadoop-streaming
./src/hadoop-tools/hadoop-streaming/src/test/java/org/apache/hadoop/streaming
./src/hadoop-tools/hadoop-streaming/src/main/java/org/apache/hadoop/streaming
2. The working directory contains three parts: the jar, the Python scripts, and the hadoop-streaming launch script
3. The launch script, which runs hadoop-streaming
$ cat only-ua-shell.sh
hadoop jar hadoop-streaming-2.3.0-mr1-cdh5.1.2.jar \
-D mapred.job.name="ua_parse" \
-D mapred.map.tasks=1500 \
-files ua_180523_test_mr.py,rewrite_ua_parser.py,MANUFACTURER.py \
-mapper ua_180523_test_mr.py \
-input "$1" \
-output "$2" \
-numReduceTasks 1500
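The mapper script itself is not shown in this post. For orientation, a streaming mapper reads raw lines from stdin and writes tab-separated records to stdout, where everything before the first tab is treated as the key. Below is a minimal sketch of that shape. It is not the actual ua_180523_test_mr.py: the function names parse_ua and map_line, the regular expression, and the output columns are all assumptions made for illustration, and the real parsing rules live in rewrite_ua_parser.py and MANUFACTURER.py.

```python
#!/usr/bin/env python
"""Hypothetical streaming mapper: one UA line in, one tab-separated record out."""
import re
import sys

# Assumed pattern, for illustration only. It picks the device model out of
# UA strings shaped like "... Android 8.0.0; SM-G950F Build/R16NW ...".
ANDROID_MODEL = re.compile(r"Android [\d.]+; ([^;)]+?) Build/")


def parse_ua(ua):
    """Return an (os, model) tuple guessed from a raw User-Agent string."""
    match = ANDROID_MODEL.search(ua)
    if match:
        return "Android", match.group(1).strip()
    if "iPhone" in ua:
        return "iOS", "iPhone"
    return "unknown", "unknown"


def map_line(line):
    """Turn one input line into a 'ua<TAB>os<TAB>model' record."""
    ua = line.rstrip("\n")
    os_name, model = parse_ua(ua)
    return "\t".join((ua, os_name, model))


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop streaming feeds each input split to the mapper on stdin.
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(map_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Because -mapper names the script directly, a script like this would need a shebang line and execute permission. It can also be checked locally, without a cluster, by piping a sample file into it.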