A MapReduce Program in a Maven Project (1): Implementing WordCount
Prerequisites:
1. JDK 1.8 installed (on Windows)
2. Maven 3.3.9 installed (on Windows)
3. Eclipse installed (on Windows)
4. Hadoop installed (on Linux)
Configure Maven in Eclipse (skip this step if you have already done it):
Open Eclipse --> Window --> Preferences
Create a new Maven project:
File --> New --> Project
Type "Maven", select Maven Project, and click Next
After a short wait the project is built, with the directory structure shown in the figure below
Edit the pom.xml file
a. Configure the main class: add the following before </project>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.1.0</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <!-- The class containing main(); change this to your own main class -->
                <mainClass>com.MyWordCount.WordCountMain</mainClass>
              </transformer>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
The edited file is shown in the screenshot below:
b. Configure the MapReduce program's dependencies (important): declare the libraries the program depends on here, and Maven downloads them for you, saving the trouble of adding jar files by hand. Add the following before </dependencies>:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.7.3</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.7.3</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>2.7.3</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>2.7.3</version>
</dependency>
When you are done editing, remember to save with Ctrl + S
Refresh the project
Delete the generated App.java file
Create a new Java class
After creating WordCountMain.java, you will see the screen shown below:
Then create two more classes: WordCountMapper.java and WordCountReducer.java
Once they are created, the project looks like this:
Write the code:
WordCountMapper.java
package com.MyWordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Generic parameters: k1, v1 (input) and k2, v2 (output)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key1, Text value1, Context context)
            throws IOException, InterruptedException {
        // Input line, e.g.: I like MapReduce
        String data = value1.toString();
        // Tokenize: split on spaces
        String[] words = data.split(" ");
        // Emit <k2, v2> = <word, 1> for each word
        for (String w : words) {
            context.write(new Text(w), new IntWritable(1));
        }
    }
}
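The tokenization step above is worth a closer look. A minimal sketch in plain Java (no Hadoop types needed; the class name SplitDemo is made up for illustration) shows how split(" ") behaves. Note the pitfall: consecutive spaces produce empty tokens, so splitting on the whitespace regex \s+ is more robust when the input is not perfectly single-spaced:

```java
public class SplitDemo {
    public static void main(String[] args) {
        // The mapper's tokenization: split on a single space
        String[] words = "I like MapReduce".split(" ");
        System.out.println(words.length);  // prints 3

        // Pitfall: consecutive spaces yield an empty token
        String[] messy = "I  like".split(" ");
        System.out.println(messy.length);  // prints 3 -- "I", "", "like"

        // More robust: split on runs of whitespace
        String[] clean = "I  like".split("\\s+");
        System.out.println(clean.length);  // prints 2
    }
}
```

If an empty token reaches context.write, the empty string is counted like any other word and shows up as an odd blank entry in the output.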
WordCountReducer.java
package com.MyWordCount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Generic parameters: k3, v3 (input) and k4, v4 (output)
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text k3, Iterable<IntWritable> v3, Context context)
            throws IOException, InterruptedException {
        // Sum the values in v3
        int total = 0;
        for (IntWritable v : v3) {
            total += v.get();
        }
        // Emit k4 (the word) and v4 (its count)
        context.write(k3, new IntWritable(total));
    }
}
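Between map and reduce, the framework sorts and groups the mapper's <word, 1> pairs by key, so each reduce call receives one word together with all of its 1s. That grouping-and-summing logic can be sketched without Hadoop using a plain Java map (a simplified illustration of the idea, not the framework's actual shuffle; the sample words are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ShuffleDemo {
    public static void main(String[] args) {
        // Hypothetical mapper output (keys only; every value is 1)
        String[] pairs = {"I", "like", "MapReduce", "I", "like", "Hadoop"};

        // Group by key and sum the ones, as the reducer's loop does
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : pairs) {
            counts.merge(word, 1, Integer::sum);
        }
        System.out.println(counts);  // prints {I=2, like=2, MapReduce=1, Hadoop=1}
    }
}
```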
WordCountMain.java
package com.MyWordCount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMain {

    public static void main(String[] args) throws Exception {
        // 1. Create a job and set the entry point
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCountMain.class); // the class containing main()

        // 2. Set the job's mapper and its output types <k2, v2>
        job.setMapperClass(WordCountMapper.class);     // the Mapper class
        job.setMapOutputKeyClass(Text.class);          // type of k2
        job.setMapOutputValueClass(IntWritable.class); // type of v2

        // 3. Set the job's reducer and its output types <k4, v4>
        job.setReducerClass(WordCountReducer.class);   // the Reducer class
        job.setOutputKeyClass(Text.class);             // type of k4
        job.setOutputValueClass(IntWritable.class);    // type of v4

        // 4. Set the job's input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 5. Run the job and wait for it to finish
        job.waitForCompletion(true);
    }
}
View the project directory:
Package with Maven
Press Win + R, type cmd, and confirm
Run the Maven package command: mvn clean package
Note: keep your network connection up, because Maven downloads some files while packaging.
Upload the generated jar file to the Linux machine running Hadoop; here the WinSCP tool is used
In the Linux home directory (~), create a file a.txt with the following contents:
Upload a.txt to the HDFS root directory: $ hdfs dfs -put a.txt /
Run the MapReduce job:
$ hadoop jar MyWordCount-0.0.1-SNAPSHOT.jar /a.txt /output
The job prints output like the following while it runs:
View the result:
$ hdfs dfs -cat /output/part-r-00000
The program counts how often each word occurs in the HDFS file a.txt; each line of part-r-00000 holds a word and its count, separated by a tab.
Done! Enjoy it!