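I. Downloading the MapReduce WordCount Example

The quickest way to understand the MapReduce programming conventions is to look at how the official example code is written.

Open a shell session on the Hadoop machine. The hadoop-mapreduce-examples-3.1.3.jar package is located in: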

/opt/module/hadoop-3.1.3/share/hadoop/mapreduce

Then download it to your local machine:

sz hadoop-mapreduce-examples-3.1.3.jar

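Use a Java decompiler to inspect the contents of the jar.

Open the decompiler and drag the jar into it. Once loaded it looks like this (here I have already navigated to the WordCount class):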

II. Common Data Serialization Types

Here is the WordCount code:

package org.apache.hadoop.examples;

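import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;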

public class WordCount
{
  public static void main(String[] args) 
  	throws Exception
  {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: wordcount <in> [<in>...] <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    for (int i = 0; i < otherArgs.length - 1; i++) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[(otherArgs.length - 1)]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
  {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context)
      throws IOException, InterruptedException
    {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      this.result.set(sum);
      context.write(key, this.result);
    }
  }

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
  {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException
    {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        this.word.set(itr.nextToken());
        context.write(this.word, one);
      }
    }
  }
}

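The code above uses several data types we have not seen before. They are all Hadoop's own types. The table below compares Java types with the corresponding Hadoop data types: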

Java type    Hadoop Writable type
boolean      BooleanWritable
byte         ByteWritable
int          IntWritable
float        FloatWritable
long         LongWritable
double       DoubleWritable
String       Text
Map          MapWritable
Array        ArrayWritable
null         NullWritable

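Apart from String, which maps to Text, each Hadoop type is simply the capitalized Java type name with the suffix Writable appended, so the Hadoop data types are easy to remember.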
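To make the idea of a "serialization type" concrete, here is a minimal sketch in plain Java of what a Writable-style wrapper does. Hadoop's Writable interface declares write(DataOutput) and readFields(DataInput); the SimpleIntWritable class and its roundTrip helper below are hypothetical stand-ins built only on java.io, not Hadoop's actual IntWritable, so the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in that mimics the shape of Hadoop's IntWritable.
public class SimpleIntWritable {
    private int value;

    public SimpleIntWritable() { }
    public SimpleIntWritable(int value) { this.value = value; }

    public int get() { return value; }
    public void set(int value) { this.value = value; }

    // Serialize the wrapped int to a binary stream.
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Overwrite this object's state from a binary stream,
    // which allows a single instance to be reused for many records.
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    // Demo helper (an assumption, not a Hadoop API): serialize to bytes
    // and read the bytes back into a fresh object.
    public static SimpleIntWritable roundTrip(SimpleIntWritable source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));
            SimpleIntWritable copy = new SimpleIntWritable();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SimpleIntWritable copy = roundTrip(new SimpleIntWritable(42));
        System.out.println(copy.get()); // prints 42
    }
}
```

The mutable set() method is the same pattern WordCount relies on when IntSumReducer keeps one result object and calls this.result.set(sum) for every key.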

III. MapReduce Programming Conventions

The example code shows that the whole WordCount program is divided into three parts. Their declarations are extracted below:

public static void main(String[] args)
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>

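Here main corresponds to the Driver phase; IntSumReducer corresponds to the Reduce phase and extends the Reducer class; TokenizerMapper corresponds to the Map phase and extends the Mapper class.

The parent classes take several generic type parameters. Let's go through them one by one.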

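1. Mapper phase

A user-defined Mapper must extend its parent class, Mapper.
The first two generic parameters of Mapper are the key and value types of the input data, a key-value pair (the types can be user-defined).
The Mapper's output is also a key-value pair, whose types are given by the last two generic parameters.
The Mapper's business logic goes in the map() method. The MapTask process calls map() once for each input key-value pair (see the figure below).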
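The rules above can be sketched in plain Java without Hadoop. The class MapPhaseSketch and its runMapPhase method are hypothetical names for illustration; only the once-per-record call pattern mirrors what the MapTask does. The input key is assumed here to be simply the line index, and the value is the line itself.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class MapPhaseSketch {
    // Mirrors TokenizerMapper.map(): one input pair in, zero or more (word, 1) pairs out.
    public static void map(long key, String value, List<Map.Entry<String, Integer>> output) {
        StringTokenizer itr = new StringTokenizer(value);
        while (itr.hasMoreTokens()) {
            output.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
    }

    // Plays the role of the framework: calls map() once per input record.
    public static List<Map.Entry<String, Integer>> runMapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            map(i, lines.get(i), output);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(runMapPhase(Arrays.asList("hello world", "hello hadoop")));
        // prints [hello=1, world=1, hello=1, hadoop=1]
    }
}
```

Note that the map phase does no counting at all: the word "hello" appears twice in the output, each time with the value 1. Summing is left to the Reduce phase.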

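2. Reducer phase

A user-defined Reducer must extend its parent class, Reducer.
The Reducer's input types match the Mapper's output types and likewise form a key-value pair (see the figure below).
The Reducer's business logic goes in the reduce() method. The ReduceTask process calls reduce() once for each group of key-value pairs that share the same key.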
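The once-per-key call pattern can also be sketched in plain Java. ReducePhaseSketch and runReducePhase are hypothetical names, and the TreeMap grouping is only a simplified stand-in for the framework's shuffle and sort; the reduce() body follows the logic of IntSumReducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReducePhaseSketch {
    // Mirrors IntSumReducer.reduce(): one key plus all of its values in, one total out.
    public static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Plays the role of the framework: groups the map output by key (sorted),
    // then calls reduce() once per distinct key.
    public static Map<String, Integer> runReducePhase(List<String> mapOutputKeys) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String word : mapOutputKeys) {
            // Each word stands for a (word, 1) pair emitted by the map phase.
            groups.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(runReducePhase(Arrays.asList("hello", "world", "hello", "hadoop")));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```

With four map output pairs and three distinct keys, reduce() runs three times, not four.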

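3. Driver phase

The Driver acts as the client of the YARN cluster. It submits the whole program to YARN, and what it submits is a Job object that encapsulates the run parameters of the MapReduce program. This is explained in detail later.

The next section follows these conventions to write a WordCount program!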
