Finally got WordCount running

I tried the WordCount example that shows up in the Hadoop Tutorial.
It wouldn't run for quite a while; after fixing this and that, I finally got it working.
It seems the class you have to extend and the argument types differ depending on the Hadoop version.
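
For the record, here is roughly how the two APIs differ (a from-memory sketch, so the details may not match every release exactly):

// Old API (org.apache.hadoop.mapred): extend MapReduceBase, implement the
// Mapper interface, and emit through an OutputCollector plus a Reporter.
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // ... output.collect(word, one) ...
    }
}

// New API (org.apache.hadoop.mapreduce): extend the Mapper class directly
// and emit through a single Context object.
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... context.write(word, one) ...
    }
}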

Mapper and Reducer

This time I got it running on version 0.20.2+737.

$ hadoop version
Hadoop 0.20.2+737
Subversion  -r 98c55c28258aa6f42250569bd7fa431ac657bdbd
Compiled by root on Mon Oct 11 13:14:05 EDT 2010
From source with checksum d13991fbc138e18f3b7eb8f60ee708dd


The source looks like this.

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input line; emits (token, 1) for every
        // whitespace-separated token.
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Called once per distinct word; sums the 1s emitted by the mappers.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // One call is enough: any class packaged in the job JAR lets Hadoop
        // locate the JAR to ship, so calling this once with the outer class
        // covers both Map and Reduce.
        job.setJarByClass(WordCount.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Propagate job success/failure to the shell exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
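
By the way, the official tutorial also registers the reducer as a combiner, so partial sums get computed on the map side before the shuffle. Reusing Reduce is safe here because summing counts is associative and commutative; one extra line in main() would do it (I haven't tried this on my setup):

// Optional: pre-aggregate map output locally before the shuffle.
// Reduce can double as the combiner because summing is order-independent.
job.setCombinerClass(Reduce.class);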

build.xml

The Tutorial compiles by invoking javac directly, but that's too tedious, so I let Ant do it.
Installing Ant pulls in OpenJDK along with it, though, which feels a bit icky...

yum install ant


While I was at it, I also specified the main class in build.xml.

<project name="WordCount" default="all" basedir=".">
  <target name="all" depends="init, compile, archive" />

  <target name="init">
    <mkdir dir="classes" />
  </target>

  <target name="compile">
    <javac srcdir="org/myorg"
	   destdir="classes"
	   classpath="/usr/lib/hadoop-0.20/hadoop-core-0.20.2+737.jar"
	   />
  </target>

  <target name="archive">
    <jar jarfile="word-count.jar" basedir="classes">
      <manifest>
	<attribute name="Main-Class" value="org.myorg.WordCount"/>
      </manifest>
    </jar>
  </target>

  <target name="clean">
    <delete dir="classes" />
  </target>
</project>
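
With this in place, a bare ant in the project root runs the default all target (init, compile, archive in order) and drops word-count.jar in the current directory.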

Preparation

Once everything compiles and the JAR is built, prepare the input text files.

$ echo "Hello World Bye World" > file01.txt
$ echo "Hello Hadoop Goodbye Hadoop" > file02.txt
$ hadoop dfs -mkdir /input/word_count
$ hadoop dfs -copyFromLocal file01.txt /input/word_count
$ hadoop dfs -copyFromLocal file02.txt /input/word_count
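
The same preparation can also be done from Java with the FileSystem API instead of the shell. A minimal sketch (PrepareInput is just a name I made up; error handling omitted):

package org.myorg;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: writes the same two input files into HDFS.
public class PrepareInput {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (fs.default.name etc.)
        // from the usual config files.
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/input/word_count"));

        FSDataOutputStream out = fs.create(new Path("/input/word_count/file01.txt"));
        out.writeBytes("Hello World Bye World\n");
        out.close();

        out = fs.create(new Path("/input/word_count/file02.txt"));
        out.writeBytes("Hello Hadoop Goodbye Hadoop\n");
        out.close();
    }
}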

Run & check the results

$ export HADOOP_CLASSPATH=.
$ hadoop jar word-count.jar /input/word_count /user/user/output
$ hadoop fs -cat /user/user/output/part-r-00000
Bye	1
Goodbye	1
Hadoop	2
Hello	2
World	2
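
Reading the result back from Java works the same way; a quick sketch under the same assumptions as above (single reducer, so one part-r-00000, and ReadOutput is again a made-up name):

package org.myorg;

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: prints the word counts from the job output.
public class ReadOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // TextOutputFormat writes one "word<TAB>count" line per key,
        // so plain line-by-line reading is enough.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/user/output/part-r-00000"))));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}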