Finally Got WordCount Working
I tried the WordCount example that shows up in every Hadoop tutorial.
It refused to run at first, and it took a round of fixes before it finally worked.
Apparently the classes you should extend and the argument types differ depending on the Hadoop version.
Mapper and Reducer
This time I ran it on version 0.20.2+737.
$ hadoop version
Hadoop 0.20.2+737
Subversion -r 98c55c28258aa6f42250569bd7fa431ac657bdbd
Compiled by root on Mon Oct 11 13:14:05 EDT 2010
From source with checksum d13991fbc138e18f3b7eb8f60ee708dd
The source looks like this.
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // setJarByClass() only needs to be called once; any class
        // inside the JAR (e.g. WordCount itself) will do.
        job.setJarByClass(WordCount.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
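The tokenize-and-sum logic itself can be checked without a cluster. Here's a minimal plain-Java sketch of the same pipeline (no Hadoop dependencies; the class `LocalWordCount` and method `count` are names I made up for illustration, not part of the job above):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class LocalWordCount {
    // Same logic as the job: map() emits (word, 1) per token,
    // reduce() sums the 1s per word -- here collapsed into one loop.
    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken();
                Integer sum = counts.get(word);
                counts.put(word, sum == null ? 1 : sum + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            count("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println(counts); // e.g. Hadoop=2, Hello=2, World=2, Bye=1, Goodbye=1
    }
}
```

Running this against the two sample lines from the input files below gives the same counts as the cluster run, which is a quick sanity check before debugging classpath and version issues.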
build.xml
The tutorial compiles by invoking javac directly, but that's tedious, so I let Ant do it.
Installing Ant pulls in OpenJDK as a dependency, though, which feels a little off...
yum install ant
While I was at it, I also specified the main class in build.xml.
<project name="WordCount" default="all" basedir=".">
    <target name="all" depends="init, compile, archive" />

    <target name="init">
        <mkdir dir="classes" />
    </target>

    <target name="compile">
        <javac srcdir="org/myorg" destdir="classes"
               classpath="/usr/lib/hadoop-0.20/hadoop-core-0.20.2+737.jar" />
    </target>

    <target name="archive">
        <jar jarfile="word-count.jar" basedir="classes">
            <manifest>
                <attribute name="Main-Class" value="org.myorg.WordCount"/>
            </manifest>
        </jar>
    </target>

    <target name="clean">
        <delete dir="classes" />
    </target>
</project>
Preparation
Once the compile has produced the JAR file, prepare the input text files.
$ echo "Hello World Bye World" > file01.txt
$ echo "Hello Hadoop Goodbye Hadoop" > file02.txt
$ hadoop dfs -mkdir /input/word_count
$ hadoop dfs -copyFromLocal file01.txt /input/word_count
$ hadoop dfs -copyFromLocal file02.txt /input/word_count
Run and Check the Results
$ export HADOOP_CLASSPATH=.
$ hadoop jar word-count.jar /input/word_count /user/user/output
$ hadoop fs -cat /user/user/output/part-r-00000
Bye	1
Goodbye	1
Hadoop	2
Hello	2
World	2
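The output above is in TextOutputFormat's default form: one key-value pair per line, separated by a tab. If you want to read a downloaded copy back into a Java program, a minimal sketch could look like this (the class `OutputParser` and method `parse` are my own names, just for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OutputParser {
    // part-r-00000 lines look like "word<TAB>count"; split on the tab
    // and parse the count, keeping insertion (i.e. sorted key) order.
    static Map<String, Integer> parse(String output) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String line : output.split("\n")) {
            if (line.isEmpty()) continue;
            String[] fields = line.split("\t");
            counts.put(fields[0], Integer.parseInt(fields[1]));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample taken from the job output above.
        String sample = "Bye\t1\nGoodbye\t1\nHadoop\t2\nHello\t2\nWorld\t2";
        System.out.println(parse(sample)); // Hadoop maps to 2, Bye to 1, etc.
    }
}
```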