Hadoop provides the MapReduce programming framework; to exploit its parallel processing capability you write Map and Reduce programs. To make testing easier, Hadoop ships a sample word-count application, located in a file named like hadoop-examples-*.jar under the Hadoop installation directory. Besides word count, this jar also contains distributed implementations of grep and other programs, which can be listed with the following command.
Note: with an rpm installation, the examples jar is at /usr/share/hadoop/hadoop-examples-1.2.1.jar
[hadoop@master ~]$ hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
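The wordcount example implements the classic MapReduce pattern: the map phase emits a (word, 1) pair for every token, and the reduce phase sums the counts per word after the shuffle groups pairs by key. A minimal Python sketch of the same logic (this is only an illustration of the algorithm, not the Java code inside the jar):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every whitespace-separated token.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: group by key and sum the counts, as the reducer does
    # after the shuffle has brought all pairs for a word together.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(map_phase(lines)))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In the real job, the map and reduce phases run as separate tasks on different nodes, and Hadoop handles the grouping (shuffle) between them.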
First, create an `in` directory in HDFS and put two files into it, then run the test.
[hadoop@master ~]$ hadoop fs -mkdir in
[hadoop@master ~]$ hadoop fs -put /etc/fstab /etc/profile in
Run the test:
[hadoop@master ~]$ hadoop jar /usr/share/hadoop/hadoop-examples-1.2.1.jar wordcount in out
14/03/06 11:26:42 INFO input.FileInputFormat: Total input paths to process : 2
14/03/06 11:26:42 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/03/06 11:26:42 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/06 11:26:43 INFO mapred.JobClient: Running job: job_201403061123_0001
14/03/06 11:26:44 INFO mapred.JobClient: map 0% reduce 0%
14/03/06 11:26:50 INFO mapred.JobClient: map 100% reduce 0%
14/03/06 11:26:57 INFO mapred.JobClient: map 100% reduce 33%
14/03/06 11:26:58 INFO mapred.JobClient: map 100% reduce 100%
14/03/06 11:26:59 INFO mapred.JobClient: Job complete: job_201403061123_0001
14/03/06 11:26:59 INFO mapred.JobClient: Counters: 29
14/03/06 11:26:59 INFO mapred.JobClient: Job Counters
14/03/06 11:26:59 INFO mapred.JobClient: Launched reduce tasks=1
14/03/06 11:26:59 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=7329
14/03/06 11:26:59 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/03/06 11:26:59 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/03/06 11:26:59 INFO mapred.JobClient: Launched map tasks=2
14/03/06 11:26:59 INFO mapred.JobClient: Data-local map tasks=2
14/03/06 11:26:59 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8587
14/03/06 11:26:59 INFO mapred.JobClient: File Output Format Counters
14/03/06 11:26:59 INFO mapred.JobClient: Bytes Written=2076
14/03/06 11:26:59 INFO mapred.JobClient: FileSystemCounters
14/03/06 11:26:59 INFO mapred.JobClient: FILE_BYTES_READ=2948
14/03/06 11:26:59 INFO mapred.JobClient: HDFS_BYTES_READ=3139
14/03/06 11:26:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=167810
14/03/06 11:26:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2076
14/03/06 11:26:59 INFO mapred.JobClient: File Input Format Counters
14/03/06 11:26:59 INFO mapred.JobClient: Bytes Read=2901
14/03/06 11:26:59 INFO mapred.JobClient: Map-Reduce Framework
14/03/06 11:26:59 INFO mapred.JobClient: Map output materialized bytes=2954
14/03/06 11:26:59 INFO mapred.JobClient: Map input records=97
14/03/06 11:26:59 INFO mapred.JobClient: Reduce shuffle bytes=2954
14/03/06 11:26:59 INFO mapred.JobClient: Spilled Records=426
14/03/06 11:26:59 INFO mapred.JobClient: Map output bytes=3717
14/03/06 11:26:59 INFO mapred.JobClient: Total committed heap usage (bytes)=336994304
14/03/06 11:26:59 INFO mapred.JobClient: CPU time spent (ms)=2090
14/03/06 11:26:59 INFO mapred.JobClient: Combine input records=360
14/03/06 11:26:59 INFO mapred.JobClient: SPLIT_RAW_BYTES=238
14/03/06 11:26:59 INFO mapred.JobClient: Reduce input records=213
14/03/06 11:26:59 INFO mapred.JobClient: Reduce input groups=210
14/03/06 11:26:59 INFO mapred.JobClient: Combine output records=213
14/03/06 11:26:59 INFO mapred.JobClient: Physical memory (bytes) snapshot=331116544
14/03/06 11:26:59 INFO mapred.JobClient: Reduce output records=210
14/03/06 11:26:59 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3730141184
14/03/06 11:26:59 INFO mapred.JobClient: Map output records=360
Note: the `out` directory must not exist beforehand; the job creates it itself and fails if it already exists. After the job finishes, the results can be viewed with `hadoop fs -cat out/part-r-00000`.
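The Map-Reduce Framework counters above show the combiner at work: 360 map output records were locally pre-aggregated into 213 combine output records, so the reducer received 213 records instead of 360. A toy sketch of this map-side pre-aggregation (the sample words are made up for illustration):

```python
from collections import Counter

# Raw map output: one (word, 1) record per token, duplicates included.
map_output = [("fs", 1), ("defaults", 1), ("fs", 1), ("defaults", 1), ("swap", 1)]

# The combiner pre-sums counts per key on the map side, so the shuffle
# moves one record per distinct word instead of one per occurrence.
combined = Counter()
for word, n in map_output:
    combined[word] += n

print(len(map_output), "->", len(combined))  # prints: 5 -> 3
```

Because wordcount's reduce function (summing) is associative and commutative, it can double as the combiner, shrinking shuffle traffic without changing the result.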
VII. Summary of common errors
1. map 100%, reduce 0%
This is usually caused by hostnames that do not resolve to the correct IP addresses; carefully check the /etc/hosts file on all three nodes.
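For example, a consistent /etc/hosts on all three nodes might look like the following (the IP addresses and the slave hostnames are placeholders; substitute your own):

```
192.168.1.10   master
192.168.1.11   slave1
192.168.1.12   slave2
```

Every node should map every cluster hostname to the same address, and the hostnames must match those used in the Hadoop configuration files.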
2. Error: Java heap space
The heap allocated to the task JVMs is too small. In mapred-site.xml, increase the value of mapred.child.java.opts, e.g. try 1024 MB.
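A fragment for mapred-site.xml raising the per-task heap (the value is a JVM option string passed to each child task):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

Restart the TaskTrackers (or resubmit the job) after changing this so the new child JVM options take effect.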
3. The NameNode fails to start
Change the default temporary directory, as described earlier in this article.
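This is set via hadoop.tmp.dir in core-site.xml; the default lives under /tmp and may be wiped on reboot, losing NameNode metadata. A sketch of the setting (the path shown is a placeholder, pick a persistent directory owned by the hadoop user):

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/tmp</value>
</property>
```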
4. Name node is in safe mode, or JobTracker is in safe mode
The NameNode does not persist the mapping from data blocks to their storage locations; this mapping is rebuilt at HDFS startup from the block reports sent in by each DataNode. While the rebuild is in progress, HDFS is in safe mode.
In this situation you usually only need to wait a moment. You can check the state with `hadoop dfsadmin -safemode get`, or leave safe mode manually with `hadoop dfsadmin -safemode leave`.