Several common ways to run MapReduce jobs in Hadoop (3)

5. The Swiss Army knife of Linux: the shell script

map:

#!/bin/bash
# map: split each tab-separated field onto its own line
tr '\t' '\n'

reduce:
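The reduce script itself is missing from the text. Judging from the output shown further down (a count followed by each word, in the style of `uniq -c`), it was most likely a simple duplicate count over the sorted stream. The following is a hedged reconstruction, not necessarily the author's original:

```shell
#!/bin/bash
# reduce (assumed): Hadoop Streaming feeds the reducer lines sorted by key,
# so counting adjacent duplicates with uniq -c yields "count<space>word".
# The extra sort is redundant on a real cluster but makes the script
# safe to run standalone as well.
sort | uniq -c
```

Run locally, `printf 'aa\nbb\nbb\ncc\n' | ./reduce.sh` would print one line per distinct word with its count, matching the format of the job output below.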

Finally, run it from the shell:

june@deepin:~/Hadoop/hadoop-0.20.203.0/tmp>
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.203.0.jar -file map.py -file reduce.py -mapper map.py -reducer reduce.py -input /data/3.txt -output /data/py
packageJobJar: [map.py, reduce.py, /home/june/data_hadoop/tmp/hadoop-unjar2676221286002400849/] [] /tmp/streamjob8722854685251202950.jar tmpDir=null
12/10/14 21:57:00 INFO mapred.FileInputFormat: Total input paths to process : 1
12/10/14 21:57:00 INFO streaming.StreamJob: getLocalDirs(): [/home/june/data_hadoop/tmp/mapred/local]
12/10/14 21:57:00 INFO streaming.StreamJob: Running job: job_201210141552_0041
12/10/14 21:57:00 INFO streaming.StreamJob: To kill this job, run:
12/10/14 21:57:00 INFO streaming.StreamJob: /home/june/hadoop/hadoop-0.20.203.0/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201210141552_0041
12/10/14 21:57:00 INFO streaming.StreamJob: Tracking URL: :50030/jobdetails.jsp?jobid=job_201210141552_0041
12/10/14 21:57:01 INFO streaming.StreamJob: map 0% reduce 0%
12/10/14 21:57:13 INFO streaming.StreamJob: map 67% reduce 0%
12/10/14 21:57:19 INFO streaming.StreamJob: map 100% reduce 0%
12/10/14 21:57:22 INFO streaming.StreamJob: map 100% reduce 22%
12/10/14 21:57:31 INFO streaming.StreamJob: map 100% reduce 100%
12/10/14 21:57:37 INFO streaming.StreamJob: Job complete: job_201210141552_0041
12/10/14 21:57:37 INFO streaming.StreamJob: Output: /data/py
june@deepin:~/hadoop/hadoop-0.20.203.0/tmp>
hadoop fs -cat /data/py/part-00000
1 aa
1 bb
1 bb
2 cc
1 dd
june@deepin:~/hadoop/hadoop-0.20.203.0/tmp>
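Before submitting a streaming job to the cluster, the whole pipeline can be smoke-tested locally by chaining the scripts with `sort`, which stands in for the shuffle phase. A minimal sketch; the sample input and the `uniq -c` reduce stage are assumptions for illustration:

```shell
#!/bin/bash
# Simulate Hadoop Streaming locally: map | shuffle (sort) | reduce.
# The map stage is the tr one-liner from above; the reduce stage is
# an assumed uniq -c, consistent with the output format of the job.
printf 'aa\tbb\tcc\nbb\tcc\tdd\n' \
  | tr '\t' '\n' \
  | sort \
  | uniq -c
```

If the local run produces sensible counts, the same scripts can be handed to `hadoop-streaming` via `-mapper` and `-reducer` with reasonable confidence.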

Note: some of the methods above ignore trailing whitespace after a field while others count it as part of the key; check your results carefully.
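This whitespace caveat is easy to demonstrate: a word with a trailing space sorts and counts as a distinct key, which may well be why `bb` appears twice in the job output above. A quick illustration (the trailing space after the first `bb` is deliberate):

```shell
#!/bin/bash
# 'bb ' (with a trailing space) and 'bb' are different keys to sort/uniq,
# so they are counted separately instead of being merged into one line.
printf 'bb \nbb\n' | sort | uniq -c
```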

To sum up: the methods above are listed mainly to offer different angles of attack. When solving a real problem, both development efficiency and execution efficiency need to be weighed, so don't confine yourself to any single approach.
