The file to process is hello.txt, with the following contents:
john 91
mem 21
ave 33
sily 42
fdk 51
ksed 67
umkt 75
svv 28
john 11
mem 34
ave 33
sily 424
fdk 2115
ksed 7896
umkt 5237
svv 1238
john 111
mem 7832
ave 6773
sily 1234
fdk 523
ksed 667
umkt 117
svv 800
john 1111
mem 8900
ave 90
sily 48
fdk 37
ksed 52
umkt 10
svv 21
john 4
mem 23432
ave 210
sily 677
fdk 455
ksed 322
umkt 100
svv 723
The file itself has no real-world meaning; the task is to find the maximum value recorded for each name.
First, upload the file to HDFS:
hadoop dfs -put /home/hello.txt /home/
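To confirm the upload worked, you can list the target directory and print the file back (a quick sanity check using the same paths as above):
hadoop dfs -ls /home/
hadoop dfs -cat /home/hello.txt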
Then write the mapper script, map.py:
#!/usr/bin/env python
import sys

# Pass each "name value" line through unchanged; Hadoop's shuffle
# phase sorts the mapper output so identical names arrive at the
# reducer next to each other.
for line in sys.stdin:
    val = line.strip()
    arr = val.split(" ")
    if len(arr) >= 2:
        print "%s %s" % (arr[0], arr[1])
Next comes the reducer script, reduce.py:
#!/usr/bin/env python
import sys

# Lines arrive sorted by name, so all values for one name are adjacent.
# Keep a running maximum for the current name and emit it when the name changes.
(last_key, max_val) = (None, 0)
for line in sys.stdin:
    val = line.strip()
    arr = val.split(" ")
    if len(arr) >= 2:
        if last_key and last_key != arr[0]:
            # Name changed: emit the previous name's maximum and start over.
            print "%s %s" % (last_key, max_val)
            (last_key, max_val) = (arr[0], int(arr[1]))
        else:
            (last_key, max_val) = (arr[0], max(max_val, int(arr[1])))

# Emit the final name's maximum.
if last_key:
    print "%s %s" % (last_key, max_val)
Then run the MapReduce job to find the maximum value under each name:
hadoop jar /usr/java/hadoop020/build/contrib/streaming/hadoop-streaming-0.20.jar -input /home/hello.txt -output /home/output -mapper "python /home/map.py" -reducer "python /home/reduce.py" -file /home/map.py -file /home/reduce.py
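When the job finishes, the result sits in part files under the output directory; for a single-reducer job the file is typically named part-00000 (an assumption about the default naming, adjust if your output differs):
hadoop dfs -cat /home/output/part-00000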
Now check the result:
ave 6773
fdk 2115
john 1111
ksed 7896
mem 23432
sily 1234
svv 1238
umkt 5237
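As an extra sanity check, the same aggregation can be reproduced in a few lines of single-process Python against the local copy of the file (a minimal sketch, assuming the file still lives at /home/hello.txt); it should print the same eight pairs:

#!/usr/bin/env python
# Single-process check: track the per-name maximum in a dict.
maxes = {}
for line in open("/home/hello.txt"):
    arr = line.strip().split(" ")
    if len(arr) >= 2:
        name, val = arr[0], int(arr[1])
        if name not in maxes or val > maxes[name]:
            maxes[name] = val
# Print names alphabetically, matching the order of the job output.
for name in sorted(maxes):
    print "%s %s" % (name, maxes[name])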