Several Common Ways to Run MapReduce Jobs in Hadoop (2)

2. A SQL-like data-flow scripting language on top of MapReduce: Pig

A1 = load '/data/3.txt';
-- replace tabs with spaces so the whole line lands in a single field ($0)
A = stream A1 through `sed "s/\t/ /g"`;
-- split each line into individual words
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
-- keep only word-like tokens
C = filter B by word matches '\\w+';
-- group identical words and count each group
D = group C by word;
E = foreach D generate COUNT(C), group;
dump E;

Note how the field delimiter affects load and the later $0: by default load splits fields on tabs, so with tab-delimited input $0 would hold only the first field. The sed step above rewrites tabs to spaces so that $0 carries the whole line; alternatively, a delimiter can be set explicitly, e.g. load '/data/3.txt' using PigStorage(',').
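
To run the script, save the statements to a file and submit it with the pig client; local mode is handy for a quick check (wordcount.pig is a hypothetical file name):

pig wordcount.pig
# or, against the local filesystem for fast iteration:
pig -x local wordcount.pig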

For details, see: https://gist.github.com/186460 (Hadoop-pig)

3. A SQL-like language for building data warehouses: Hive

create table textlines(text string);
load data inpath '/data/3.txt' overwrite into table textlines;
-- explode each line into one row per word, then count per word
SELECT wordColumn, count(1)
FROM textlines
LATERAL VIEW explode(split(text, '\t+')) wordTable AS wordColumn
GROUP BY wordColumn;
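
The statements can be typed into the hive CLI interactively, or saved to a file and run in one shot (wordcount.hql is a hypothetical file name):

hive -f wordcount.hql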

4. A cross-platform scripting language: Python (via Hadoop Streaming)

The mapper, map.py:

#!/usr/bin/python
import sys

# emit every tab-separated token from each input line on its own line
for line in sys.stdin:
    for word in line.strip().split("\t"):
        print word

The reducer, reduce.py:

#!/usr/bin/python
import sys

# tally word counts in a dict; streaming delivers the mapper output
# here after Hadoop's shuffle/sort phase
counts = {}
for line in sys.stdin:
    word = line.strip()
    if word not in counts:
        counts[word] = 1
    else:
        counts[word] += 1

for word, count in counts.items():
    print word + ": " + str(count)
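
Before submitting to the cluster, the pipeline can be smoke-tested locally; the sort in the middle stands in for the shuffle/sort that Hadoop runs between the map and reduce phases (assuming a local copy of 3.txt):

cat 3.txt | python map.py | sort | python reduce.py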

Finally, submit the job from the shell:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.203.0.jar -file map.py -file reduce.py -mapper map.py -reducer reduce.py -input /data/3.txt -output /data/py

Note: each script must explicitly declare its interpreter on the first line (the #!/usr/bin/python shebang) and must be given execute permission.
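
For example (the output path matches the -output directory used above):

chmod +x map.py reduce.py
# after the job finishes, inspect the results in HDFS:
hadoop fs -cat /data/py/part-*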
