2. A SQL-like dataflow scripting language built on MapReduce: Pig
A1 = load '/data/3.txt';                                          -- each record is one raw line of text
A = stream A1 through `sed "s/\t/ /g"`;                           -- turn tabs into spaces so the whole line stays in $0
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;  -- split each line into one word per record
C = filter B by word matches '\\w+';                              -- keep only tokens made of word characters
D = group C by word;
E = foreach D generate COUNT(C), group;                           -- count of each word, followed by the word itself
dump E;
Note: the field delimiter affects load and the $0 that follows. The default loader (PigStorage) splits records on tabs, so with a tab-separated file $0 would hold only the first field; streaming the input through sed to replace tabs with spaces keeps the entire line in $0 so that TOKENIZE can split it into words.
For details, see: https://gist.github.com/186460 (Hadoop-pig)
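The statements can be typed into the Grunt shell one by one, or saved to a file and handed to the pig client. A minimal sketch, assuming the script is saved under the hypothetical name wordcount.pig:
# wordcount.pig is a hypothetical file holding the statements above
pig -x local wordcount.pig        # quick check against the local filesystem (input path is then local)
pig -x mapreduce wordcount.pig    # run as a MapReduce job on the cluster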
3. A SQL-like language for building data warehouses: Hive
-- a single string column holding one raw line of input per row
create table textlines(text string);
load data inpath '/data/3.txt' overwrite into table textlines;
-- explode each line into one word per row, then count the occurrences of each word
SELECT wordColumn, count(1) FROM textlines LATERAL VIEW explode(split(text,'\t+')) wordTable AS wordColumn GROUP BY wordColumn;
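These statements can be entered in the Hive CLI, or run non-interactively from the shell. A minimal sketch, assuming they are saved under the hypothetical name wordcount.hql:
hive -f wordcount.hql                              # run the whole script
hive -e "select * from textlines limit 10"         # or run a single statement inline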
4. A cross-platform scripting language: Python (used here via Hadoop Streaming)
map:
#!/usr/bin/python
import sys

# mapper: split each tab-separated input line and emit one word per output line
for line in sys.stdin:
    for word in line.strip().split("\t"):
        print word
reduce:
#!/usr/bin/python
import sys

# reducer: count how many times each word appears
counts = {}
for line in sys.stdin:
    word = line.strip()
    if word not in counts:
        counts[word] = 1
    else:
        counts[word] += 1
for word, count in counts.items():
    print str(word) + ": " + str(count)
Finally, run the job from the shell:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.203.0.jar -file map.py -file reduce.py -mapper map.py -reducer reduce.py -input /data/3.txt -output /data/py
Note: each script must explicitly name its interpreter on the first (shebang) line, and both scripts must be given execute permission.
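A minimal sketch of those preparation steps, plus a quick local check of the map/reduce logic with an ordinary pipe before submitting the job (assuming a local copy of the input named 3.txt; the sort step only mimics the shuffle Hadoop performs between map and reduce):
chmod +x map.py reduce.py                  # grant execute permission, as required above
cat 3.txt | ./map.py | sort | ./reduce.py  # rough local simulation of the streaming job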