Controlling Output File Names in Hadoop

Date: 2020-08-29  Category: Programmer's Life

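Normally, each Hadoop Reducer produces one output file, and the files are named part-r-00000, part-r-00001, and so on. To control the output file names yourself, or to have each Reducer write several output files, use the MultipleOutputs class. MultipleOutputs builds the output file name from the output record's key and value (output key and output value) or from an arbitrary string. Files are normally named name-r-nnnnn, where name is an arbitrary name set by the program and nnnnn is the partition number.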
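The name-r-nnnnn pattern described above can be illustrated with a few lines of plain Java. This is only a sketch of the naming convention, not Hadoop code: the class ReducerFileNames and the method reducerFileName are hypothetical names introduced here, and the sketch assumes a reduce task with a five-digit, zero-padded partition number.

```java
import java.util.Locale;

public class ReducerFileNames {
    // Builds "name-r-nnnnn": the base name, the letter r for a reduce task,
    // and the partition number zero-padded to five digits.
    static String reducerFileName(String name, int partition) {
        return String.format(Locale.ROOT, "%s-r-%05d", name, partition);
    }

    public static void main(String[] args) {
        System.out.println(reducerFileName("part", 0));     // part-r-00000
        System.out.println(reducerFileName("part", 1));     // part-r-00001
        System.out.println(reducerFileName("summary", 12)); // summary-r-00012
    }
}
```

With the default base name part this reproduces the usual part-r-00000 files; any other base name only changes the prefix.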

Using MultipleOutputs:
Using MultipleOutputs takes the following four steps:

1. Declare a MultipleOutputs field in the Reducer
    private MultipleOutputs<NullWritable, Text> multipleOutputs;

2. Initialize MultipleOutputs in the Reducer's setup function
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

3. Control the output in the reduce function
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // The third argument (here the key) is the base name of the output file.
            multipleOutputs.write(NullWritable.get(), value, key.toString());
        }
    }

4. Close MultipleOutputs in the cleanup function
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
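The setup, write and close lifecycle in the four steps above can be modeled in plain Java to show why the final close matters. The following is a simplified sketch, not Hadoop's implementation: KeyedFileWriter and demo are hypothetical names, the files go to a local temporary directory, and base names are assumed to contain no "/". Like MultipleOutputs, it keeps one open writer per base name and relies on close to flush buffered records.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class KeyedFileWriter implements AutoCloseable {
    private final Path outputDir;
    private final int partition;
    // One open writer per base name, created on first use.
    private final Map<String, BufferedWriter> writers = new HashMap<>();

    public KeyedFileWriter(Path outputDir, int partition) {
        this.outputDir = outputDir;
        this.partition = partition;
    }

    // Plays the role of write(key, value, baseOutputPath) with a null key.
    public void write(String value, String baseName) {
        try {
            BufferedWriter writer = writers.get(baseName);
            if (writer == null) {
                String fileName = String.format(Locale.ROOT, "%s-r-%05d", baseName, partition);
                writer = Files.newBufferedWriter(outputDir.resolve(fileName), StandardCharsets.UTF_8);
                writers.put(baseName, writer);
            }
            writer.write(value);
            writer.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Plays the role of close() in cleanup: flushes and closes every writer.
    @Override
    public void close() {
        try {
            for (BufferedWriter writer : writers.values()) {
                writer.close();
            }
            writers.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes three records under two base names and reads one file back.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("keyed-writer-demo");
            try (KeyedFileWriter out = new KeyedFileWriter(dir, 0)) {
                out.write("first record", "alpha");
                out.write("second record", "beta");
                out.write("third record", "alpha");
            }
            return Files.readAllLines(dir.resolve("alpha-r-00000"), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [first record, third record]
    }
}
```

Records written under the same base name end up in the same file, and skipping the close call would risk leaving buffered records unwritten, which is the reason for step 4.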

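Note: the third argument of multipleOutputs.write(key, value, baseOutputPath) gives the location of the output relative to the user-specified output directory. If baseOutputPath does not contain the file separator "/", the output file is named baseOutputPath-r-nnnnn (that is, name-r-nnnnn). If it does contain "/", for example baseOutputPath = "029070-99999/1901/part", then the output file is ...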
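How a "/" in baseOutputPath turns into subdirectories can be sketched with java.nio paths. This is an illustration under stated assumptions rather than Hadoop's own code: BaseOutputPaths and resolve are hypothetical names, and the output directory "/out" and the base path "stationA/1901/part" are made-up values.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class BaseOutputPaths {
    // Appends "-r-nnnnn" to baseOutputPath and resolves it against outputDir.
    static Path resolve(String outputDir, String baseOutputPath, int partition) {
        String relative = String.format(Locale.ROOT, "%s-r-%05d", baseOutputPath, partition);
        return Paths.get(outputDir).resolve(relative);
    }

    public static void main(String[] args) {
        // No separator: the file sits directly in the output directory.
        Path flat = resolve("/out", "summary", 0);
        // With separators: everything before the last "/" becomes subdirectories.
        Path nested = resolve("/out", "stationA/1901/part", 0);
        System.out.println(flat.getParent() + " | " + flat.getFileName());
        System.out.println(nested.getParent() + " | " + nested.getFileName());
    }
}
```

Only the last path component receives the -r-nnnnn suffix; the leading components select the directory beneath the job's output directory.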

Copyright notice: unless otherwise stated, all articles are original to this site.

When reposting, please credit the source: http://www.heiqu.com/d44f79a3f1aa995c2ef429859ea8f980.html
