Date: 2020-10-09  Category: Programmer's Life

Hadoop: sampler with multiple input paths, sampling only one file (MultipleInputs + getSample(conf.getInputFormat()))

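I had been working on the sampler and thought that job was done, but now I have run into another problem. My input consists of two files, and the design requires sampling only the larger one for now (eventually the two files will be sampled separately). When there is only one input file and it is the one being sampled, the code that uses the sampler is: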

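import java.net.URI;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;

// conf is the job's JobConf (old "mapred" API); args[0] is the input path.
Path input = new Path(args[0]);
input = input.makeQualified(input.getFileSystem(conf));

// freq = 0.4: target fraction of records kept as samples; 5: maximum number of input splits sampled.
InputSampler.IntervalSampler<Text, NullWritable> sampler = new InputSampler.IntervalSampler<Text, NullWritable>(0.4, 5);

// Method prototype: K[] getSample(InputFormat<K,V> inf, JobConf job)

String skewuri_out = args[2] + "/sample_list"; // holds the sampling result, not the partitioning result
FileSystem fs = FileSystem.get(URI.create(skewuri_out), conf);
FSDataOutputStream fs_out = fs.create(new Path(skewuri_out));

final InputFormat inf = conf.getInputFormat(); // the InputFormat configured on the JobConf
// The sampling result to be written out. It has to be declared as Object[];
// declaring it as IntWritable[] did not work, and I don't know why.
Object[] p = sampler.getSample(inf, conf);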
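A possible explanation for the Object[] puzzle, offered as a sketch rather than a reading of the Hadoop source: if a sampler collects its keys in a list and returns them with an unchecked cast of toArray(), the array that comes back is really an Object[], so assigning it to a more specific array type fails at runtime. The stdlib-only imitation below also shows the interval rule (keep a key whenever the kept/seen ratio has fallen below freq). IntervalSampleSketch and its getSample are hypothetical names, the maximum-splits parameter is not modelled, and nothing here calls Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stdlib-only imitation of an interval sampler; this is NOT Hadoop code.
class IntervalSampleSketch {

    // Keeps a key whenever the kept/seen ratio has dropped below freq,
    // then returns the kept keys the way a list-backed sampler might.
    @SuppressWarnings("unchecked")
    static <K> K[] getSample(List<K> keys, double freq) {
        List<K> samples = new ArrayList<>();
        long seen = 0;
        long kept = 0;
        for (K key : keys) {
            ++seen;
            if ((double) kept / seen < freq) {
                ++kept;
                samples.add(key);
            }
        }
        // toArray() always creates an Object[]; the unchecked cast does not
        // change the runtime type of the array.
        return (K[]) samples.toArray();
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            keys.add("k" + i);
        }

        // Works: the variable type matches the real array type.
        Object[] sample = getSample(keys, 0.4);
        System.out.println(Arrays.toString(sample));

        // Fails at runtime: the compiler inserts a cast to String[].
        try {
            String[] typed = getSample(keys, 0.4);
            System.out.println(typed.length);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: the array is really an Object[]");
        }
    }
}
```

With ten keys and freq = 0.4 the sketch keeps k0, k2, k5 and k7. If the real sampler behaves the same way, keeping the result as Object[] and casting individual elements when they are used is the safe pattern.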

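But this is where the problem starts. Suppose I write two Mapper classes, Map1class and Map2class, each handling data from a different input path. For now both inputs are declared with the same input format, so this can be done with MultipleInputs:

MultipleInputs.addInputPath(conf, new Path(args[0]), Definemyself.class, Map1class.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), Definemyself.class, Map2class.class);

// Definemyself.class is my own class; it extends FileInputFormat and implements the WritableComparable interface.

// Extending FileInputFormat is needed for sampling. Implementing WritableComparable is because I want to serialize the data as a whole during the join. I can't explain serialization very well myself; think of it as something like a struct in C: the data is treated as a single unit and can be printed with toString().

The prototype is: public class Definemyself extends FileInputFormat<Text,Text> implements WritableComparable{...}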
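To make "serializing the data as a whole" more concrete, here is a minimal stdlib-only sketch of the contract that Hadoop's WritableComparable describes: write(DataOutput) emits every field in a fixed order, readFields(DataInput) restores them in the same order, and compareTo defines the sort order. JoinRecord, its two fields and roundTrip are hypothetical names for illustration; in a real job the class would implement org.apache.hadoop.io.WritableComparable instead of plain Comparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type that mirrors the WritableComparable contract.
class JoinRecord implements Comparable<JoinRecord> {
    String joinKey = "";
    String payload = "";

    JoinRecord() { }

    JoinRecord(String joinKey, String payload) {
        this.joinKey = joinKey;
        this.payload = payload;
    }

    // Serialize all fields, in a fixed order, as one unit.
    void write(DataOutput out) throws IOException {
        out.writeUTF(joinKey);
        out.writeUTF(payload);
    }

    // Restore the fields in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        joinKey = in.readUTF();
        payload = in.readUTF();
    }

    // Sort by join key first, then by payload.
    @Override
    public int compareTo(JoinRecord other) {
        int c = joinKey.compareTo(other.joinKey);
        return c != 0 ? c : payload.compareTo(other.payload);
    }

    @Override
    public String toString() {
        return joinKey + "\t" + payload;
    }

    // Writes a record to a byte array and reads it back into a fresh object.
    static JoinRecord roundTrip(JoinRecord record) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            record.write(new DataOutputStream(bytes));
            JoinRecord copy = new JoinRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        JoinRecord copy = roundTrip(new JoinRecord("user42", "big-file row"));
        System.out.println(copy);
    }
}
```

In most Hadoop jobs this interface sits on the key or value type rather than on the InputFormat, so a separate record class like the one sketched here may be a cleaner home for it than Definemyself.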

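This problem has been bothering me since last night. Last week I was dreaming about sampling, and this week I am still dreaming about sampling. At noon I went out for lunch with my husband so that we could talk the problem through properly. My theory is this: since the framework provides MultipleInputs, and JobConf also lets you call getInputFormat(), there must be a way to use the two together. Otherwise the design would contradict itself, and only a fool would build a system like that.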

When reposting, please credit the source: http://www.heiqu.com/6ea1ca89c8e541ff85724d6974344b5e.html
