我们知道,任何一个工程项目,最重要的是三个部分:输入,中间处理,输出。今天我们来深入的了解一下我们熟知的Hadoop系统中,输入是如何输入的?
在hadoop中,输入数据都是通过对应的InputFormat类和RecordReader类来实现的,其中InputFormat来实现将对应输入文件进行分片,RecordReader类将对应分片中的数据读取进来。具体的方式如下:
(1)InputFormat类是一个接口。
public interface InputFormat<K, V> {
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
RecordReader<K, V> getRecordReader(InputSplit split,
JobConf job,
Reporter reporter) throws IOException;
}
(2)FileInputFormat类实现了InputFormat接口。该类实现了getSplits方法,但是它也没有实现对应的getRecordReader方法。也就是说FileInputFormat还是一个抽象类。这里需要说明的一个问题是,FileInputFormat用isSplitable方法来指定对应的文件是否支持数据的切分。默认情况下都是支持的,一般子类都需要重新实现它。
public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
FileStatus[] files = listStatus(job);
// Save the number of input files in the job-conf
job.setLong(NUM_INPUT_FILES, files.length);
long totalSize = 0; // compute total size
for (FileStatus file: files) { // check we have valid files
if (file.isDir()) {
throw new IOException("Not a file: "+ file.getPath());
}
totalSize += file.getLen();
}
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong("mapred.min.split.size", 1),
minSplitSize);
// generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
Path path = file.getPath();
FileSystem fs = path.getFileSystem(job);
long length = file.getLen();
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
if ((length != 0) && isSplitable(fs, path)) {
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(goalSize, minSize, blockSize);
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
String[] splitHosts = getSplitHosts(blkLocations,
length-bytesRemaining, splitSize, clusterMap);
splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
splitHosts));
bytesRemaining -= splitSize;
}
if (bytesRemaining != 0) {
splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
blkLocations[blkLocations.length-1].getHosts()));
}
} else if (length != 0) {
String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
splits.add(new FileSplit(path, 0, length, splitHosts));
} else {
//Create empty hosts array for zero length files
splits.add(new FileSplit(path, 0, length, new String[0]));
}
}
LOG.debug("Total # of splits: " + splits.size());
return splits.toArray(new FileSplit[splits.size()]);
}
//该方法是用来判断是否可以进行数据的切分
protected boolean isSplitable(FileSystem fs, Path filename) {
return true;
}
//但是它也没有实现对应的getRecordReader方法。也就是说FileInputFormat还是一个抽象类。
public abstract RecordReader<K, V> getRecordReader(InputSplit split,
JobConf job,