A Source-Code Walkthrough of Hadoop Data Input

Date: 2020-06-18 Category: Programmer's Life

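Any engineering project has three essential parts: input, intermediate processing, and output. Today we take a closer look at the first of these in the familiar Hadoop system: how does input data actually get read?

In Hadoop, input data is always read through a pair of classes, an InputFormat and a RecordReader. The InputFormat divides the input files into splits, and the RecordReader reads the data inside each split. The details are as follows: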

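(1) InputFormat is an interface. The version shown here belongs to the old org.apache.hadoop.mapred API, which is why it takes JobConf and Reporter arguments.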

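public interface InputFormat<K, V> {

    // Divide the job's input into splits; numSplits is the requested number of splits.
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    // Create the reader that turns one split into key/value records.
    RecordReader<K, V> getRecordReader(InputSplit split,
                                       JobConf job,
                                       Reporter reporter) throws IOException;
}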
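To make the division of labour concrete, here is a minimal, self-contained sketch of the same two-step contract. It does not use Hadoop at all: InputFormatSketch, LineSplit, getSplits and readRecords are hypothetical names invented for illustration, and the input is an in-memory array of lines rather than files on HDFS. The only point is the shape of the flow: first cut the input into ranges, then read each range independently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the InputFormat/RecordReader pair.
// None of these names come from Hadoop.
class InputFormatSketch {

    /** A split only describes a range [start, start + length); it holds no data. */
    static final class LineSplit {
        final int start;
        final int length;

        LineSplit(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Step 1 (the InputFormat role): cut the input into contiguous ranges. */
    static List<LineSplit> getSplits(String[] lines, int numSplits) {
        List<LineSplit> splits = new ArrayList<>();
        int n = Math.max(1, numSplits);
        int perSplit = (lines.length + n - 1) / n; // ceiling division
        for (int start = 0; start < lines.length; start += perSplit) {
            splits.add(new LineSplit(start, Math.min(perSplit, lines.length - start)));
        }
        return splits;
    }

    /** Step 2 (the RecordReader role): read one split as "index<TAB>text" records. */
    static List<String> readRecords(String[] lines, LineSplit split) {
        List<String> records = new ArrayList<>();
        for (int i = split.start; i < split.start + split.length; i++) {
            records.add(i + "\t" + lines[i]);
        }
        return records;
    }

    public static void main(String[] args) {
        String[] input = {"a", "b", "c", "d", "e"};
        for (LineSplit split : getSplits(input, 2)) {
            System.out.println("split@" + split.start + " -> " + readRecords(input, split));
        }
    }
}
```

In a real job each split would be handed to its own map task, which is why a split merely describes a range and the reading is deferred to a separate object.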

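(2) The FileInputFormat class implements the InputFormat interface. It implements getSplits but not getRecordReader, so FileInputFormat is still an abstract class. One point worth noting: FileInputFormat uses the isSplitable method to decide whether a given file may be split. The default answer is yes for every file, so a subclass whose format cannot be read from an arbitrary offset (certain compressed files, for example) needs to override it.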
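The override pattern described above can be sketched without any Hadoop dependency. Everything in this block is hypothetical: BaseFormat and GzipAwareFormat are invented names, isSplitable takes a plain file name instead of a FileSystem and Path, and countSplits is a deliberately simplified stand-in for getSplits that ignores the slop factor and host locations. The ".gz" rule is just an illustrative policy, chosen because a gzip stream cannot be decoded starting from the middle.

```java
// Hypothetical sketch of the isSplitable hook; not Hadoop code.
class SplitabilitySketch {

    /** Base class: like the default described above, every file may be split. */
    static class BaseFormat {
        protected boolean isSplitable(String filename) {
            return true;
        }

        /** Simplified split count: one split if the file is empty or unsplittable. */
        int countSplits(String filename, long length, long splitSize) {
            if (length == 0 || !isSplitable(filename)) {
                return 1;
            }
            return (int) ((length + splitSize - 1) / splitSize); // ceiling division
        }
    }

    /** Subclass: refuses to split files that look gzip-compressed. */
    static class GzipAwareFormat extends BaseFormat {
        @Override
        protected boolean isSplitable(String filename) {
            return !filename.toLowerCase().endsWith(".gz");
        }
    }

    public static void main(String[] args) {
        BaseFormat format = new GzipAwareFormat();
        System.out.println(format.countSplits("logs.txt", 300, 128));
        System.out.println(format.countSplits("logs.gz", 300, 128));
    }
}
```

The design choice is that the splitting loop lives once in the base class, while each format only answers the yes-or-no question about its own files.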

public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {

    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        FileStatus[] files = listStatus(job);

        // Save the number of input files in the job-conf
        job.setLong(NUM_INPUT_FILES, files.length);
        long totalSize = 0;                      // compute total size
        for (FileStatus file : files) {          // check we have valid files
            if (file.isDir()) {
                throw new IOException("Not a file: " + file.getPath());
            }
            totalSize += file.getLen();
        }

        // Bytes per split if the total input were divided evenly into numSplits pieces
        long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
        long minSize = Math.max(job.getLong("mapred.min.split.size", 1),
                                minSplitSize);

        // generate splits
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
        NetworkTopology clusterMap = new NetworkTopology();
        for (FileStatus file : files) {
            Path path = file.getPath();
            FileSystem fs = path.getFileSystem(job);
            long length = file.getLen();
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
            if ((length != 0) && isSplitable(fs, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(goalSize, minSize, blockSize);

                // Emit full-size splits while more than SPLIT_SLOP split sizes remain
                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    String[] splitHosts = getSplitHosts(blkLocations,
                            length - bytesRemaining, splitSize, clusterMap);
                    splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                            splitHosts));
                    bytesRemaining -= splitSize;
                }

                // Whatever is left becomes the final split of this file
                if (bytesRemaining != 0) {
                    splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkLocations.length - 1].getHosts()));
                }
            } else if (length != 0) {
                // Non-splittable file: one split covering the whole file
                String[] splitHosts = getSplitHosts(blkLocations, 0, length, clusterMap);
                splits.add(new FileSplit(path, 0, length, splitHosts));
            } else {
                // Create empty hosts array for zero length files
                splits.add(new FileSplit(path, 0, length, new String[0]));
            }
        }
        LOG.debug("Total # of splits: " + splits.size());
        return splits.toArray(new FileSplit[splits.size()]);
    }

    // Decides whether the given file may be split
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return true;
    }

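    // getRecordReader is not implemented here, which is why FileInputFormat is still an abstract class.
    public abstract RecordReader<K, V> getRecordReader(InputSplit split,
                                                       JobConf job,
                                                       Reporter reporter) throws IOException;

    // (other members, such as computeSplitSize and getSplitHosts, are not shown)
}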
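The arithmetic inside the while loop of getSplits is easy to misread, so the following sketch replays it for a single splittable file. It assumes two details that the listing references but does not show: that computeSplitSize returns max(minSize, min(goalSize, blockSize)), and that SPLIT_SLOP is 1.1. That is how we recall them from the Hadoop source of this era; treat both as assumptions and check your own version. SplitMathSketch and splitFile are hypothetical names, and host selection is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical replay of the split-size arithmetic; not Hadoop code.
class SplitMathSketch {

    // Assumed value: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Assumed formula: cap the goal at the block size, but never go below minSize.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Returns {offset, length} pairs for one splittable file. */
    static List<long[]> splitFile(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Same loop condition as in getSplits: keep cutting full-size splits
        // while more than SPLIT_SLOP split sizes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // The remainder, if any, becomes the final split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(300 * mb, 1, 128 * mb);
        for (long[] s : splitFile(300 * mb, splitSize)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

Under these assumptions, with a 128 MB split size a 300 MB file yields splits of 128, 128 and 44 MB, whereas a 140 MB file stays in one piece because 140/128 is below the slop factor. The slop exists to avoid creating a tiny trailing split.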

