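To prevent rollFSImage() from being called by mistake, the system introduces the state CheckpointStates.UPLOAD_DONE. The reason is as follows:
rollFSImage rolls the two files fsimage.ckpt and edits.new into fsimage and edits, respectively. Both files are guaranteed to exist only once the UPLOAD_DONE state has been reached, so rollFSImage begins by verifying that the current checkpoint state is UPLOAD_DONE.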
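The guard described above can be sketched roughly as follows. This is a minimal illustration, not the Hadoop implementation: the class name CheckpointGuardSketch, the accessor methods, and every state other than UPLOAD_DONE are assumptions made for the example, and the actual file renames are left out.

```java
import java.io.IOException;

public class CheckpointGuardSketch {
    // Only UPLOAD_DONE is named in the text; the other states are assumed
    // here so that the example has something to transition from.
    public enum CheckpointStates { START, ROLLED_EDITS, UPLOAD_START, UPLOAD_DONE }

    private CheckpointStates ckptState = CheckpointStates.START;

    public void setCheckpointState(CheckpointStates state) {
        ckptState = state;
    }

    public CheckpointStates getCheckpointState() {
        return ckptState;
    }

    // Refuses to roll unless the merged image has been fully uploaded.
    public void rollFSImage() throws IOException {
        if (ckptState != CheckpointStates.UPLOAD_DONE) {
            throw new IOException("Cannot roll fsimage before the merged image "
                + "is uploaded; current state is " + ckptState);
        }
        // The renames fsimage.ckpt --> fsimage and edits.new --> edits
        // would happen here; they are omitted in this sketch.
        ckptState = CheckpointStates.START;
    }
}
```

With this shape, a call to rollFSImage() made in any state other than UPLOAD_DONE fails fast instead of renaming files that may not exist yet.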
State transition diagram for the checkpoint (hot backup) process on the NN:
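In the "checkpoint state diagram on the NN", the namenode may fail during rollFSImage, so that the renames fsimage.ckpt --> fsimage and edits.new --> edits do not complete. When the directory structure is later read from the fsimage file, the recoverInterruptedCheckpoint method is called to detect this situation and recover from it: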
boolean recoverInterruptedCheckpoint(StorageDirectory nameSD,
                                     StorageDirectory editsSD)
                                     throws IOException {
  boolean needToSave = false;
  File curFile = getImageFile(nameSD, NameNodeFile.IMAGE);
  File ckptFile = getImageFile(nameSD, NameNodeFile.IMAGE_NEW);
  // At this point the state is already UPLOAD_DONE.
  //
  // If we were in the midst of a checkpoint
  //
  if (ckptFile.exists()) {
    // fsimage.ckpt exists, so rollFSImage did not complete.
    needToSave = true;
    if (getImageFile(editsSD, NameNodeFile.EDITS_NEW).exists()) {
      // edits.new exists, so rollFSImage was never reached. We cannot tell
      // whether fsimage.ckpt was uploaded completely, so discard it.
      //
      // checkpointing might have uploaded a new
      // merged image, but we discard it here because we are
      // not sure whether the entire merged image was uploaded
      // before the namenode crashed.
      //
      if (!ckptFile.delete()) {
        throw new IOException("Unable to delete " + ckptFile);
      }
    } else {
      // edits.new does not exist. rollFSImage renames edits.new --> edits
      // first, so fsimage.ckpt still being present means the rename
      // fsimage.ckpt --> fsimage did not happen. Renaming fsimage.ckpt
      // again is all that is needed.
      //
      // checkpointing was in progress when the namenode
      // shutdown. The fsimage.ckpt was created and the edits.new
      // file was moved to edits. We complete that checkpoint by
      // moving fsimage.ckpt to fsimage. There is no need to
      // update the fstime file here. renameTo fails on Windows
      // if the destination file already exists.
      //
      if (!ckptFile.renameTo(curFile)) {
        if (!curFile.delete())
          LOG.warn("Unable to delete dir " + curFile + " before rename");
        if (!ckptFile.renameTo(curFile)) {
          throw new IOException("Unable to rename " + ckptFile +
                                " to " + curFile);
        }
      }
    }
  }
  return needToSave;
}
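
To make the two recovery branches easier to try out, here is a small standalone sketch that applies the same decision to plain files in a single directory. It is only an illustration under stated assumptions: the class RecoverySketch and its recover method are hypothetical, it does not use StorageDirectory or NameNodeFile, and it assumes the image and edits files live in the same directory under the names fsimage, fsimage.ckpt and edits.new.

```java
import java.io.File;
import java.io.IOException;

public class RecoverySketch {
    // Returns true when an interrupted checkpoint was detected,
    // mirroring the needToSave flag in the method above.
    public static boolean recover(File dir) throws IOException {
        File curFile = new File(dir, "fsimage");
        File ckptFile = new File(dir, "fsimage.ckpt");
        File editsNew = new File(dir, "edits.new");

        if (!ckptFile.exists()) {
            return false; // no checkpoint was in progress
        }
        if (editsNew.exists()) {
            // The roll never started, so the uploaded image may be
            // incomplete: discard it.
            if (!ckptFile.delete()) {
                throw new IOException("Unable to delete " + ckptFile);
            }
        } else {
            // edits.new was already rolled to edits, so finish the
            // checkpoint by completing the image rename.
            if (!ckptFile.renameTo(curFile)) {
                // Some platforms refuse to rename onto an existing file;
                // remove the old image and retry once.
                curFile.delete();
                if (!ckptFile.renameTo(curFile)) {
                    throw new IOException("Unable to rename " + ckptFile
                        + " to " + curFile);
                }
            }
        }
        return true;
    }
}
```

Calling RecoverySketch.recover on a directory that holds both fsimage.ckpt and edits.new removes fsimage.ckpt, while a directory that holds fsimage.ckpt but no edits.new ends up with fsimage.ckpt renamed to fsimage. In both cases the method returns true, matching the needToSave flag.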