If the trash configuration is enabled, files removed through the FS shell are not immediately removed from HDFS. Instead, HDFS moves them to a trash directory (each user has their own trash directory under /user/<username>/.Trash). A file can be restored quickly as long as it remains in trash.
The most recently deleted files are moved to the current trash directory (/user/<username>/.Trash/Current). At a configurable interval, HDFS creates checkpoints of the files in the current trash directory (under /user/<username>/.Trash/<date>) and deletes old checkpoints once they expire. See the FS shell documentation for details on trash checkpointing.
After a file's lifetime in trash expires, the NameNode deletes the file from the HDFS namespace. Deleting the file frees the blocks associated with it. Note that there can be an appreciable delay between the time a user deletes a file and the time the corresponding free space appears in HDFS.
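How long a deleted file stays in trash is governed by the `fs.trash.interval` property, given in minutes, where 0 disables trash. The sketch below reads that value with `hdfs getconf` and describes it. The helper name `describe_trash_interval` is an assumption made for illustration and is not part of Hadoop.

```shell
#!/bin/sh
# describe_trash_interval MINUTES
# Hypothetical helper: describes an fs.trash.interval value (in minutes).
describe_trash_interval() {
    minutes=$1
    if [ "$minutes" -eq 0 ]; then
        echo "trash disabled"
    else
        echo "deleted files kept for $minutes minutes ($((minutes / 60))h $((minutes % 60))m)"
    fi
}

# Query the client configuration only when the hdfs command is available.
if command -v hdfs >/dev/null 2>&1; then
    interval=$(hdfs getconf -confKey fs.trash.interval 2>/dev/null) \
        && describe_trash_interval "$interval"
fi
```

If the value reported is 0, `hadoop fs -rm` deletes immediately and the trash behaviour described here does not apply.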
The following example shows how files are deleted from HDFS through the FS shell. We create two entries, test1 and test2 (directories in this case, since they are made with mkdir), under the directory delete.
$ hadoop fs -mkdir -p delete/test1
$ hadoop fs -mkdir -p delete/test2
$ hadoop fs -ls delete/
Found 2 items
drwxr-xr-x - hadoop hadoop 0 2015-05-08 12:39 delete/test1
drwxr-xr-x - hadoop hadoop 0 2015-05-08 12:40 delete/test2
We now remove test1. The output below shows that it has been moved to the Trash directory.
$ hadoop fs -rm -r delete/test1
Moved: hdfs://localhost:8020/user/hadoop/delete/test1 to trash at: hdfs://localhost:8020/user/hadoop/.Trash/Current
Next we remove test2 with the -skipTrash option, which bypasses Trash: test2 is removed from HDFS outright instead of being moved to Trash.
$ hadoop fs -rm -r -skipTrash delete/test2
Deleted delete/test2
Listing the Trash directory shows that it contains only test1.
$ hadoop fs -ls .Trash/Current/user/hadoop/delete/
Found 1 items
drwxr-xr-x - hadoop hadoop 0 2015-05-08 12:39 .Trash/Current/user/hadoop/delete/test1
So test1 went to Trash, while test2 was deleted permanently.
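Because test1 is still in Trash, it can be brought back by moving it out of the trash directory. The sketch below derives the trash location from the pattern seen in the listing above (the original absolute path appended to /user/<username>/.Trash/Current) and prints the matching `hadoop fs -mv` command rather than running it. The helper names `trash_location` and `restore_command` are assumptions made for illustration.

```shell
#!/bin/sh
# trash_location USER ORIGINAL_ABSOLUTE_PATH
# Hypothetical helper: where a deleted path ends up in the user's current trash.
trash_location() {
    echo "/user/$1/.Trash/Current$2"
}

# restore_command USER ORIGINAL_ABSOLUTE_PATH
# Hypothetical helper: prints the FS shell command that moves the path back.
restore_command() {
    echo "hadoop fs -mv $(trash_location "$1" "$2") $2"
}

# Print the command that would restore test1 for user hadoop.
restore_command hadoop /user/hadoop/delete/test1
```

This only works while the entry is still under .Trash; once its checkpoint has expired, or if it was removed with -skipTrash as test2 was, there is nothing left to move back.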
Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects the excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks, and the corresponding free space appears in the cluster. As with file deletion, there may be a delay between the completion of the setReplication API call and the appearance of free space in the cluster.
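From the FS shell, the replication factor can be lowered with `hadoop fs -setrep`, and `hadoop fs -stat %r` reports the current factor of a file. As a rough sketch of how much raw space should eventually be reclaimed, the hypothetical helper `freed_bytes` below multiplies the file size by the number of replicas removed. It is an estimate only, and the space appears after the DataNodes have processed the deletions, not when the command returns.

```shell
#!/bin/sh
# freed_bytes FILE_SIZE_BYTES OLD_FACTOR NEW_FACTOR
# Hypothetical helper: approximate raw space reclaimed once excess replicas are removed.
freed_bytes() {
    echo $(( $1 * ($2 - $3) ))
}

# A 128 MB file going from 3 replicas to 2 frees roughly one copy of the file.
freed_bytes 134217728 3 2   # prints 134217728

# The commands one might run against a cluster (some/file is a placeholder path):
#   hadoop fs -stat %r some/file     # show the current replication factor
#   hadoop fs -setrep 2 some/file    # request a replication factor of 2
```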
References
Hadoop JavaDoc API.