Assumptions and Goals
Hardware Failure
Streaming Data Access
Large Data Sets
Simple Coherency Model
"Moving Computation is Cheaper than Moving Data"
Portability Across Heterogeneous Hardware and Software Platforms
The File System Namespace
Data Replication
Replica Placement: The First Baby Steps
Replica Selection
Safemode
The Persistence of File System Metadata
The Communication Protocols
Robustness
Data Disk Failure, Heartbeats and Re-Replication
Cluster Rebalancing
Data Integrity
Metadata Disk Failure
Snapshots
Data Organization
Data Blocks
Staging
Replication Pipelining
Accessibility
FS Shell
DFSAdmin
Browser Interface
Space Reclamation
File Deletes and Undeletes
Decrease Replication Factor
Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is https://hadoop.apache.org/.
Assumptions and Goals

Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
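To make the recovery goal concrete, here is a minimal sketch of heartbeat-based failure detection, the general mechanism by which a master notices dead storage nodes. It is a toy under stated assumptions, not HDFS's actual NameNode logic; every class and method name in it is made up, and the timeout merely approximates HDFS's default of roughly ten minutes.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy sketch of heartbeat-based failure detection: the master records the
    // last time each storage node reported in, and declares a node dead once
    // it has been silent past a timeout, at which point recovery (such as
    // re-replicating that node's blocks) can begin. All names here are
    // hypothetical; this is not HDFS's NameNode code.
    public class HeartbeatMonitor {
        // Roughly in line with HDFS's default of about ten minutes.
        private static final long TIMEOUT_MS = 10L * 60 * 1000;
        private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

        // Called each time a storage node sends a heartbeat.
        public void onHeartbeat(String nodeId) {
            lastSeen.put(nodeId, System.currentTimeMillis());
        }

        // Polled periodically by the master's monitor thread.
        public boolean isDead(String nodeId) {
            Long last = lastSeen.get(nodeId);
            return last == null || System.currentTimeMillis() - last > TIMEOUT_MS;
        }
    }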
Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.
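Concretely, streaming access means opening a file and scanning it once, start to finish. The sketch below uses Hadoop's actual org.apache.hadoop.fs client API, but the file path is a made-up example and the Configuration is assumed to have fs.defaultFS pointing at an HDFS cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamingRead {
        public static void main(String[] args) throws Exception {
            // Assumes fs.defaultFS points at the cluster's NameNode.
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataInputStream in = fs.open(new Path("/data/events.log"))) { // hypothetical path
                byte[] buf = new byte[128 * 1024];
                long total = 0;
                int n;
                // One long sequential scan from front to back, the access
                // pattern HDFS is optimized for.
                while ((n = in.read(buf)) != -1) {
                    total += n; // process each chunk here
                }
                System.out.println("Read " + total + " bytes");
            }
        }
    }

Because the read is one large sequential pass, the client spends its time moving bytes rather than seeking, which is the throughput-over-latency trade this section describes.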
Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
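As one illustration of the large-file tuning this implies, Hadoop's FileSystem.create overload lets a writer choose a per-file block size and replication factor instead of inheriting the cluster defaults (dfs.blocksize is 128 MB in recent releases). The sketch below is only an example under those assumptions, with a hypothetical path and arbitrary sizes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LargeFileWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                long blockSize = 256L * 1024 * 1024; // 256 MB blocks for a very large file
                short replication = 3;               // HDFS's default replication factor
                try (FSDataOutputStream out = fs.create(
                        new Path("/data/big.bin"),   // hypothetical path
                        true,                        // overwrite if present
                        conf.getInt("io.file.buffer.size", 4096),
                        replication,
                        blockSize)) {
                    out.write(new byte[]{1, 2, 3}); // application data would go here
                }
            }
        }
    }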
Simple Coherency Model