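The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge of HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories as needed. Creating all local files in one directory is not optimal, because the local file system might not efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans its local file system, generates a list of all HDFS data blocks that correspond to these local files, and sends that list to the NameNode. This list is the Blockreport.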
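The startup scan described above can be sketched as a recursive directory walk. This is a minimal illustration, not the DataNode's actual code: the function name `build_block_report` is hypothetical, and the sketch assumes block files are named `blk_<numeric id>` and that every other file (for example checksum metadata) should be ignored.

```python
import os
import re
from typing import List, Tuple

# Assumption: a block file is named "blk_<numeric id>"; anything else
# (such as "blk_<id>_<stamp>.meta") is not a block and is skipped.
BLOCK_FILE = re.compile(r"^blk_(\d+)$")


def build_block_report(data_dir: str) -> List[Tuple[int, int]]:
    """Walk data_dir recursively and return sorted (block_id, length_in_bytes) pairs."""
    report = []
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        for name in filenames:
            match = BLOCK_FILE.match(name)
            if match:
                path = os.path.join(dirpath, name)
                report.append((int(match.group(1)), os.path.getsize(path)))
    # Sorting makes the report independent of directory traversal order.
    return sorted(report)
```

Because the walk is recursive, the report does not depend on how the heuristic happened to spread block files across subdirectories.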
The Communication Protocols
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine and speaks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs; it only responds to RPC requests issued by DataNodes or clients.
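A server that never initiates calls can still direct its workers by placing instructions in the replies to calls the workers make. The sketch below is a toy in-process model of that idea, not Hadoop's RPC layer; `ToyNameNode`, `queue_command`, and `heartbeat` are hypothetical names chosen for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class ToyNameNode:
    """Only answers calls; it has no method that dials out to a DataNode."""

    def __init__(self) -> None:
        # Commands waiting for each DataNode, delivered with its next reply.
        self._pending: Dict[str, Deque[str]] = defaultdict(deque)

    def queue_command(self, datanode_id: str, command: str) -> None:
        """Record work for a DataNode without contacting it."""
        self._pending[datanode_id].append(command)

    def heartbeat(self, datanode_id: str) -> List[str]:
        """Call issued by a DataNode; the reply carries any queued commands."""
        commands = list(self._pending[datanode_id])
        self._pending[datanode_id].clear()
        return commands
```

In this model a queued command reaches `dn1` only when `dn1` itself calls `heartbeat("dn1")`; the NameNode side never opens a connection.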
Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures, and network partitions. (A network partition is the situation that arises when all network connections between two groups of machines in the system fail at the same time.)
Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The need for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
The time-out for marking DataNodes dead is conservatively long (over 10 minutes by default) in order to avoid a replication storm caused by state flapping of DataNodes. For performance-sensitive workloads, users can configure a shorter interval after which DataNodes are marked as stale, and have reads and/or writes avoid stale nodes.
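The heartbeat bookkeeping described above can be sketched as two small functions: one classifies each DataNode by how long it has been silent, the other finds blocks whose usable replica count has fallen below the target. This is an illustrative sketch only. The function names are hypothetical, and the thresholds are assumed values picked to match the text (a short stale interval and a dead time-out just over 10 minutes); check your cluster's configuration for the real ones.

```python
from typing import Dict, Set

# Assumed thresholds for illustration, in seconds.
STALE_AFTER_S = 30    # shorter, optional "stale" threshold
DEAD_AFTER_S = 630    # conservative "dead" threshold, over 10 minutes


def classify(last_heartbeat: Dict[str, float], now: float) -> Dict[str, str]:
    """Label each DataNode 'live', 'stale', or 'dead' from its last heartbeat time."""
    states = {}
    for node, seen in last_heartbeat.items():
        silence = now - seen
        if silence > DEAD_AFTER_S:
            states[node] = "dead"
        elif silence > STALE_AFTER_S:
            states[node] = "stale"
        else:
            states[node] = "live"
    return states


def under_replicated(
    block_locations: Dict[str, Set[str]],
    target: Dict[str, int],
    states: Dict[str, str],
) -> Dict[str, int]:
    """Return {block: missing replica count}; replicas on dead or unknown nodes do not count."""
    missing = {}
    for block, nodes in block_locations.items():
        usable = [n for n in nodes if states.get(n) in ("live", "stale")]
        shortfall = target[block] - len(usable)
        if shortfall > 0:
            missing[block] = shortfall
    return missing
```

A stale node still counts as holding its replicas here; it is only a hint to avoid that node for reads and writes. The same `under_replicated` check also covers the case where a file's replication factor is raised, since that simply increases the entry in `target`.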
Cluster Rebalancing