The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that is commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.
The NameNode determines the rack ID each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster, which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
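The simple "one replica per rack" policy can be sketched as follows. This is only an illustration, not HDFS code: the `TOPOLOGY` mapping, the node and rack names, and the function names are all hypothetical, and the mapping stands in for whatever rack resolution the NameNode actually performs.

```python
import random
from collections import defaultdict

# Hypothetical topology: DataNode name -> rack ID (illustrative values only).
TOPOLOGY = {
    "dn1": "/rack1", "dn2": "/rack1",
    "dn3": "/rack2", "dn4": "/rack2",
    "dn5": "/rack3", "dn6": "/rack3",
}

def nodes_by_rack(topology):
    """Group DataNode names by the rack they belong to."""
    racks = defaultdict(list)
    for node, rack in topology.items():
        racks[rack].append(node)
    return dict(racks)

def place_on_unique_racks(topology, replication, rng=random):
    """Pick one node on each of `replication` distinct racks."""
    racks = nodes_by_rack(topology)
    if replication > len(racks):
        raise ValueError("not enough racks for one replica per rack")
    chosen_racks = rng.sample(sorted(racks), replication)
    return [rng.choice(racks[rack]) for rack in chosen_racks]

if __name__ == "__main__":
    targets = place_on_unique_racks(TOPOLOGY, 3)
    print(targets, [TOPOLOGY[n] for n in targets])
```

Because every replica lands on a different rack, losing any single rack leaves the other replicas intact, but each block has to cross rack boundaries on its way to every additional replica.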
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure, so this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file are not evenly distributed across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
The current, default replica placement policy described here is a work in progress.
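The three-replica placement described above can be sketched in a few lines. This is a minimal illustration that follows the wording of the text, not the actual HDFS implementation. It assumes the writer itself runs on a DataNode and receives the first replica, and that replicas are written as a chain from one target to the next; `TOPOLOGY` and both function names are hypothetical.

```python
import random

# Hypothetical topology: DataNode name -> rack ID (illustrative values only).
TOPOLOGY = {
    "dn1": "/rack1", "dn2": "/rack1",
    "dn3": "/rack2", "dn4": "/rack2",
    "dn5": "/rack3", "dn6": "/rack3",
}

def place_three_replicas(topology, writer, rng=random):
    """Follow the text: writer's node, another node on the same rack,
    then a node on a different rack."""
    local_rack = topology[writer]
    same_rack = [n for n, r in topology.items()
                 if r == local_rack and n != writer]
    other_rack = [n for n, r in topology.items() if r != local_rack]
    if not same_rack or not other_rack:
        raise ValueError("topology too small for this placement")
    return [writer, rng.choice(same_rack), rng.choice(other_rack)]

def inter_rack_hops(topology, targets):
    """Count transfers between consecutive targets that cross racks."""
    return sum(1 for a, b in zip(targets, targets[1:])
               if topology[a] != topology[b])

if __name__ == "__main__":
    targets = place_three_replicas(TOPOLOGY, "dn1")
    print(targets, inter_rack_hops(TOPOLOGY, targets))
```

Under these assumptions the block crosses a rack boundary once, whereas a placement on three unique racks such as `["dn1", "dn3", "dn5"]` crosses twice. That difference is the write-cost saving the text refers to.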
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, a replica that resides in the local data center is preferred over any remote replica.
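The "closest replica" preference can be illustrated with a small ranking function. This is a sketch under assumptions, not the HDFS read path: the `Location` type, the `proximity` ranks (0 to 3), and `choose_replica` are hypothetical names chosen for this example, and the ranks only encode the ordering stated in the text (same rack before same data center before remote).

```python
from typing import NamedTuple

class Location(NamedTuple):
    """Hypothetical position of a node in the cluster."""
    data_center: str
    rack: str
    node: str

def proximity(reader, replica):
    """Lower is closer: same node, same rack, same data center, remote."""
    if reader == replica:
        return 0
    if (reader.data_center, reader.rack) == (replica.data_center, replica.rack):
        return 1
    if reader.data_center == replica.data_center:
        return 2
    return 3

def choose_replica(reader, replicas):
    """Return the replica location closest to the reader."""
    return min(replicas, key=lambda r: proximity(reader, r))

if __name__ == "__main__":
    reader = Location("dc1", "/rack1", "dn1")
    replicas = [
        Location("dc2", "/rack9", "dn90"),
        Location("dc1", "/rack2", "dn3"),
        Location("dc1", "/rack1", "dn2"),
    ]
    print(choose_replica(reader, replicas))
```

With the sample data, the reader on `/rack1` is served by the replica on `dn2`, which shares its rack. If that replica were unavailable, the one in the same data center would be chosen before the one in `dc2`.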