Which mode should I pick?
Letting RabbitMQ handle network partitions automatically does not make them any less of a problem. Partitions will always cause trouble for a RabbitMQ cluster; the handling mode only lets you choose which kind of trouble you get. As stated in the introduction, if you want to connect RabbitMQ clusters over generally unreliable links, use federation or the shovel instead.
With that said, you might pick a recovery mode as follows:
ignore - Your network really is reliable. All your nodes are in one rack, connected by a switch, and that switch is also the route to the outside world. You do not want to risk any part of your cluster shutting down because another part of it fails (or you have a two-node cluster).
pause_minority - Your network may be less reliable. You have clustered across 3 AZs (availability zones) in EC2, and you assume that at most one AZ will fail at a time. In that scenario you want the remaining two AZs to keep working, and the nodes from the failed AZ to rejoin automatically and without fuss when the AZ comes back.
autoheal - Your network may not be reliable. You are more concerned with continuity of service than with data integrity. You may have a two-node cluster.
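As a sketch of how a chosen mode is applied: the mode is set through the `cluster_partition_handling` configuration key, which this section itself does not spell out. The fragment below assumes the ini-style `rabbitmq.conf` format used by RabbitMQ 3.7 and later; on older releases the equivalent setting goes in the Erlang-term file `rabbitmq.config` as `[{rabbit, [{cluster_partition_handling, pause_minority}]}].`

```ini
# rabbitmq.conf (assumed RabbitMQ 3.7+ format)
# Use one of the modes discussed above: ignore, pause_minority, autoheal
cluster_partition_handling = pause_minority
```

The same value should be set on every node in the cluster, and a node needs a restart to pick up a changed configuration file.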
More about pause_minority mode
The Erlang VM on a paused node keeps running, but the node does not listen on any ports or do any other work. It checks once per second whether the rest of the cluster has reappeared, and starts up again if it has.
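The once-per-second check can be pictured as a simple polling loop. The Python sketch below is illustrative only and is not how RabbitMQ implements it: `wait_until_cluster_returns`, `cluster_visible`, and the injectable `sleep` parameter are hypothetical names chosen for this example.

```python
import time
from typing import Callable


def wait_until_cluster_returns(
    cluster_visible: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
    interval_seconds: float = 1.0,
) -> int:
    """Block until cluster_visible() returns True; return the number of checks made.

    cluster_visible is a hypothetical probe standing in for "can this paused
    node see the rest of the cluster again?".
    """
    checks = 1
    while not cluster_visible():
        # Still cut off: stay paused and look again after one interval.
        sleep(interval_seconds)
        checks += 1
    return checks


# Simulated partition that heals on the third check; sleeps are recorded
# instead of actually waiting.
answers = iter([False, False, True])
slept = []
checks_made = wait_until_cluster_returns(lambda: next(answers), sleep=slept.append)
print(checks_made, slept)
```

Passing the sleep function in as a parameter keeps the sketch testable without real delays.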
Note that nodes do not enter the paused state at startup, even if they are in a minority at that point. Any such minority at startup is assumed to exist only because the rest of the cluster has not been started yet.
Also note that RabbitMQ pauses every node that is not in a strict majority of the cluster, i.e. a group containing more than half of all nodes. Enabling pause_minority mode on a two-node cluster is therefore a bad idea: after any network partition or node failure, neither side holds more than half of the nodes, so every node that is still running pauses. For clusters of more than two nodes, however, pause_minority mode is likely to be safer than ignore mode, especially if the most likely form of network partition is a single minority of nodes dropping off the network.
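The strict-majority rule is easy to check with a little arithmetic. The following Python sketch is only an illustration of the rule as described above, not RabbitMQ code; `should_pause` and its parameters are hypothetical names.

```python
def should_pause(visible_nodes: int, cluster_size: int) -> bool:
    """Return True when this side of a partition is not a strict majority.

    visible_nodes counts the node itself plus every peer it can still reach.
    """
    if not 1 <= visible_nodes <= cluster_size:
        raise ValueError("visible_nodes must be between 1 and cluster_size")
    # Strict majority means more than half of all nodes,
    # so an even split pauses both sides.
    return visible_nodes * 2 <= cluster_size


# Two-node cluster split 1/1: neither side has more than half, so both pause.
print(should_pause(1, 2))  # True
# Three-node cluster split 2/1: the pair keeps running, the single node pauses.
print(should_pause(2, 3))  # False
print(should_pause(1, 3))  # True
```

The two-node case shows why the mode is a poor fit there: one node out of two is exactly half, which is not more than half.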
Finally, note that pause_minority mode does nothing to defend against partitions caused by cluster nodes being suspended. A suspended node never sees the rest of the cluster vanish, so nothing triggers it to disconnect itself from the cluster.