Raft: Some Reading Notes (2)

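Note that in the new configuration, although the new machine is not counted in the quorum, a quorum is still defined as a majority, so there is no dual-leader problem. The quorum simply does not include that very slow machine, and the cluster remains available. Note also that this error does not violate any invariants. Let us revisit why this error mechanism was introduced in the first place: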

The leader should also abort the change if the new server is unavailable or is so slow that it will never catch up. This check is important: Lamport’s ancient Paxon government broke down because they did not include it. They accidentally changed the membership to consist of only drowned sailors and could make no more progress [48]. Attempting to add a server that is unavailable or slow is often a mistake. (I take "mistake" here to mean an operator error.)

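Removing Servers

Adding a server already raises plenty of concerns, but removing one raises far more. With removal, the worry is no longer the impact on availability but the concrete implementation details:

1) When can a removed server be shut down? (Note that I describe the Conf Change as "removing" a server rather than "stopping" or "shutting down" one, to stress that removal does not mean the server has stopped communicating with the cluster.)

2) What should happen when the current Leader is not part of Cnew?

3) How should RPCs be handled for removed servers, including whether to send them RPCs at all and how to treat RPCs that come from them?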

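Handling Incoming RPCs

The third question is the most serious one, because it bears directly on the implementation. We start with a subset of it: how a server handles the RPCs it receives, that is, incoming RPCs.

Here is what the author says: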

• A server accepts AppendEntries requests from a leader that is not part of the server’s latest configuration. Otherwise, a new server could never be added to the cluster (it would never accept any log entries preceding the configuration entry that adds the server).

• A server also grants its vote to a candidate that is not part of the server’s latest configuration (if the candidate has a sufficiently up-to-date log and a current term). This vote may occasionally be needed to keep the cluster available.

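The first rule takes us back to how adding a server is actually carried out. The new server's log is initially empty, so it has no idea what Cnew is. It therefore has to accept AppendEntries from any Leader, otherwise the catch-up mechanism would be empty talk.

The second rule says that a server may vote for a candidate that is not part of Cnew. This is there to preserve availability.

To summarize in the author's own words: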

Thus, servers process /incoming RPC requests/ without consulting their current configurations.

In other words, every incoming RPC is accepted and processed regardless of the configuration.
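A minimal sketch of what "without consulting the configuration" could look like in a handler. The `Server` class and its method names are hypothetical, invented for illustration, and the AppendEntries path is simplified (it omits the previous-entry consistency check and commit-index handling). The point is only that neither handler ever tests whether the sender is in `self.config`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Server:
    server_id: str
    config: Set[str]                      # latest configuration this server knows
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[Tuple[int, str]] = field(default_factory=list)  # (term, payload)

    def handle_append_entries(self, term, leader_id, entries):
        # Only the term is checked. There is deliberately no
        # "leader_id in self.config" test, so a brand-new server with an
        # empty log (and no knowledge of Cnew) can still catch up.
        if term < self.current_term:
            return False
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        self.log.extend(entries)          # simplified: no prev-entry check
        return True

    def handle_request_vote(self, term, candidate_id, last_log_term, last_log_index):
        # Again no membership test: the vote depends only on the term,
        # on whether we already voted, and on log up-to-dateness.
        if term < self.current_term:
            return False
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        if self.voted_for not in (None, candidate_id):
            return False
        my_last_term = self.log[-1][0] if self.log else 0
        up_to_date = (last_log_term, last_log_index) >= (my_last_term, len(self.log))
        if up_to_date:
            self.voted_for = candidate_id
        return up_to_date
```

For example, a server created with `config=set()` still accepts entries from leader "s1", and a server whose configuration is `{"s2", "s3"}` can still grant its vote to candidate "s1".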

Leader Step Down

A leader that is removed from the configuration steps down once the Cnew entry is committed.

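The Leader Step Down rule specifies how to deal with the Cnew entry when Leader ∉ Cnew. A non-Leader server adopts the configuration in Cnew as soon as it receives the entry, and nothing else is required of it. The Leader likewise adopts the configuration immediately, but it clearly should not remain Leader, because the new configuration does not contain it, which is to say we expect it to be removed. Raft requires such a Leader to step down once it has successfully committed Cnew.

Keep in mind that Step Down != Shut Down: the Leader merely gives up leadership. Note also that at this point the Leader sends AppendEntries only to servers that belong to Cnew (and not to itself), so the next Leader must belong to Cnew.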
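The commit-then-step-down rule can be sketched roughly as follows. This is an illustrative sketch, not code from the dissertation; the function names `cnew_committed` and `role_after_commit` and the server ids are made up. A Leader outside Cnew holds the entry locally, but its own copy is not counted toward the majority.

```python
def majority(n):
    """Smallest number of servers that forms a majority of n."""
    return n // 2 + 1


def cnew_committed(cnew, stored_on):
    # Only servers that are members of Cnew count toward the majority.
    # A Leader outside Cnew may appear in stored_on, but the set
    # intersection drops it.
    return len(stored_on & cnew) >= majority(len(cnew))


def role_after_commit(leader_id, cnew, stored_on):
    if not cnew_committed(cnew, stored_on):
        return "leader"        # still managing a cluster it is not part of
    # Once Cnew is committed, a Leader outside Cnew steps down
    # (it becomes a follower; it is not shut down).
    return "leader" if leader_id in cnew else "follower"
```

With `cnew = {"s2", "s3", "s4"}` and Leader "s1", the entry stored on `{"s1", "s2"}` is not yet committed, because only "s2" counts. Once "s3" also stores it, the entry is committed and "s1" steps down.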

The author also pokes fun at the odd situations his own algorithm creates:

First, there will be a period of time (while it is committing Cnew) when a leader can manage a cluster that does not include itself; it replicates log entries but does not count itself in majorities.

Second, a server that is not part of its own latest configuration should still start new elections, as it might still be needed until the Cnew entry is committed (as in Figure 4.6). It does not count its own vote in elections unless it is part of its latest configuration.

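First, this means there is a scenario in which a Leader that does not belong to the current configuration is managing the whole cluster, and this Leader does not count itself in the quorum.

Second, if the server that created the Conf Change is not itself removed by it, there is nothing to worry about. The trouble is the case where Leader ∉ Cnew and yet the Leader created Cnew. If it is deposed for some reason before Cnew is committed, it cannot simply sit back; it still has to stand for election! Recall from above that servers accept every incoming RPC. That produces a strange picture: a removed server holds Cnew, knows it is not in Cnew, still has to run for election, and may even win and be the one that commits Cnew.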
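The vote-counting rule from the second quote ("It does not count its own vote in elections unless it is part of its latest configuration") can be sketched as below. The function name `wins_election` and the server ids are hypothetical, and the sketch assumes the candidate's latest configuration is the Cnew entry already in its log.

```python
def wins_election(candidate_id, latest_config, granted_by):
    # Count only votes from members of the candidate's latest configuration.
    votes = set(granted_by) & latest_config
    # The candidate adds its own vote only if it is itself a member.
    if candidate_id in latest_config:
        votes.add(candidate_id)
    return len(votes) >= len(latest_config) // 2 + 1
```

A removed server "s1" whose latest configuration is `{"s2", "s3", "s4"}` needs two votes from that set to win, since its own vote does not count; a member such as "s2" needs only one other vote.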

Avoid Disruptions

Once the cluster leader has created the Cnew entry, a server that is not in Cnew will no longer receive heartbeats, so it will time out and start new elections. Furthermore, it will not receive the Cnew entry or learn of that entry’s commitment, so it will not know that it has been removed from the cluster.
