[etcd] Raft理论学习(二)(选举+复制日志)

4 Designing for understandability

We had several goals in designing Raft: it must provide a complete and practical foundation for system building,
so that it significantly reduces the amount of design work required of developers; itmust be safe under all conditions
and available under typical operating conditions; and it must be efficient for common operations. But our most
important goal—and most difficult challenge—was understandability. It must be possible for a large audience to
understand the algorithmcomfortably. In addition, it must be possible to develop intuitions about the algorithm, so
that system builders can make the extensions that are inevitable in real-world implementations.

There were numerous points in the design of Raft where we had to choose among alternative approaches.
In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative
(for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a
reader to completely understand the approach and its implications?

We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques
that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever
possible, we divided problems into separate pieces that could be solved, explained, and understood relatively
independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.

Our second approach was to simplify the state space by reducing the number of states to consider, making the
system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have
holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we
tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability.
In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion (“choose any; it doesn’t matter”).We used randomization to simplify the Raft leader election algorithm.

4为易于理解而设计

我们在设计Raft时有几个目标:它必须为系统构建提供完整而实用的基础, 从而大大减少了开发人员所需的设计工作量;在任何情况下都必须安全 并在典型操作条件下可用;并且对于常规操作必须高效。但是我们最 重要目标(也是最困难的挑战)是可理解性。广大观众一定有可能 轻松理解算法。另外,必须有可能发展出关于算法的直觉,因此 系统构建者可以进行实际实现中不可避免的扩展。

在Raft设计中有很多要点,我们必须在其他方法中进行选择。 在这种情况下,我们根据可理解性评估了替代方案:解释每种替代方案有多难 (例如,其状态空间有多复杂,是否有细微的影响?),对于一个状态来说,它有多容易 读者能完全理解该方法及其含义吗?

我们认识到,这种分析具有很高的主观性。但是,我们使用了两种技术 通常适用。第一种技术是众所周知的问题分解方法: 可能的话,我们将问题分为几个独立的部分,可以相对解决,解释和理解 独立地。例如,在Raft中,我们分离了领导者选举,日志复制,安全性和成员资格更改。

我们的第二种方法是通过减少要考虑的状态数来简化状态空间, 系统更连贯,并尽可能消除不确定性。具体来说,不允许日志包含 holes,并且Raft限制了日志彼此不一致的方式。尽管在大多数情况下, 为了消除不确定性,在某些情况下不确定性实际上提高了可理解性。 特别地,随机方法引入了不确定性,但是它们倾向于通过以相似的方式处理所有可能的选择来减少状态空间(“选择任何一个,不要紧”)。我们使用随机化来简化Raft领导者选举算法。

5 The Raft consensus algorithm

Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the
algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these
figures are discussed piecewise over the rest of this section.

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility
for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers,
and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management
of the replicated log. For example, the leader can decide where to place new entries in the log without consulting
other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become
disconnected from the other servers, in which case a new leader is elected.

Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems,
which are discussed in the subsections that follow:

• Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).

• Log replication: the leader must accept log entries from clients and replicate them across the cluster,
forcing the other logs to agree with its own (Section5.3).

• Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server
has applied a particular log entry to its state machine, then no other server may apply a different command
for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an
additional restriction on the election mechanism described in Section 5.2.

After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the
system.

5 Raft共识算法

Raft是一种用于管理第2节中所述形式的复制日志的算法。图2总结了 该算法以简明形式提供参考,图3列出了该算法的关键属性;这些要素 在本节的其余部分将对图进行分段讨论。 Raft通过首先选举一位杰出的领导者,然后赋予领导者完全责任来实现共识 用于管理复制的日志。领导者接受来自客户端的日志条目,将其复制到其他服务器上, 并告诉服务器何时可以安全地将日志条目应用于其状态机。拥有领导者可以简化管理 复制的日志。例如,领导者无需咨询即可决定将新条目放置在日志中的位置 其他服务器,数据以简单的方式从领导者流向其他服务器。领导者可能失败或成为 与其他服务器断开连接,在这种情况下,将选举新的领导者。

使用领导者方法,Raft将共识问题分解为三个相对独立的子问题, 以下各小节将对此进行讨论:

•领导者选举:当现有领导者失败时,必须选择新的领导者(第5.2节)。

•日志复制:领导者必须接受来自客户端的日志条目,并在整个集群中复制它们, 强制其他日志与其自己的日志一致(第5.3节)。

•安全:Raft的关键安全属性是图3中的状态机安全属性:如果有服务器 已将特定的日志条目应用于其状态机,则其他服务器均不能应用其他命令 对于相同的日志索引。 5.4节描述了Raft如何确保此属性;解决方案涉及 5.2节中所述的选举机制的其他限制。 在介绍了共识算法之后,本节将讨论可用性问题以及计时在 系统。

[etcd] Raft理论学习(二)(选举+复制日志)[etcd] Raft理论学习(二)(选举+复制日志)

[etcd] Raft理论学习(二)(选举+复制日志)[etcd] Raft理论学习(二)(选举+复制日志)

5.1 Raft basics

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures.
At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there
is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on
their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if
a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new
leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed
below.

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive
integers. Each term begins with an election, in which one or more candidates attempt to become leader as described
in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations
an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election)

will begin shortly. Raft ensures that there is at most one  leader in a given term.

5.1Raft基础

一个Raft集群包含多个服务器;五是一个典型数字,它允许系统容忍两个故障。 在任何给定时间,每个服务器都处于以下三种状态之一:领导者,关注者或候选者。在正常操作那里 是一个领导者,其他所有服务器都是跟随者。追随者是被动的:他们不会在以下方面发出请求 他们自己,但只是响应领导者和候选人的要求。领导者处理所有客户请求(如果 客户与关注者联系,则关注者会将其重定向到领导者。第三种状态,候选人,用于选举新的 第5.2节所述的领导者。图4显示了状态及其转换。讨论了过渡 下面。

Raft将时间划分为任意长度,如图5所示。 整数。每个term都以选举开始,在选举中,一个或多个候选人尝试按所述方式成为领导者 在5.2节中。如果候选人在选举中获胜,则它将在剩余任期中担任领导职务。在某些情况下 选举将产生分裂投票。在这种情况下,该词将以无首字母结尾;新任期(以新的选举) 即将开始。Raft确保给定任期内最多有一位领导者。

[etcd] Raft理论学习(二)(选举+复制日志)

Different servers may observe the transitions between terms at different times, and in some situations a server
may not observe an election or even entire terms. Terms act as a logical clock [14] in Raft, and they allow servers
to detect obsolete information such as stale leaders. Each server stores a current term number, which increases
monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current
term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers
that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term
number, it rejects the request.

Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only
two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and Append-
Entries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Section
7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response
in a timely manner, and they issue RPCs in parallel for best performance.

不同的服务器可能会在不同时间观察术语之间的转换,在某些情况下,服务器 可能不会遵守选举甚至整个条款。术语在Raft中充当逻辑时钟[14],它们允许服务器 检测过时的信息,例如过时的领导者。每个服务器都存储一个当前的术语编号,该编号增加 随着时间的流逝单调。只要服务器进行通信,就会交换当前条款;如果一台服务器当前 term小于另一个,然后将其当前条款更新为较大的值。如果候选人或领导者发现 它的期限已过时,则立即恢复为关注者状态。如果服务器收到带有陈旧条件的请求 数字,它拒绝该请求。

Raft服务器使用远程过程调用(RPC)进行通信,并且基本共识算法仅需要 两种类型的RPC。 RequestVote RPC是由候选人在选举期间发起的(第5.2节),并在 条目RPC由领导者发起,以复制日志条目并提供心跳信号形式(第5.3节)。部分 7添加了第三个RPC,用于在服务器之间传输快照。如果服务器未收到响应,则重试RPC 并及时发布RPC,以获得最佳性能。

5.2 Leader election

Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A
server remains in follower state as long as it receives valid 5 RPCs from a leader or candidate. Leaders send periodic
heartbeats (AppendEntriesRPCs that carry no log entries) to all followers in order to maintain their authority. If a
follower receives no communication over a period of time called the election timeout, then it assumes there is no viable
leader and begins an election to choose a new leader.

To begin an election, a follower increments its current term and transitions to candidate state. It then votes for
itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in
this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or
(c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.

A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same
term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section
5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the
election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it
becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new
elections.

While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be
leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate
recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s
current term, then the candidate rejects the RPC and continues in candidate state.

The third possible outcome is that a candidate neither wins nor loses the election: if many followers become
candidates at the same time, votes could be split so that no candidate obtains a majority.When this happens, each
candidate will time out and start a new election by incrementing its term and initiating another round of Request-
Vote RPCs. However, without extra measures split votes could repeat indefinitely.

Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To
prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms).
This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends
heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate
restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before
starting the next election; this reduces the likelihood of another split vote in the new election. Section 9.3 shows
that this approach elects a leader rapidly.

Elections are an example of how understandability guided our choice between design alternatives. Initially
we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between
competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower
state so that the higher ranking candidate could more easily win the next election. We found that this approach
created subtle issues around availability (a lower-ranked server might need to time out and become a candidate
again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We
made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually
we concluded that the randomized retry approach is more obvious and understandable.

5.2*选举

Raft使用心跳机制触发领导者选举。服务器启动时,它们以跟随者的身份开始。一种 只要服务器接收到来自领导者或候选者的有效5个RPC,服务器就保持在跟随者状态。*定期发送 向所有关注者发送心跳信号(不携带日志条目的AppendEntriesRPC),以维持其权限。如果一个 追随者在称为选举超时的一段时间内未收到任何通信,则假定没有可行的通信 领导者并开始选举以选择新的领导者。

要开始选举,追随者会增加其当前任期并过渡到候选状态。然后投票支持 本身,并与集群中的每个其他服务器并行发出RequestVote RPC。候选人继续 这种状态,直到发生三件事之一:(a)赢得选举,(b)另一台服务器将自己确立为领导者,或者 (c)一段时间没有获胜者。这些结果将在下面的段落中单独讨论。

如果候选人从整个集群中的大多数服务器中获得相同票数的投票,则赢得选举 术语。在给定的期限内,每台服务器将最多投票一名候选人,以先到先得的方式进行(注: 5.4增加了对投票的额外限制)。多数规则确保最多一名候选人可以赢得 选择一个特定的术语(图3中的“选举安全属性”)。候选人赢得选举后, 成为领导者。然后,它将心跳消息发送到所有其他服务器以建立其权限并防止新的 选举。 在等待投票时,候选人可能会从声称是 领导。如果领导者的任期(包含在其RPC中)至少与候选人的当前任期一样大,则候选人 确认领导者为合法并返回跟随者状态。如果RPC中的字词小于候选人的 当前任期,则候选人拒绝RPC,并继续处于候选人状态。 第三种可能的结果是,候选人既不会赢得选举也不会失去选举:如果有许多追随者 候选人可以同时投票,因此没有候选人获得多数票。 候选人将超时并通过增加任期并发起另一轮请求来开始新的选举, 投票RPC。但是,如果不采取额外措施,分裂投票可以无限期地重复。 Raft使用随机的选举超时来确保分割票很少发生,并且可以快速解决。至 为了避免首先进行分割表决,选举超时是从固定间隔(例如150-300毫秒)中随机选择的。 这样会分散服务器,因此在大多数情况下,只有一台服务器会超时;它赢得选举并发送 在任何其他服务器超时之前发出心跳信号。使用相同的机制来处理拆分投票。每个候选人 在选举开始时重新启动其随机的选举超时,并等待该超时过去 开始下一次选举;这减少了在新选举中再次进行分裂表决的可能性。 9.3节显示 这种方法可以迅速选出一位领导者。 选举是可理解性如何指导我们在设计备选方案之间进行选择的一个示例。原来 我们计划使用排名系统:为每个候选人分配了唯一的排名,该排名用于选择 竞争候选人。如果某个候选人发现了另一个更高级别的候选人,它将返回给关注者 状态,以便排名较高的候选人可以更轻松地赢得下一次选举。我们发现这种方法 在可用性方面造成了细微的问题(排名较低的服务器可能需要超时并成为候选服务器 如果排名较高的服务器再次出现故障,但又过早发生故障,则可以重置选举领导者的进度)。我们 对该算法进行了多次调整,但是每次调整后,都会出现新的极端情况。最终 我们得出的结论是,随机重试方法更为明显和易于理解。

5.3 Log replication

Once a leader has been elected, it begins servicing client requests. Each client request contains a command to
be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues
AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been
safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that
execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries Append-
Entries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.

5.3日志复制

选举领导者后,它将开始为客户的请求提供服务。每个客户端请求都包含一个命令 由复制状态机执行。领导将命令作为新条目附加到其日志中,然后发出 与每个其他服务器并行的AppendEntries RPC,以复制条目。当输入 安全复制(如下所述)后,领导者将该条目应用于其状态机并返回该结果的结果。 执行到客户端。如果关注者崩溃或运行缓慢,或者丢失了网络数据包,则领导者会重试 无限期地输入RPC(即使它已响应客户端),直到所有关注者最终存储所有日志条目为止。

Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term
number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies
between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.

日志的组织如图6所示。每个日志条目都存储一个状态机命令以及该术语。 组长收到条目时的编号。日志条目中的术语数字用于检测不一致 并确保图3中的某些属性。每个日志条目还具有一个整数索引,用于标识其在日志中的位置。

[etcd] Raft理论学习(二)(选举+复制日志)

The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available
state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of
the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries
created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes,
and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows
to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the
other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its
local state machine (in log order).

领导者决定何时将日志条目应用于状态机是安全的;这样的条目称为已提交。 Raft保证提交的条目是持久的,并且最终将由所有可用条目执行 状态机。一旦创建条目的负责人已将日志条目复制到大多数 服务器(例如,图6中的条目7)。这还将提交领导者日志中的所有先前条目,包括条目 由前任*创建。第5.4节讨论了在更换领导者后应用此规则时的一些细微之处, 这也表明承诺的定义是安全的。领导者跟踪知道的最高指数 提交,并将其包含在将来的AppendEntries RPC(包括心跳)中,以便 其他服务器最终找到了。追随者得知日志条目已提交后,便将该条目应用于其 本地状态机(按日志顺序)。

We designed the Raft log mechanismto maintain a high level of coherency between the logs on different servers.
Not only does this simplify the system’s behavior and make itmore predictable, but it is an important component
of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property
in Figure 3:

• If two entries in different logs have the same index and term, then they store the same command.
• If two entries in different logs have the same index and term, then the logs are identical in all preceding
entries.

我们设计了Raft日志机制来维持不同服务器上的日志之间的高度一致性。 这不仅简化了系统的行为并使其更加可预测,而且它是重要的组成部分 确保安全。 Raft维护以下属性,它们共同构成了Log Matching属性 在图3中:

•如果不同日志中的两个条目具有相同的索引和术语,则它们存储相同的命令。

•如果不同日志中的两个条目具有相同的索引和术语,则所有前面的日志均相同 条目。

The first property follows from the fact that a leader creates at most one entry with a given log index in a given
term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency
check performed by AppendEntries.When sending an AppendEntries RPC, the leader includes the index
and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in
its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction
step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves
the Log Matching Property whenever logs are extended. As a result, wheneverAppendEntries returns successfully,
the leader knows that the follower’s log is identical to its own log up through the new entries.

第一个属性来自以下事实:领导者最多在给定的给定日志索引下创建一个条目 术语,并且日志条目永远不会更改其在日志中的位置。第二个属性通过简单的一致性保证 发送AppendEntries RPC时,领导者会包含索引 以及其日志中紧接新条目的条目的术语。如果关注者在中找不到条目 它的日志使用相同的索引和术语,然后拒绝新条目。一致性检查充当归纳法 步骤:日志的初始空状态满足Log Matching属性,并保留一致性检查 扩展日志时,将使用“日志匹配属性”。结果,只要AppendEntries成功返回, 领导者知道追随者的日志与通过新条目进行的追随者日志相同。

During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency
check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully
replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower
crashes. Figure 7 illustrates the ways in which followers’logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.

在正常操作期间,领导者和关注者的日志保持一致,因此AppendEntries保持一致 检查永远不会失败。但是,领导者崩溃可能会使日志不一致(旧领导者可能没有完全 复制其日志中的所有条目)。这些不一致会加剧一系列领导者和追随者 崩溃。图7说明了追随者的日志可能不同于新领导者的日志的方式。跟随者可能会丢失领导者上存在的条目,它可能具有领导者上不存在的额外条目或两者兼而有之。日志中的缺失条目和多余条目可能跨越多个term。

[etcd] Raft理论学习(二)(选举+复制日志)

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that
conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show
that this is safe when coupled with one more restriction.

To bring a follower’s log into consistency with its own,the leader must find the latest log entry where the two
logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries
after that point. All of these actions happen in response to the consistency check performed by AppendEntries
RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will
send to that follower.When a leader first comes to power, it initializes all nextIndex values to the index just after the
last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency
check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries
the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When
this happens,AppendEntrieswill succeed, which removes any conflicting entries in the follower’s log and appends
entries fromthe leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistentwith the leader’s,and it will remain that way for the rest of the term.

在Raft中,领导者通过强迫追随者的日志重复自己的日志来处理不一致之处。这意味着 关注者日志中的冲突条目将被领导者日志中的条目覆盖。第5.4节将显示 当再加上一个限制时,这是安全的。

为了使追随者的日志与自己的日志保持一致,领导者必须找到最新的日志条目,其中两个 日志同意,在此之后删除关注者日志中的所有条目,并将领导者的所有条目发送给关注者 在那之后。所有这些操作都是响应AppendEntries执行的一致性检查而发生的 RPC。领导者为每个关注者维护一个nextIndex,这是领导者将要下一个日志条目的索引 当领导者上台时,它将所有nextIndex值初始化为索引之后的索引 日志中的最后一个(图7中的11)。如果关注者的日志与领导者的日志不一致,则AppendEntries的一致性 检查将在下一个AppendEntries RPC中失败。拒绝后,领导者递减nextIndex并重试 AppendEntries RPC。最终nextIndex将到达领导者和跟随者日志匹配的地步。什么时候 发生这种情况时,AppendEntries将成功,这将删除关注者日志中的所有冲突条目并追加 领导者日志中的条目(如果有)。一旦AppendEntries成功,追随者的日志便与领导者的日志保持一致,并且在本学期的其余时间中都将保持这种状态。

If desired, the protocol can be optimized to reduce the number of rejected AppendEntries RPCs. For example,
when rejecting an AppendEntries request, the follower can include the term of the conflicting entry and the first
index it stores for that term. With this information, the leader can decrement nextIndex to bypass all of the conflicting
entries in that term; one AppendEntries RPC will be required for each term with conflicting entries, rather
than one RPC per entry. In practice, we doubt this optimization is necessary, since failures happen infrequently
and it is unlikely that there will be many inconsistent entries.

如果需要,可以优化协议以减少拒绝的AppendEntries RPC的数量。例如, 在拒绝AppendEntries请求时,关注者可以包括冲突条目的术语和第一个 它为该词存储的索引。利用此信息,领导者可以减少nextIndex来绕过所有冲突的对象 该词条中的条目;每个词条有冲突的条目都需要一个AppendEntries RPC,而不是 每个条目最多一个RPC。实际上,我们怀疑这种优化是否必要,因为故障很少发生 并且不太可能会有很多不一致的条目。

With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to
power. It just begins normal operation, and the logs automatically converge in response to failures of the Append-
Entries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only
Property in Figure 3).

This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept,
replicate, and apply new log entries as long as a majority of the servers are up; in the normal case a new entry
can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not
impact performance.

通过这种机制,领导者无需采取任何特殊措施即可恢复日志一致性。它只是开始正常运行,并且日志会根据“附录- 条目一致性检查。领导者永远不会覆盖或删除自己日志中的条目(“领导者仅追加” 图3中的属性。 这种日志复制机制展现出第2节中所述的理想共识属性:筏可以接受, 只要大多数服务器都已启动,就复制并应用新的日志条目;通常情况下是一个新条目 可以通过单轮RPC复制到大多数集群;一个慢速追随者不会 影响性能。

5.4 Safety

The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms
described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the
same order. For example, a follower might be unavailable while the leader commits several log entries, then it could
be elected leader and overwrite these entries with new ones; as a result, different state machines might execute
different command sequences.

This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The
restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the
Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment
more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads
to correct behavior of the replicated state machine.

5.4安全

前面的部分描述了Raft如何选举领导者并复制日志条目。但是,机制 到目前为止描述的内容还不足以确保每个状态机在 相同的顺序。例如,当领导者提交多个日志条目时,跟随者可能不可用,那么它可能 当选*,并用新的条目覆盖这些条目;结果,可能会执行不同的状态机 不同的命令序列。 本节通过添加一个限制来选举Raft算法,从而完善了Raft算法。的 限制可确保任何给定术语的领导者都包含先前术语中提交的所有条目( 图3中的领导者完整性属性。给定选举限制,然后制定承诺规则 更精确。最后,我们给出领导力完整性属性的证明草图,并说明它如何导致 纠正复制状态机的行为。

5.4.1 Election restriction

In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In
some consensus algorithms, such as Viewstamped Replication [22], a leader can be elected even if it doesn’t
initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the
missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately,
this results in considerable additional mechanism and complexity. Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.

Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed
entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed
entrymust be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log
in that majority (where “up-to-date” is defined precisely below), then it will hold all the committed entries. The
RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the
voter denies its vote if its own log is more up-to-date than that of the candidate.

5.4.1选举限制

在任何基于领导者的共识算法中,领导者最终必须存储所有提交的日志条目。在 一些共识算法(例如Viewstamped Replication [22]),即使没有 最初包含所有提交的条目。这些算法包含其他机制来识别 缺失的条目,然后在选举过程中或之后不久将其发送给新领导。不幸, 这导致相当大的附加机制和复杂性。 Raft使用一种更简单的方法,该方法可确保从当选新任领导之时起,就将前任所有已提交的承诺条目存在于每个新领导者中,而无需将这些条目转移给领导者。这意味着日志条目仅在一个方向上流动,从领导者到追随者,领导者永远不会覆盖其日志中的现有条目。 Raft使用投票程序来阻止候选人赢得选举,除非其日志包含所有已提交的 条目。候选人必须联系集群中的大多数才能当选,这意味着每位候选人 这些服务器中的至少一台必须存在条目。如果候选人的日志至少与其他任何日志一样最新 在大多数情况下(以下精确定义了“最新”),则它将保存所有提交的条目。的 RequestVote RPC实施了此限制:RPC包含有关候选人日志的信息,以及 如果自己的日志比候选人的日志最新,则投票人拒绝投票。

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the
logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs
end with the same term, then whichever log is longer is more up-to-date.

Raft通过比较索引中最后一个条目的索引和术语来确定两个日志中哪个是最新的 日志。如果日志中的最后一个条目具有不同的术语,则带有较新术语的日志将是最新的。如果日志 以相同的术语结尾,则以更长的日志为准。

[etcd] Raft理论学习(二)(选举+复制日志)

5.4.2 Committing entries from previous terms

As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is
stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to
finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is
committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored
on a majority of servers, yet can still be overwritten by a future leader.

5.4.2提交先前条款中的条目

如第5.3节所述,领导者知道,一旦该词条被存储在大多数服务器上。如果领导者在提交条目之前崩溃,未来的领导者将尝试 完成复制条目。但是,领导者不能立即得出以下结论: 一旦存储在大多数服务器上就提交。图8说明了存储旧日志条目的情况 在大多数服务器上,但仍可以由将来的领导者覆盖。

To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting
replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry
from the current term has been committed in this way, then all prior entries are committed indirectly because
of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry
is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach
for simplicity.

Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers
when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader rereplicates
entries from prior “terms,” it must do so with its new “term number.” Raft’s approach makes it easier
to reason about log entries, since they maintain the same term number over time and across logs. In addition, new
leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant
log entries to renumber them before they can be committed).

为了消除如图8所示的问题,Raft永远不会通过计数来提交前项的日志条目 副本。通过计算副本数,仅提交领导者当前任期的日志条目;一旦进入 从当前术语中的提交已经以这种方式提交,那么所有先前的条目都是间接提交的,因为 日志匹配属性。在某些情况下,领导者可以安全地断定较旧的日志条目 提交(例如,如果该条目存储在每台服务器上),但是Raft采用了更为保守的方法 为简单起见。 由于日志条目保留其原始条款编号,因此Raft会在承诺规则中产生这种额外的复杂性 领导者复制以前条款中的条目时。在其他共识算法中,如果新的领导者重复 先前“条款”中的条目,则必须使用其新的“条款编号”。 Raft的方法使操作更轻松 之所以要说明日志条目,是因为它们随着时间和跨日志保持相同的术语编号。另外,新 与其他算法相比,Raft中的领导者发送的前项日志条目要少(其他算法必须发送多余的日志条目) 日志条目以对其重新编号,然后才能提交它们)。

5.4.3 Safety argument

Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property
holds (this argument is based on the safety proof; see Section 9.2). We assume that the Leader Completeness
Property does not hold, then we prove a contradiction. Suppose the leader for term T (leaderT) commits a log
entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U
> T whose leader (leaderU) does not store the entry. 

5.4.3安全论证

给定完整的Raft算法,我们现在可以更精确地争论领导者完整性属性 成立(此论点基于安全证明;请参见第9.2节)。我们假设领导者的完整性 财产不成立,那么我们证明了矛盾。假设术语T的领导者(leaderT)提交日志 条目,但该日志条目不由某个将来术语的负责人存储。考虑最小项U >其领导者(leaderU)不存储条目的T。

1. The committed entry must have been absent from leaderU’s log at the time of its election (leaders never delete or overwrite entries).

2. leaderT replicated the entry on a majority of the cluster,and leaderU received votes from a majority of the cluster. Thus, at least one server (“the voter”)both accepted the entry from leaderT and voted for leaderU, as shown in Figure 9. The voter is key to reaching a contradiction.

1.提交的条目必须在当选的LeaderU日志中不存在(领导者永远不会删除或覆盖条目)。

2. LeaderT在大多数集群中复制了该条目,leaderU从大多数集群中获得了投票。因此,至少有一个服务器(“投票者”)都接受了leaderT的条目并投票给leaderU,如图9所示。投票者是达成矛盾的关键。

[etcd] Raft理论学习(二)(选举+复制日志)

3. The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it would have rejected the AppendEntries request from leaderT (its current termwould have been higher thanT).

4. The voter still stored the entry when it voted for leaderU, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.
5. The voter granted its vote to leaderU, so leaderU’s log must have been as up-to-date as the voter’s. This leads to one of two contradictions.
6. First, if the voter and leaderU shared the same last log term, then leaderU’s log must have been at least as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the voter contained the committed entry and leaderU was assumed not to.
7. Otherwise, leaderU’s last log term must have been larger than the voter’s. Moreover, it was larger than T, since the voter’s last log termwas at least T (it contains the committed entry from term T). The earlier leader that created leaderU’s last log entry must have contained the committed entry in its log (by assumption). Then, by the LogMatching Property, leaderU’s log must also contain the committed entry, which is a contradiction.
8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.

9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).

3.投票者必须在对LeaderU进行投票之前已经接受了LeaderT提交的条目;否则,它将拒绝来自LeaderT的AppendEntries请求(其当前条件将高于T)。

4.投票者投票投票给leaderU时,投票者仍然存储该条目,因为每个干预的领导者都包含该条目(假设),领导者从不删除条目,而追随者仅在与领导者冲突时才删除条目。

5.投票人将投票结果授予了leaderU,因此leaderU的日志必须与投票人的日志一样最新。这导致两个矛盾之一。

6.首先,如果选民和LeaderU共享相同的最后一个日志条款,那么LeaderU的日志必须至少与选民的日志一样长,因此其日志包含选民日志中的每个条目。这是一个矛盾,因为选民中包含已提交的条目,而leaderU被假定为不包含。

7.否则,leaderU的上一个登录词必须大于选民的词。此外,它比T大,因为选民的上一个对数术语至少为T(它包含来自T的承诺条目)。创建leaderU的最后一个日志条目的较早的领导者必须在其日志中包含提交的条目(通过假设)。然后,通过LogMatching属性,leaderU的日志还必须包含已提交的条目,这是矛盾的。

8.这样就完成了矛盾。因此,所有大于T的项的领导者必须包含在T项中提交的所有T项。

9.日志匹配属性保证未来的领导者还将包含间接提交的条目,例如图8(d)中的索引2。

Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which
states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a
different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be
identical to the leader’s log up through that entry and the entry must be committed. Now consider the lowest term
in which any server applies a given log index; the Log Completeness Property guarantees that the leaders for all
higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value.
Thus, the State Machine Safety Property holds.

给定领导者完整性属性,我们可以从图3证明状态机安全性属性, 指出如果服务器已在其状态机上将给定索引的日志条目应用于其状态机,则其他任何服务器都将不会应用 同一索引的不同日志条目。服务器将日志条目应用于其状态机时,其日志必须为 与领导者通过该条目进行的登录相同,并且必须提交该条目。现在考虑最低期限 任何服务器在其中应用给定的日志索引;日志完整性属性可确保所有人的领导者 较高的术语将存储相同的日志条目,因此在以后的术语中应用索引的服务器将应用相同的值。 因此,状态机安全属性成立。

Finally, Raft requires servers to apply entries in log index order. Combined with the StateMachine Safety Property, this means that all servers will apply exactly the same set of log entries to their state machines, in the same order.

最后,Raft要求服务器以日志索引顺序应用条目。结合StateMachine安全属性,这意味着所有服务器将以相同的顺序将完全相同的日志条目集应用于其状态机。

5.5 Follower and candidate crashes
Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it ignores those entries in the new request.

5.5追随者和候选者崩溃

到目前为止,我们只关注领导者的失败。跟随者和候选者崩溃比领导者崩溃更容易处理,并且两者的处理方式相同。如果关注者或候选者崩溃,则将来发送给它的RequestVote和AppendEntries RPC将失败。 Raft通过无限期重试来处理这些故障;如果崩溃的服务器重新启动,则RPC将成功完成。如果服务器在完成RPC之后但在响应之前崩溃了,那么它将在重新启动后再次收到相同的RPC。筏式RPC是幂等的,因此不会造成伤害。例如,如果关注者收到一个AppendEntries请求,其中包括其日志中已经存在的日志条目,则它会忽略新请求中的那些条目。

5.6 Timing and availability

One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happensmore quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft  cannot make progress.

5.6时间和可用性

我们对Raft的要求之一是安全性不得取决于时间安排:系统不得仅因为某些事件比预期的快或慢发生而产生错误的结果。但是,可用性(系统及时响应客户端的能力)必须不可避免地取决于时间。例如,如果消息交换花费的时间比服务器崩溃之间的典型时间长,则候选人的停留时间不会足够长,无法赢得选举。没有稳定的领导者,Raft无法取得进步。

Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:

*选举是Raft最重要的方面,时机至关重要。只要系统满足以下时序要求,Raft将能够选举和维持稳定的领导者:

broadcastTime≪electionTimeout≪MTBF

In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server
in the cluster and receive their responses; electionTime-out is the election timeout described in Section 5.2; and
MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude
less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers
from starting elections; given the randomized approach used for election timeouts, this inequality also
makes split votes unlikely. The election timeout should be  a few orders of magnitude less thanMTBF so that the system
makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout;
we would like this to represent only a small fraction of overall time.

在这种不平等的情况下,broadcastTime是服务器向每个服务器并行发送RPC所花费的平均时间 在集群中并收到他们的回应; lectionTime-out是第5.2节所述的选举超时;和 MTBF是单个服务器发生故障之间的平均时间。广播时间应该是一个数量级 小于选举超时时间,以便领导者可以可靠地发送保持跟随者所需的心跳消息 从开始选举;考虑到用于选举超时的随机方法,这种不平等也 使得分裂投票的可能性不大。选举超时应该比MTBF小几个数量级,以便系统 稳步前进。领导者崩溃时,该系统将在大约选举超时时间内不可用; 我们希望这只代表总时间的一小部分。

The broadcast time and MTBF are properties of the underlying system, while the election timeout is something
we must choose. Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast
time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.

广播时间和MTBF是基础系统的属性,而选举超时是 我们必须选择。Raft的RPC通常要求收件人将信息持久保存到稳定的存储中,因此广播 时间可能从0.5毫秒到20毫秒不等,具体取决于存储技术。结果,选举超时可能在10毫秒至500毫秒之间。典型的服务器MTBF长达数月或更长时间,可以轻松满足计时要求。