Changes to prevent deadlocks and loops on elections #7
Hi, Pedro. I'm using skiff-algorithm in a project, and in some situations the nodes would enter loops or deadlocks when I restarted them. I made a few changes that you may want to incorporate. The following text is essentially what I wrote in my commit message to explain them.

Because candidates first vote for themselves, when all nodes are candidates
the likely result of the election is a draw. Converting to follower after a
failed election increases the chance that the next election won't result in
a draw.

If a leader never forgets that it voted for itself, it can prevent
candidates from stepping in, because State.onRequestVote() [state.js] will
always deny the vote to any candidate other than the leader itself. As the
leader is by definition not a candidate, this could result in a
deadlock. The same problem applies to followers. The simplest
solution I found was to remember the term of the vote, so that when a
newer term is seen, the node can forget its previous vote.

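
A minimal sketch of the term-scoped vote described above. It is not the code in state.js: `VoteTracker`, `votedFor`, `votedTerm`, and the arguments to this `onRequestVote` are hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch: not the actual skiff-algorithm implementation.
class VoteTracker {
  constructor() {
    this.votedFor = null; // id of the node this node voted for, if any
    this.votedTerm = 0;   // term in which that vote was cast
  }

  // Decide whether to grant a vote to `candidateId` for `term`.
  onRequestVote(candidateId, term) {
    // Requests from an older term than our last vote are stale.
    if (term < this.votedTerm) return false;

    // A vote is only binding for the term it was cast in:
    // seeing a newer term lets the node forget its previous vote.
    if (term > this.votedTerm) {
      this.votedFor = null;
    }

    if (this.votedFor === null || this.votedFor === candidateId) {
      this.votedFor = candidateId;
      this.votedTerm = term;
      return true;
    }
    return false;
  }
}
```

With this shape, a node that voted for itself in term 1 still denies other candidates in term 1, but can grant its vote to a candidate that arrives with term 2.
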
During an election, if a vote request is answered with 'not granted'
and reason 'too soon', another node has recently seen the leader alive.
This is a hint that the candidate should convert to follower.

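
The election rules described so far (step down after a failed election, and step down on a 'too soon' denial) could be sketched as follows. The function name, the reply shape (`voteGranted`, `reason`), and the `clusterSize` parameter are assumptions made for illustration, not the library's actual API.

```javascript
// Hypothetical sketch: names and reply shape are assumptions.
function nextStateAfterElection(replies, clusterSize) {
  // A 'too soon' denial means some peer heard from a live leader recently,
  // so the candidate should stop competing.
  const leaderSeen = replies.some(
    (r) => !r.voteGranted && r.reason === 'too soon'
  );
  if (leaderSeen) return 'follower';

  // Count the candidate's own vote plus every granted reply.
  const votes = 1 + replies.filter((r) => r.voteGranted).length;
  const majority = Math.floor(clusterSize / 2) + 1;

  // Without a majority the election failed (for example, a draw):
  // fall back to follower instead of staying candidate.
  return votes >= majority ? 'leader' : 'follower';
}
```

For example, in a three-node cluster a candidate that receives one granted vote reaches the majority of two, while a candidate whose peers both voted for themselves steps down to follower.
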
When a node converts to follower, it may take some extra time
to set up the connection and the event handlers needed to receive an
initial heartbeat from the leader, and the timeout could move the node
to candidate in the meantime. If that node is behind in
the log, it can't be elected, which could leave it stuck as candidate
(or move it back to follower after the change mentioned above,
which in turn could result in a loop).

When a leader gets a timeout from a peer, it may be necessary to
reset the connection.