Providing full data availability in a degraded cluster requires
a mechanism for automatically coordinating changes to cluster membership.
While developing our GPL-licensed distributed storage solution, we surveyed
the state of the art in consensus algorithms for achieving fault tolerance.
Our first stop was Paxos, which turned out to be complex and to lack an accurate,
comprehensive specification. Without a reference implementation, confusion reigns supreme, and
analyzing all the different variants would have taken us at least three months.
By contrast, Raft defines a single, clean way to implement reliable leader
election. Each node decides autonomously whether to participate in the cluster,
which keeps the cluster responsive while preserving data availability.
In this lightning talk, I will recount the road bumps we hit while implementing
the consensus algorithm and the headaches we had before we ditched Paxos.
With this head start on Raft, I hope to save my fellow developers from the pain
we went through!