Developing and maintaining distributed systems like Hadoop is difficult.
The difficulty has many causes, but we believe one of the most important is the lack of a good debugger for bugs specific to distributed systems (e.g., non-deterministic hardware faults and message ordering).
In this talk, we will present Earthquake, our open-source debugging framework for distributed systems.
Earthquake permutes Ethernet packets, filesystem events, Java/C function calls, and injected faults in various orders so as to control non-determinism in the cluster.
By default, Earthquake permutes events in random order, but users can write their own state exploration policies (in Go) to find deep bugs efficiently.
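
As a rough illustration of the idea of a pluggable exploration policy, here is a minimal Go sketch. It is not Earthquake's actual API: the `Event` type, the `Policy` interface, `RandomPolicy`, and `FaultsLastPolicy` are all hypothetical names invented for this example. It shows a seeded random ordering as the default and a custom policy layered on top of it.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Event is a hypothetical stand-in for one intercepted event.
type Event struct {
	Node string // node that produced the event
	Kind string // "packet", "fs", "call", or "fault"
	ID   int
}

// Policy is a hypothetical interface: given the buffered events,
// return the order in which they should be released.
type Policy interface {
	Order(events []Event) []Event
}

// RandomPolicy releases events in a seeded random order.
type RandomPolicy struct{ rng *rand.Rand }

func NewRandomPolicy(seed int64) *RandomPolicy {
	return &RandomPolicy{rng: rand.New(rand.NewSource(seed))}
}

func (p *RandomPolicy) Order(events []Event) []Event {
	out := append([]Event(nil), events...) // copy; leave the input untouched
	p.rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

// FaultsLastPolicy is a custom policy: shuffle like RandomPolicy, then
// move injected faults behind all other events (a stable partition).
type FaultsLastPolicy struct{ base *RandomPolicy }

func (p *FaultsLastPolicy) Order(events []Event) []Event {
	shuffled := p.base.Order(events)
	out := make([]Event, 0, len(shuffled))
	var faults []Event
	for _, e := range shuffled {
		if e.Kind == "fault" {
			faults = append(faults, e)
		} else {
			out = append(out, e)
		}
	}
	return append(out, faults...)
}

func main() {
	events := []Event{
		{"n1", "packet", 1}, {"n2", "fs", 2}, {"n1", "fault", 3}, {"n3", "call", 4},
	}
	var policy Policy = &FaultsLastPolicy{base: NewRandomPolicy(42)}
	for _, e := range policy.Order(events) {
		fmt.Printf("%s %s #%d\n", e.Node, e.Kind, e.ID)
	}
}
```

Seeding the random source makes a given ordering replayable, which is what lets a failing permutation be run again once it has been found.
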
Earthquake also controls non-determinism in thread interleaving by calling sched_setattr(2) with randomized parameters.
We will also share our success stories from testing several Hadoop components with Earthquake.
For ZooKeeper, we found a distributed race condition bug that reduces the availability of a ZooKeeper cluster.
We also reproduced a known ZooKeeper bug that no one had successfully reproduced for 2 years, and analyzed its cause.
For YARN, we found a disk-fault tolerance bug that incorrectly marks a faulty node as healthy.
We also found bugs in non-Hadoop software, such as etcd.
With Earthquake, you can also test your real distributed systems without any modification.