OK, I don’t particularly like calling a bug fantastic; in this case, it is more a fantastic piece of troubleshooting. What I found interesting was the layers that were peeled back one by one to reach the probable region of the root cause. (Yeah, the root cause is probably so esoteric, and confined to such a specific combination of versions, that it is unlikely to be looked at by anybody.)
Here is PagerDuty’s summary of the bug:
After more than a month of tireless research and testing, we have finally got to the bottom of our ZooKeeper mystery. Corruption during AES encryption in Xen v4.1 or v3.4 paravirtual guests running a Linux 3.0+ kernel, combined with the lack of TCP checksum validation in IPSec Transport mode, which leads to the admission of corrupted TCP data on a ZooKeeper node, resulting in an unhandled exception from which ZooKeeper is unable to recover. Jeez. Talk about a needle in a haystack… Even after all this, we are still unsure where precisely the bug lies. Despite that fact, we’re still pretty satisfied with the outcome of the investigation. Now all we need to do is work around it.
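To make the "lack of TCP checksum validation" part of that summary a bit more concrete, here is a minimal sketch of the 16-bit Internet checksum (RFC 1071) that TCP uses, and how it would normally flag a flipped bit. This is a simplification and not PagerDuty's code: a real TCP checksum also covers a pseudo-header and the TCP header, and the payload, the helper names `internet_checksum` and `verify`, and the flipped bit are all made up for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit big-endian words, as in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verify(data: bytes, expected: int) -> bool:
    """Return True if the data still matches the checksum the sender computed."""
    return internet_checksum(data) == expected


# Hypothetical payload; the sender computes the checksum before transmission.
payload = b"zookeeper-session-data"
checksum = internet_checksum(payload)

# Simulate corruption somewhere along the path: flip a single bit.
corrupted = bytearray(payload)
corrupted[3] ^= 0x04
corrupted = bytes(corrupted)

print(verify(payload, checksum))    # intact data passes
print(verify(corrupted, checksum))  # corrupted data is rejected
```

A receiver that runs this check drops the damaged segment and waits for a retransmission. If the check is skipped, as the summary says happened in IPSec Transport mode, the corrupted bytes are handed straight to the application, which is how they ended up inside ZooKeeper.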