Reliability fail: No automated zone failover for Coinbase’s global trading service
Coinbase's global trading service suffered a 10-hour outage during a regional AWS outage, exposing the company's dependency on a single AWS availability zone. Three of the five matching-engine nodes went down and the cluster lost quorum; with no automated failover to another availability zone, recovery took hours. Coinbase's postmortem explained that the company deliberately runs its matching engine in a single availability zone to meet latency and throughput demands. The practical implication for engineers building AI systems is to weigh latency, throughput, and availability against each other when designing distributed systems.
⚡ Key Takeaways
- Coinbase's matching engine was pinned to a single building, running as a Raft-based replicated cluster inside an AWS Cluster Placement Group.
- The company lacked an automated ability to fail over to another availability zone, leading to a 10-hour outage.
- Running the matching engine across more than one availability zone would add too much latency for Coinbase's product, but that does not remove the need to prepare a failover path.
- Recovery required an emergency code change, creation of a new node group, and a careful sequence to restore a 3-of-5 quorum.
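The quorum arithmetic behind these takeaways can be sketched in a few lines. This is a minimal illustration of majority-quorum math for a Raft-style cluster, not Coinbase's code; the function names are hypothetical and chosen only for this example.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest majority of the cluster: floor(n / 2) + 1."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while a majority still remains."""
    return cluster_size - quorum_size(cluster_size)


def has_quorum(cluster_size: int, healthy_nodes: int) -> bool:
    """True if enough nodes are healthy to form a majority."""
    return healthy_nodes >= quorum_size(cluster_size)


if __name__ == "__main__":
    nodes = 5
    print(quorum_size(nodes))          # 3
    print(tolerated_failures(nodes))   # 2
    print(has_quorum(nodes, nodes - 3))  # False: 2 healthy nodes is below the quorum of 3
```

A five-node cluster tolerates two failures, so losing three nodes at once leaves it unable to elect a leader or commit writes until a third healthy node rejoins.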
🔧 Tools & Libraries
The outage shows why distributed systems should be designed with availability and failover in mind, even at the cost of some additional latency. Engineers building AI systems face the same tradeoff between latency, throughput, and availability when targeting high uptime.
✅ Practical Steps
- Consider the tradeoffs between latency, throughput, and availability when designing distributed systems.
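- Implement automated failover to another availability zone so that losing one zone does not require an emergency code change to recover.
- AWS Cluster Placement Groups suit latency-sensitive replicated clusters, but they keep all nodes inside one availability zone, so prepare for failover scenarios as well.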
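As a rough illustration of the steps above, an automated failover decision might look like the sketch below. Everything here is an assumption made for the example: the `ZoneStatus` and `Action` types, the `decide` function, the zone names, and the 60-second grace period are hypothetical and do not describe Coinbase's system or any AWS API.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STAY = "stay"            # keep serving from the primary zone
    FAIL_OVER = "fail_over"  # promote the standby zone
    ESCALATE = "escalate"    # no healthy zone available, page a human


@dataclass(frozen=True)
class ZoneStatus:
    name: str
    total_nodes: int
    healthy_nodes: int

    def has_quorum(self) -> bool:
        # Majority quorum: floor(n / 2) + 1 healthy nodes.
        return self.healthy_nodes >= self.total_nodes // 2 + 1


def decide(
    primary: ZoneStatus,
    standby: ZoneStatus,
    seconds_without_quorum: float,
    grace_period_s: float = 60.0,  # assumed value; tune for the workload
) -> Action:
    """Pick an action based on quorum health in each zone."""
    if primary.has_quorum():
        return Action.STAY
    if seconds_without_quorum < grace_period_s:
        # Give the primary a short window to recover before moving.
        return Action.STAY
    if standby.has_quorum():
        return Action.FAIL_OVER
    return Action.ESCALATE


if __name__ == "__main__":
    primary = ZoneStatus("zone-a", total_nodes=5, healthy_nodes=2)
    standby = ZoneStatus("zone-b", total_nodes=5, healthy_nodes=5)
    print(decide(primary, standby, seconds_without_quorum=120.0).name)
```

The sketch covers only the decision. A real failover also has to replicate state to the standby zone and prevent both zones from accepting writes at the same time.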
Read on Pragmatic Engineer ↗