Pragmatic Engineer

Reliability fail: No automated zone failover for Coinbase’s global trading service

June 23, 2026 • 6 min read

Level: Advanced

For: Cloud Engineers, Distributed Systems Engineers

✦TL;DR

Coinbase's global trading service experienced a 10-hour outage due to an AWS outage that exposed the company's dependency on a single availability zone. The cluster lost quorum when three of five matching-engine nodes went down, and without automated zone failover the service could not recover on its own. Coinbase's postmortem revealed that the company deliberately chose to run its matching engine in a single availability zone to meet latency and throughput demands. The practical implication for engineers building distributed systems is to weigh latency, throughput, and availability against one another at design time.

⚡ Key Takeaways

Coinbase's matching engine was pinned to a single building, running as a Raft-based replicated cluster inside an AWS Cluster Placement Group.
The company lacked an automated ability to fail over to another availability zone, leading to a 10-hour outage.
Running across more than one availability zone would add too much latency for Coinbase's product, but preparing a failover path is still crucial.
Recovery required an emergency code change, creation of a new node group, and a careful sequence to restore a 3-of-5 quorum.
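
The quorum arithmetic behind these takeaways can be sketched in a few lines. This is a minimal illustration of the standard Raft majority rule, not Coinbase's code; the function names are hypothetical.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of voting nodes that forms a Raft majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy_nodes: int) -> bool:
    """True if enough voting nodes are healthy to elect a leader and commit."""
    if not 0 <= healthy_nodes <= cluster_size:
        raise ValueError("healthy_nodes must be between 0 and cluster_size")
    return healthy_nodes >= majority(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many voting nodes can fail before quorum is lost."""
    return cluster_size - majority(cluster_size)


if __name__ == "__main__":
    size = 5
    print("majority needed:", majority(size))
    print("failures tolerated:", tolerated_failures(size))
    # Three of five nodes down leaves two healthy nodes.
    print("quorum with 2 healthy nodes:", has_quorum(size, 2))
```

A five-node cluster needs three healthy voters and tolerates two failures, so losing three nodes at once leaves it unable to make progress until a 3-of-5 majority is restored.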

🔧 Tools & Libraries

AWS EC2

💡 Why It Matters

The outage highlights the importance of designing distributed systems with availability and failover in mind, even if it means introducing additional latency. Engineers building distributed systems must weigh the tradeoffs between latency, throughput, and availability to keep uptime and reliability high.

✅ Practical Steps

Consider the tradeoffs between latency, throughput, and availability when designing distributed systems.
Implement automated zone failover to ensure high uptime and reliability.
Use AWS Cluster Placement Groups to run replicated clusters, but also prepare for failover scenarios.
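
One way to think about the trigger for an automated zone failover is a debounced quorum check. The sketch below is purely illustrative and assumes a hypothetical `FailoverMonitor` class and a hypothetical threshold of three consecutive failed health checks; it is not how Coinbase or AWS implements failover, and the actual cutover to a standby zone is out of scope here.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverMonitor:
    """Signals a failover once quorum has been missing for several checks."""

    cluster_size: int
    # Hypothetical debounce threshold: consecutive failed checks required
    # before signalling, so a brief blip does not trigger a zone move.
    required_failed_checks: int = 3
    failed_streak: int = field(default=0, init=False)

    def observe(self, healthy_nodes: int) -> bool:
        """Record one health check; return True if failover should start."""
        quorum = self.cluster_size // 2 + 1
        if healthy_nodes >= quorum:
            # Quorum is intact, so reset the streak.
            self.failed_streak = 0
            return False
        self.failed_streak += 1
        return self.failed_streak >= self.required_failed_checks


if __name__ == "__main__":
    monitor = FailoverMonitor(cluster_size=5)
    # Simulated health checks: healthy node counts over time.
    for healthy in [5, 2, 2, 2]:
        print(healthy, "healthy ->", "fail over" if monitor.observe(healthy) else "wait")
```

The debounce reflects a common design choice: moving a latency-sensitive cluster to another zone is disruptive, so the signal should fire only when the loss of quorum looks sustained.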
