Pragmatic Engineer

Reliability fail: No automated zone failover for Coinbase’s global trading service

6 min read
#deployment #compute #enterprise
Level: Advanced
For: Cloud Engineers, Distributed Systems Engineers
TL;DR

Coinbase's global trading service was down for 10 hours after an AWS outage exposed its dependency on a single availability zone. Three of the five matching-engine nodes went down, so the cluster lost quorum, and with no automated zone failover in place there was no quick path to recovery. Coinbase's postmortem says the company deliberately ran its matching engine in a single availability zone to meet latency and throughput demands. For engineers building AI systems, the practical lesson is to weigh latency and throughput against availability when designing distributed systems.

⚡ Key Takeaways

  • Coinbase's matching engine was pinned to a single building, running as a Raft-based replicated cluster inside an AWS Cluster Placement Group.
  • The company lacked an automated ability to fail over to another availability zone, leading to a 10-hour outage.
  • Coinbase judged that running across more than one availability zone would add too much latency for its product; even so, a prepared failover path to another zone is crucial.
  • Recovery required an emergency code change, creation of a new node group, and a careful sequence to restore a 3-of-5 quorum.
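
The quorum arithmetic behind these takeaways can be sketched in a few lines. This is a minimal illustration of the general majority rule used by Raft-style clusters, not Coinbase's code; the function names are hypothetical.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest majority of a cluster: floor(n / 2) + 1."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy_nodes: int) -> bool:
    """True if enough voting members are up to elect a leader and commit writes."""
    return healthy_nodes >= quorum_size(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail before the cluster loses its majority."""
    return cluster_size - quorum_size(cluster_size)


if __name__ == "__main__":
    # A five-node cluster needs three healthy nodes and tolerates two failures.
    print(quorum_size(5), tolerated_failures(5))
    # Losing three of five nodes leaves two, which is below the majority.
    print(has_quorum(5, 5 - 3))
```

With five nodes, the cluster survives two failures but not three, which is why losing three matching-engine nodes stopped the service and why recovery had to rebuild a 3-of-5 majority.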

🔧 Tools & Libraries

AWS, EC2
💡 Why It Matters

The outage highlights the importance of designing distributed systems with availability and failover in mind, even at the cost of some additional latency. Engineers building AI systems need to weigh latency, throughput, and availability against each other to keep uptime high.

✅ Practical Steps

  1. Consider the tradeoffs between latency, throughput, and availability when designing distributed systems.
  2. Implement automated zone failover to ensure high uptime and reliability.
  3. Use AWS Cluster Placement Groups to run replicated clusters, but also prepare for failover scenarios.
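
As a rough sketch of the decision step in an automated zone failover, the logic below checks whether the primary zone still holds a majority and, if not, picks a standby zone that can form one. Everything here is an assumption for illustration: the `ZoneStatus` type, the zone names, and the selection rule are hypothetical, and a real failover would also have to handle state replication, fencing of the old leader, and traffic cutover.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ZoneStatus:
    """Hypothetical health snapshot for one availability zone."""
    name: str
    total_nodes: int
    healthy_nodes: int


def majority(cluster_size: int) -> int:
    """Smallest majority of a cluster: floor(n / 2) + 1."""
    return cluster_size // 2 + 1


def choose_failover_target(
    primary: ZoneStatus, standbys: Sequence[ZoneStatus]
) -> Optional[str]:
    """Return the standby zone to fail over to, or None if no failover applies.

    No failover happens while the primary still has a majority, or when
    no standby zone has enough healthy nodes to form a majority itself.
    """
    if primary.healthy_nodes >= majority(primary.total_nodes):
        return None
    candidates = [
        zone for zone in standbys
        if zone.healthy_nodes >= majority(zone.total_nodes)
    ]
    if not candidates:
        return None
    # Prefer the standby with the most healthy nodes.
    return max(candidates, key=lambda zone: zone.healthy_nodes).name


if __name__ == "__main__":
    degraded = ZoneStatus("zone-a", total_nodes=5, healthy_nodes=2)
    standby = ZoneStatus("zone-b", total_nodes=5, healthy_nodes=5)
    print(choose_failover_target(degraded, [standby]))
```

Keeping the decision a pure function makes it easy to test against failure scenarios before wiring it to real health checks and infrastructure changes.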
