Introduction
High Availability (HA) is the art of building computer systems that deliver uninterrupted service, even through unexpected failures. In 2026, with the rise of microservices, hybrid clouds, and mission-critical workloads like real-time AI, aiming for five-nines availability (99.999% uptime, roughly five minutes of downtime per year) isn't optional; it's a business imperative.
Why does it matter? A one-hour outage for an e-commerce site can cost 1 million euros (Gartner 2025 study). This advanced tutorial digs into the theory: from SLI/SLO/SLA to multi-region architectures and chaos engineering, with actionable conceptual frameworks, concrete analogies (like a redundant symphony orchestra), and real cases (Netflix, AWS outages). You'll walk away with a mental blueprint to audit and upgrade any existing system.
Goal: Shift from reactive to proactive resilience, quantifying every decision with precise metrics.
Prerequisites
- Strong grasp of distributed systems (CAP Theorem, consensus algorithms like Raft).
- Experience with cloud platforms (AWS, GCP, Azure) and containers (Kubernetes).
- Knowledge of observability tools (Prometheus, Jaeger, ELK).
- Familiarity with reliability metrics (MTTR, MTBF).
1. Theoretical Foundations: SLI, SLO, and SLA
It all starts with measurement. An SLI (Service Level Indicator) quantifies health: for example, the HTTP request success rate over a 5-minute rolling window (target >99.9%), like a thermometer for an engine.
An SLO (Service Level Objective) sets the internal target: "95% of requests complete in <200ms over 30 days." Analogy: it's your personal contract with the team, binding but not punitive.
Finally, the SLA (Service Level Agreement) is contractual: "99.5% uptime or a 10% revenue penalty." Real-world example: Google SRE uses an Error Budget = 100% - SLO. Once the budget is exhausted, new features halt so stability takes priority.
| Metric | Example Formula | Typical SLO Threshold |
|---|---|---|
| Availability | (uptime / total time) * 100 | 99.99% |
| P95 Latency | 95th percentile of request latency | <500ms |
| Error Rate | errors / total requests | <0.1% |
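The error-budget idea above can be made concrete with a short sketch. The function names (`error_budget_minutes`, `budget_remaining`) are illustrative, not from any standard library:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an SLO over a rolling window, in minutes."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

def budget_remaining(slo: float, observed_downtime_min: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative = overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - observed_downtime_min) / budget

# A 99.9% SLO over 30 days allows 43.2 minutes of downtime.
print(error_budget_minutes(0.999))    # 43.2
# Half the budget spent: when this nears 0, halt feature work.
print(budget_remaining(0.999, 21.6))  # 0.5
```

This is the arithmetic behind the "halt new features" rule: the remaining budget, not gut feeling, decides when stability work takes priority.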
2. Redundancy Architectures: Active-Active vs Active-Passive
Active-Passive (cold failover): One primary active node, secondaries on standby. Pro: Simplicity. Con: Switchover time ~30s-5min (high MTTR). Ideal for databases like PostgreSQL with streaming replication.
Active-Active (hot load balancing): All nodes handle traffic. Uses consistent hashing for session stickiness. Example: NGINX API Gateway + Consul for health checks.
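The consistent hashing used for session stickiness can be sketched as a ring with virtual nodes; `HashRing` and its parameters are hypothetical names for illustration, not a production library:

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Consistent-hash ring with virtual nodes: removing one node
    only remaps the keys that node owned, not the whole keyspace."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First ring position clockwise from the key's hash owns the key.
        idx = bisect_right(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
# The same session key always routes to the same node while membership is stable.
print(ring.node_for("session-42"))
```

Virtual nodes are the key design choice: they smooth the key distribution so a three-node ring doesn't end up with one node owning most of the traffic.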
Case study: AWS Elastic Load Balancing (ALB) across multiple AZs. During the 2021 US-East-1 Outage, cross-region Active-Active systems maintained 99.99% vs 99.9% single-region.
Design checklist:
- N+1 sizing: Capacity = peak load + 1 instance.
- Network affinity: Same subnet for <1ms latency.
- Quorum reads/writes: (N/2)+1 for consistency.
Next step: Target Active-Active for >99.99%.
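The quorum rule from the checklist, (N/2)+1, fits in a few lines and shows why odd cluster sizes are preferred:

```python
def quorum(n: int) -> int:
    """Minimum votes for a majority: floor(n/2) + 1."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a quorum survives."""
    return n - quorum(n)

for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 3 nodes tolerate 1 failure; 4 nodes still tolerate only 1;
# 5 nodes tolerate 2 — the even node buys capacity, not resilience.
```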
3. Advanced Strategies: Multi-Region and Data Replication
Beyond AZs (Availability Zones), global HA demands multi-region setups. RPO (Recovery Point Objective) is effectively zero with synchronous replication (e.g., Galera Cluster) and typically <15min with asynchronous replication (MySQL async repl).
Circuit Breaker pattern: Prevents cascading failures (e.g., Hystrix/Resilience4j). If downstream >50% failures, open circuit → fallback.
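A minimal, illustrative circuit breaker, mirroring the 50% failure threshold mentioned above. This is a sketch of the pattern, not the Hystrix or Resilience4j API, whose real interfaces differ:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: trips open on a high failure rate,
    fails fast during a cooldown, then retries (half-open)."""
    def __init__(self, failure_threshold=0.5, window=10, cooldown=30.0):
        self.failure_threshold = failure_threshold  # open above this failure rate
        self.window = window                        # recent calls tracked
        self.cooldown = cooldown                    # seconds before half-open retry
        self.results = []                           # sliding window of outcomes
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()        # open: fail fast with the fallback
            self.opened_at = None        # half-open: let one real call through
        try:
            result = fn()
            self._record(True)
            return result
        except Exception:
            self._record(False)
            if self._failure_rate() > self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()

    def _record(self, ok):
        self.results = (self.results + [ok])[-self.window:]

    def _failure_rate(self):
        return self.results.count(False) / len(self.results)

cb = CircuitBreaker()
def flaky():
    raise RuntimeError("downstream down")
for _ in range(3):
    print(cb.call(flaky, fallback=lambda: "cached response"))
```

The point of the open state is to stop hammering an already-failing dependency, which is exactly how cascading failures get cut short.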
Leader Election: Via etcd/ZooKeeper with ephemeral leases (TTL 10s). Analogy: Presidential election with automatic recount if quorum lost.
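Lease-based election can be simulated with an in-memory stand-in for etcd. `LeaseStore` is a toy model for illustration, not the etcd client API:

```python
class LeaseStore:
    """In-memory stand-in for an etcd lease: one key, and the holder
    stays leader only while it keeps renewing before the TTL expires."""
    def __init__(self, ttl=10.0):
        self.ttl = ttl
        self.holder = None
        self.expires = 0.0

    def try_acquire(self, node: str, now: float) -> bool:
        # Acquire if the lease is free, expired, or already ours (renewal).
        if self.holder is None or now >= self.expires or self.holder == node:
            self.holder, self.expires = node, now + self.ttl
            return True
        return False

store = LeaseStore(ttl=10.0)
print(store.try_acquire("node-1", now=0.0))   # True: node-1 is leader
print(store.try_acquire("node-2", now=5.0))   # False: lease still live
print(store.try_acquire("node-2", now=11.0))  # True: node-1 missed renewal
```

A real deployment would renew well inside the TTL (e.g., every TTL/3) so a single missed heartbeat doesn't trigger a spurious failover.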
Netflix case study: Chaos Monkey + Spinnaker for cross-region canary deployments. Result: 99.999% since 2015, despite 100+ simulated failures/day.
Decision framework:
- Assess blast radius per component.
- Implement geo-routing (Route53 latency-based).
- Test RTO <60s with traffic steering.
4. Observability and Chaos Engineering
Golden Signals (Google SRE): Latency, Traffic, Errors, Saturation. Trace with OpenTelemetry: distributed spans for root cause in <5min.
Chaos Engineering: Inject faults (pod kills, network partitions). Tools: LitmusChaos, Gremlin. Hypothesis: "If I kill 33% of replicas, does SLO hold?"
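The hypothesis above ("if I kill 33% of replicas, does the SLO hold?") reduces, to a first approximation, to a capacity check. The function and its parameters are illustrative assumptions, not a real chaos tool's API:

```python
def slo_holds_after_kill(replicas: int, kill_fraction: float,
                         per_replica_capacity_rps: int, peak_rps: int) -> bool:
    """Chaos hypothesis check: does surviving capacity still cover peak traffic?"""
    killed = round(replicas * kill_fraction)
    survivors = replicas - killed
    return survivors * per_replica_capacity_rps >= peak_rps

# 9 replicas, kill 33% (3 pods): 6 survive, 6 * 200 rps = 1200 rps >= 1000 rps peak
print(slo_holds_after_kill(9, 0.33, 200, 1000))  # True
```

An actual experiment also has to confirm the latency SLO under the reduced fleet; raw throughput headroom is a necessary condition, not a sufficient one.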
Real example: LinkedIn injects 1% packet loss weekly → MTTR down 40%.
Dashboard alert matrix (sketched as a table, to be mirrored in Grafana):
| Signal | Alert Threshold | Auto Action |
|---|---|---|
| CPU | >90% sustained 5min | Scale out |
| Error rate | >1% sustained 1min | Open circuit |
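A rule table like the one above could drive auto-remediation with a few lines of glue; signal names and action strings here are placeholders, not any real alerting API:

```python
# Illustrative mapping of the alert matrix to auto-remediation actions.
ALERT_RULES = [
    # (signal, predicate on the latest sample, sustain period in seconds, action)
    ("cpu_utilization", lambda v: v > 0.90, 300, "scale_out"),
    ("error_rate",      lambda v: v > 0.01,  60, "open_circuit"),
]

def actions_for(samples: dict) -> list:
    """Return actions whose predicate holds on the current samples
    (the sustain-period check is omitted in this sketch)."""
    return [action for signal, pred, _sustain, action in ALERT_RULES
            if pred(samples.get(signal, 0.0))]

print(actions_for({"cpu_utilization": 0.95, "error_rate": 0.002}))  # ['scale_out']
```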
Best Practices
- Diversity Principle: Avoid single-vendor lock-in (e.g., EKS + GKE fallback).
- Immutable Infrastructure: Blue-green deployments for zero-downtime.
- Automate Failover: Pacemaker/Corosync for <10s switch.
- Quantify Everything: Build an SLO tree to decompose targets; end-to-end availability is the product of the dependency SLOs, so a frontend can never promise more than its backend delivers.
- Blameless Post-Mortems: Spend 70% of the time on preventive actions (as at Uber).
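The SLO-tree decomposition in the list above rests on multiplying serial dependencies, since the chain is only up when every link is:

```python
from math import prod

def composite_availability(*dependency_slos: float) -> float:
    """Availability of a serial chain: the product of its dependencies'
    availabilities (each link must be up for the chain to be up)."""
    return prod(dependency_slos)

# A frontend depending on a backend and a database (illustrative numbers):
print(round(composite_availability(0.999, 0.999, 0.9995), 4))  # 0.9975
```

Note the direction of the effect: stacking three "pretty good" dependencies yields a worse end-to-end figure than any single one of them, which is why SLO trees are decomposed top-down from the customer-facing target.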
Common Pitfalls to Avoid
- Shared Fate Fallacy: Everything in one AZ → total outage (e.g., CapitalOne 2019).
- Silent Failures: No health checks → traffic to zombie nodes.
- Stateful Ignorance: Lost sessions on failover without sticky sessions or Redis.
- Over-Engineering: 5x9 HA for a blog → 10x costs with no ROI.
Further Reading
Dive into official resources:
Expert training: Check out our Learni courses on cloud-native architecture. Recommended certifications: CKAD, AWS Solutions Architect Professional.