Patterns for Scalability and Reliability in Systems

This document covers the basic techniques and architectures for solving common scalability and reliability problems in systems.

Overview

When building modern systems, we need to consider how they'll perform under load and how to maintain reliability. This guide introduces common patterns that help address these challenges, organized into the following sections:

  1. Client-Server Architecture
  2. Scalability Patterns
  3. Limitations
  4. Extending Client-Server Architecture
  5. Availability/Reliability Background
  6. Reliability Patterns

Client-Server Architecture

The client-server model is the foundation of most web applications and services. Key concepts include:

  • Client-server communication
  • RESTful API design
  • Stateful vs. stateless services

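To ground these concepts, here is a minimal sketch of a stateless server using only Python's standard library; the /health route and port are illustrative choices, not from any particular stack. Because the handler keeps no per-client state, any number of identical copies could serve the same traffic:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class StatelessHandler(BaseHTTPRequestHandler):
    """Every request carries all the context needed to serve it;
    the server keeps no per-client session state, which is what
    lets us add identical servers later."""

    def do_GET(self):
        if self.path == "/health":   # illustrative REST-style route
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    # Any client (browser, curl, another service) can connect.
    HTTPServer(("localhost", 8080), StatelessHandler).serve_forever()
```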

Scalability Patterns

As systems grow, they face various bottlenecks. Here are common patterns to address these issues:

1. Load Balancing

Problem: I have too many requests! My single server can't take it anymore 😭

Solution: Scale horizontally with a load balancer that distributes requests across multiple servers

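A minimal sketch of the core routing idea, assuming plain round-robin and made-up server names; real balancers (nginx, HAProxy, cloud load balancers) add health checks, connection draining, and weighted policies:

```python
import itertools

class RoundRobinBalancer:
    """Hands each incoming request to the next server in rotation,
    so load spreads evenly across identical stateless servers."""

    def __init__(self, servers):
        self._rotation = itertools.cycle(servers)

    def pick(self):
        return next(self._rotation)

balancer = RoundRobinBalancer(["app-1:8080", "app-2:8080", "app-3:8080"])
for _ in range(6):
    print(balancer.pick())   # app-1, app-2, app-3, app-1, app-2, app-3
```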

2. Caching

Problem: My database is slow and can't handle all these reads 😩

Solution: Cache frequently/recently accessed data to reduce database load

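One common shape for this is cache-aside. A minimal sketch using an in-process dict and a hypothetical fetch_from_database stand-in; a real deployment would put Redis or Memcached in that role:

```python
import time

cache = {}            # in practice: Redis or Memcached, not a dict
TTL_SECONDS = 60

def fetch_from_database(key):
    return f"row-for-{key}"   # placeholder for the slow query

def get(key):
    """Cache-aside: serve from the cache on a hit; on a miss,
    read the database and populate the cache for later reads."""
    entry = cache.get(key)
    if entry is not None:
        value, stored_at = entry
        if time.time() - stored_at < TTL_SECONDS:
            return value                  # cache hit
    value = fetch_from_database(key)      # cache miss
    cache[key] = (value, time.time())
    return value

print(get("user:42"))   # miss: hits the "database"
print(get("user:42"))   # hit: served from the cache
```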

3. Database Sharding/Partitioning

Problem: My database is massive and can't handle all these writes 😭

Solution: Split the database into smaller, more manageable pieces. Designate a partition key to determine which shard to write to.

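A minimal sketch of hash-based routing on the partition key; the shard names are made up, and real systems often use consistent hashing so that adding a shard doesn't remap most keys:

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]   # illustrative

def shard_for(partition_key: str) -> str:
    """Hash the partition key so the same key always routes to the
    same shard, spreading writes across all shards."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))   # deterministic: always the same shard
print(shard_for("user:43"))   # likely a different shard
```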

4. Queueing

Problem: My system is overwhelmed by bursty traffic and can't process requests fast enough 😩

Solution: Use a message queue to manage requests and process them asynchronously

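A minimal in-process sketch using Python's queue module; in production the queue would be an external broker (Kafka, RabbitMQ, SQS) so it survives restarts and spans machines:

```python
import queue
import threading
import time

requests = queue.Queue(maxsize=1000)   # bounded: applies backpressure

def worker():
    while True:
        job = requests.get()       # blocks until work arrives
        time.sleep(0.01)           # stand-in for real processing
        requests.task_done()

threading.Thread(target=worker, daemon=True).start()

# A burst of 100 requests is absorbed by the queue and drained
# at the worker's own pace instead of overwhelming it.
for i in range(100):
    requests.put({"request_id": i})
requests.join()                    # wait for the backlog to clear
```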

Limitations

Even with these patterns, various components have inherent limitations that we need to be aware of:

SQL Databases

Problem: SQL databases hit scalability limits because their ACID guarantees are costly to uphold at scale 😞 As rough single-instance rules of thumb:

  • 1,000 writes/sec - Use more than one instance (sharding)
  • 10,000 reads/sec - Use read replicas
  • 1 TB (1000 GB) - Use partitioning
  • 100 million records - Standard B-tree indexing not as effective

Solution: Add replicas, shard, or use a different database
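
As one sketch of the "add replicas" option: route writes to the primary and spread reads across replicas. The connection strings are made up, and the SELECT-prefix check is a deliberate simplification:

```python
import random

PRIMARY = "db-primary:5432"                            # illustrative
REPLICAS = ["db-replica-1:5432", "db-replica-2:5432"]

def route(statement: str) -> str:
    """Writes go to the primary; reads spread over replicas.
    Caveat: replicas lag the primary, so read-your-own-writes
    flows may need to be pinned to the primary."""
    if statement.lstrip().upper().startswith("SELECT"):
        return random.choice(REPLICAS)
    return PRIMARY

print(route("SELECT * FROM users WHERE id = 7"))   # a replica
print(route("INSERT INTO users VALUES (42)"))      # the primary
```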

NoSQL Databases

  • Many NoSQL databases scale horizontally and efficiently by default 🥳
  • They trade ACID guarantees and rich indexing for effectively unlimited scalability, given the right schema design
  • E.g. Cassandra, MongoDB, DynamoDB

Networks

Problem: At scale, network latency and throughput can become a bottleneck 😱

  • 50 ms - Good latency over the internet
  • 10 ms - Good latency within a data center
  • 1 Gbps - Start thinking about multiple servers (or just network interfaces)

Solution: Add more servers and route network traffic, use CDNs/edge caching

Single Hosts

Problem: Single hosts have limitations on CPU, memory, and disk I/O 😡

  • 100 GB working set of data in memory - Consider sharding or partitioning
  • 1 GB data to cache - Consider using a distributed cache
  • 10,000 requests/sec - Consider adding more servers

Solution: Partition for write-heavy workloads; for read-heavy workloads, cache data upstream to reduce the requests that reach the host

Extending Client-Server Architecture

As systems grow more complex, we need to extend beyond basic client-server models:

Service-Oriented Architecture (SOA)

Problem: My monolithic architecture is hard to maintain and scale 😖

Solution: Break down the monolith into smaller, more manageable services. Each service is responsible for a specific task and can be scaled independently.


API Gateway

Problem: I have multiple heterogeneous services, and clients need a way to reach them all 😩

Solution: Use an API Gateway to route requests to the appropriate service

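A minimal sketch of the routing step with made-up service URLs; real gateways (e.g. Kong, AWS API Gateway) layer authentication, rate limiting, and retries on top of this lookup:

```python
ROUTES = {
    "/users":  "http://user-service:8001",    # illustrative upstreams
    "/orders": "http://order-service:8002",
    "/search": "http://search-service:8003",
}

def upstream_for(path: str) -> str:
    """Match the longest route prefix and forward the request there,
    so clients see one uniform entry point."""
    for prefix in sorted(ROUTES, key=len, reverse=True):
        if path.startswith(prefix):
            return ROUTES[prefix]
    raise LookupError(f"no upstream for {path}")

print(upstream_for("/orders/123"))   # http://order-service:8002
```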

Availability/Reliability Background

Understanding reliability is critical for building robust systems:

  • Availability: System is operational and accessible
  • Reliability: System consistently performs as expected over time

Why do we need a reliable system?

  • 99.999% uptime = 5.26 minutes of downtime per year
  • With 100 services, each with 99.999% uptime, the system could be down for 8.76 hours per year (assuming every service is a single point of failure)

To put it into perspective, AWS EC2's Service Level Agreement is 99.99% uptime, which allows for 52.56 minutes of downtime per year.

So building on top of cloud services still requires going the extra mile to ensure reliability.
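
The downtime figures above follow directly from the availability percentages; a quick sanity check:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60   # 525,960

def downtime_minutes(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR

print(downtime_minutes(0.99999))              # ~5.26 min: five nines
print(downtime_minutes(0.9999))               # ~52.6 min: the EC2 SLA
# 100 chained services, each five nines, all single points of failure:
print(downtime_minutes(0.99999 ** 100) / 60)  # ~8.76 hours
```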

Reliability Patterns

Here are key patterns to improve system reliability:

1. Replication

Problem: I have a stateful, purpose-built service that needs to be highly available 😅

Solution: Primary forwards writes to replicas, which can take over if the primary fails

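A toy sketch of synchronous primary-to-replica forwarding; real systems add replication logs, acknowledgements, and leader election (e.g. Raft), all omitted here:

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Primary:
    """Applies each write locally, then forwards it to every replica
    so a replica can be promoted if the primary fails."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        for replica in self.replicas:
            replica.apply(key, value)

replicas = [Replica(), Replica()]
primary = Primary(replicas)
primary.write("user:1", {"name": "Ada"})
# If the primary dies, the replicas still hold user:1.
print(replicas[0].data)
```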

2. Circuit Breaker

A circuit breaker helps isolate failures to prevent them from cascading through your system:

  • Prevents system failures from cascading
  • Monitors and isolates failing components

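A minimal sketch of the closed/open/half-open state machine; the thresholds are arbitrary defaults, and production systems usually reach for an existing library rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """After max_failures consecutive errors the circuit opens and
    calls fail fast, sparing the struggling dependency. After
    reset_after seconds, one trial call is allowed through
    (the half-open state); success closes the circuit again."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()   # trip the breaker
            raise
        self.failures = 0                  # success closes the circuit
        return result
```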

3. Graceful Degradation

Design systems to maintain (at least) partial functionality during failures:

  • If a service is down, show cached data
  • If a service is overloaded, time out gracefully and retry later

Limit hard dependencies and keep services decoupled to enable this resilience.
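
A minimal sketch of the cached-fallback idea, with a hypothetical flaky_service standing in for the real dependency:

```python
def get_recommendations(user_id, fetch_fresh, cache):
    """Prefer fresh results, but fall back to stale cached data
    (or a safe default) rather than failing the whole page."""
    try:
        fresh = fetch_fresh(user_id)   # should enforce its own timeout
        cache[user_id] = fresh
        return fresh
    except Exception:
        # Degraded mode: stale data beats an error page.
        return cache.get(user_id, [])

cache = {"u1": ["cached-item"]}

def flaky_service(user_id):
    raise TimeoutError("recommendation service is overloaded")

print(get_recommendations("u1", flaky_service, cache))   # ['cached-item']
```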