System Design Lessons from Scaling at Amazon

2026-04-08 · 5 min read · Manish KC

Operating at scale doesn't teach you new principles. It teaches you which ones actually matter.

A lot of what I learned at Amazon is stuff you'll find in any distributed systems book. The difference is that at scale, you stop being able to treat it as theory. The system tells you when you're wrong, and it tells you loudly.

Here are four things I keep coming back to.

Design for failure you haven't seen yet.

At scale, the question isn't whether a dependency will fail. It's when, and what happens to everything downstream when it does. The teams that handle incidents well aren't the ones with the fewest failures. They're the ones who assumed failure was coming and designed around it.

Every external call gets a timeout. Every queue gets a dead-letter path. Every downstream service is treated as untrusted until you have reason to think otherwise. This isn't pessimism. It's just experience.
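
Here's a minimal sketch of what that looks like in practice. Everything in it is illustrative: the `requests` call, the endpoint, and the in-memory `dead_letter` list are stand-ins for whatever HTTP client and queue system you actually run.

```python
import requests

FALLBACK_RECOMMENDATIONS: list = []  # degrade gracefully rather than cascade
dead_letter: list = []               # stand-in for a real dead-letter queue


def fetch_recommendations(user_id: str) -> list:
    """Call a downstream service with an explicit timeout and a defined fallback."""
    try:
        resp = requests.get(
            f"https://recs.internal/users/{user_id}",  # hypothetical endpoint
            timeout=(0.5, 2.0),                        # connect timeout, read timeout
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # The dependency is treated as untrusted: failure is expected, not exceptional.
        return FALLBACK_RECOMMENDATIONS


def process(message: dict) -> None:
    """Placeholder for the real work a consumer would do with a message."""
    ...


def handle_message(message: dict, max_attempts: int = 3) -> None:
    """Try to process a queue message; after repeated failures, park it on the dead-letter path."""
    for _ in range(max_attempts):
        try:
            process(message)
            return
        except Exception:
            continue
    dead_letter.append(message)  # keep failed work visible instead of dropping it
```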

The mistake is building for the happy path and adding resilience later. It's always more expensive that way.

You cannot reason about a system you cannot see.

Most teams underinvest in observability until something breaks in production and they can't explain why. By then the cost is real. Not just the incident itself, but the time spent fumbling in the dark trying to understand a system nobody fully mapped.

Structured logs you can query. Metrics with enough granularity to isolate a problem. Traces that let you follow a request across service boundaries. If you can't answer "what is this system doing right now, and why" within five minutes of a page, that's the gap to close first.

One more thing: instrument before you optimize. If you don't know where your latency is coming from, you'll fix the wrong thing.
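
To make "structured logs you can query" concrete, here's a small sketch using only Python's standard library. The `traced` helper, the field names, and the service name are illustrative, not a prescription; the point is that every log line is a record you can filter and aggregate, and that latency gets measured before anyone argues about what to optimize.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")  # hypothetical service name


@contextmanager
def traced(operation: str, **fields):
    """Emit one structured, queryable log line per operation, including its duration."""
    request_id = fields.pop("request_id", str(uuid.uuid4()))
    start = time.monotonic()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise
    finally:
        log.info(json.dumps({
            "operation": operation,
            "request_id": request_id,  # lets you follow one request across services
            "duration_ms": round((time.monotonic() - start) * 1000, 1),
            "outcome": outcome,
            **fields,
        }))


# Usage: each field is a key you can filter on later, not a sentence you have to grep.
with traced("load_cart", user_id="u-123"):
    pass  # real work goes here
```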

Consistency is a choice, not a default.

Strong consistency is something you opt into, not something every piece of data deserves. Your shopping cart can tolerate eventual consistency. Your payment ledger cannot. Treating all data the same, usually with the strictest guarantees, is how you end up with a slow and fragile system that's hard to change.

A useful exercise: go through your data model and ask, for each entity, what is the worst thing that happens if this is briefly stale? The answers are usually more permissive than you expect, and they open up a lot of design space.
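
One way to run that exercise is to write the answers down next to the data model itself. A small sketch, with hypothetical entities; the classifications here are illustrative, not a rule:

```python
from enum import Enum


class Consistency(Enum):
    STRONG = "strong"      # reads must reflect the latest committed write
    EVENTUAL = "eventual"  # briefly stale reads are acceptable


# Hypothetical entities for an e-commerce-style system. The point is the
# question you ask per entity, not these particular answers.
DATA_MODEL = {
    "payment_ledger":  (Consistency.STRONG,   "double charge or lost payment"),
    "inventory_count": (Consistency.STRONG,   "oversell a physical item"),
    "shopping_cart":   (Consistency.EVENTUAL, "user re-adds an item"),
    "product_reviews": (Consistency.EVENTUAL, "review appears a minute late"),
}

for entity, (level, worst_case) in DATA_MODEL.items():
    print(f"{entity:16} {level.value:9} worst case if stale: {worst_case}")
```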

Complexity is a liability.

The most impressive systems I've worked with are boring. Well-understood primitives. Clear boundaries. Operated by people who can reason about them without reading a wiki first.

The temptation, especially for strong engineers, is to reach for sophisticated solutions. Microservices when a monolith would do. Event sourcing when a table would do. Distributed caching when the query is already fast. Each addition seems justified in isolation. Together they compound into something nobody fully understands.

Before introducing anything new, the question I ask is simple: what problem does this solve that I can't solve with what I already have? If the answer isn't crisp, it's not ready.


These are the kinds of questions that come up in almost every architecture conversation I have with clients. If you're working through a design decision and want a second opinion, get in touch.