Production readiness checklist.

Published 13 Jun 2026 · Updated 17 Jul 2026

A production readiness checklist is the set of questions you answer before a service ships - is it reliable, can we see when it breaks, and can we recover it? This is a practical, vendor-neutral version you can work through to find the gaps before your users do.

The idea

What production readiness means

"Ready for production" is not the same as "the feature works." It means the service can be operated - observed, scaled, and recovered - by an on-call engineer who did not write it.

Most incidents come from boring gaps, not exotic bugs

Most production incidents are not caused by exotic bugs. They are caused by the ordinary things that were never set up: no alert fired, no runbook existed, the rollback was untested, the backup had never been restored. A readiness review is the cheap, boring step that catches those gaps while they are still easy to close.

Run it as a checklist, not a vibe

Walk each dimension below, answer honestly, and treat every no as either work to do before launch or a risk you are accepting on purpose and writing down.

The checklist

Seven dimensions, each with a handful of concrete checks. They are deliberately specific - a check you can verify, not a principle you can nod along to. Scan the whole standard first, then work through it item by item to score your own service.

The seven production readiness dimensions, what "ready" means for each, and the check that fails most often.
Dimension	What "ready" means	Check that fails most often
Reliability & resilience	Every dependency call is bounded and one failure degrades the service instead of cascading.	A timeout on every outbound call - unbounded waits are the usual root cause.
Observability	You can see error rate, latency, and traces without SSH-ing into a box.	Every alert linking to a runbook - an alert with no next step gets muted.
CI/CD & deploys	Shipping is automated, gated on tests, and reversible in one step.	A rollback that has actually been run, not just assumed to work.
Security & access	Secrets are managed, access is least-privilege, and a leak can be contained.	A rehearsed path to rotate a leaked credential quickly.
Scalability & capacity	Load tested to a realistic multiple of peak, with autoscaling inside known limits.	A real load test - capacity is usually assumed, not measured.
Incident response	A reachable owner, runbooks, and a known escalation path when it breaks.	Runbooks linked from the alerts for the most likely failures.
Data & backups	Backups run on a schedule and a restore has actually been performed.	A tested restore - a backup you have never restored is a guess.
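
The reliability row's most-failed check, a timeout on every outbound call, can be sketched as follows. This is a minimal illustration using Python's standard library, not a prescribed implementation: the function name `fetch_or_degrade`, the default of 2 seconds, and the `opener` parameter (there so the call can be exercised without a network) are all assumptions made for the example.

```python
import urllib.request


def fetch_or_degrade(url, timeout_s=2.0, fallback=b"", opener=urllib.request.urlopen):
    """Fetch a URL with a bounded wait; return `fallback` instead of raising.

    Hypothetical helper: the name, defaults, and fallback policy are
    illustrative assumptions, not part of any standard.
    """
    try:
        # The timeout bounds how long a slow dependency can hold this call.
        with opener(url, timeout=timeout_s) as response:
            return response.read()
    except OSError:
        # URLError, HTTPError, and TimeoutError are all OSError subclasses,
        # so one failing dependency degrades this call instead of cascading.
        return fallback
```

Returning a fallback is one way to degrade; whether a cached value, an empty result, or an explicit error is the right degraded answer depends on the service.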

In practice

How to use this checklist

Answer every item one of three ways

Do not treat this as a wall to clear in one sitting. Walk it once early, while there is still time to fix what it surfaces, and again just before launch. For each item, the answer is one of three: "yes"; "not yet, and here is the plan"; or "no, and we are accepting that risk because…". All three are fine. The only bad answer is a gap nobody named.
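
One way to keep those three answers honest is to record them as data and flag anything left unnamed. The sketch below is illustrative only: the `Answer` and `Item` types, the `unnamed_gaps` function, and the sample checks are assumptions made for the example, not part of the checklist itself.

```python
from dataclasses import dataclass
from enum import Enum


class Answer(Enum):
    YES = "yes"
    NOT_YET = "not yet"        # must come with a plan
    ACCEPTED_RISK = "no"       # must come with a written reason
    UNANSWERED = "unanswered"  # the only bad state: a gap nobody named


@dataclass
class Item:
    check: str
    answer: Answer = Answer.UNANSWERED
    note: str = ""  # the plan or the reason, in the team's own words


def unnamed_gaps(items):
    """Return checks that are unanswered, or answered without a plan or reason."""
    gaps = []
    for item in items:
        needs_note = item.answer in (Answer.NOT_YET, Answer.ACCEPTED_RISK)
        if item.answer is Answer.UNANSWERED or (needs_note and not item.note.strip()):
            gaps.append(item.check)
    return gaps


review = [
    Item("Timeout on every outbound call", Answer.YES),
    Item("Rollback has been run", Answer.NOT_YET, "Rehearse in staging before launch"),
    Item("Restore has been tested", Answer.ACCEPTED_RISK),  # no reason written down
    Item("Alerts link to runbooks"),                        # never answered
]
print(unnamed_gaps(review))
# → ['Restore has been tested', 'Alerts link to runbooks']
```

An accepted risk with no written reason is treated the same as an unanswered item here, which matches the rule that deliberate gaps are fine and silent ones are not.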

Scale the rigor to the blast radius

A customer-facing payment service earns every check; an internal dashboard does not. The checklist's job is to make those trade-offs deliberate and written down, not to impose the same bar on everything.

The one-line takeaway

A service is production-ready when an on-call engineer who did not build it can tell that it broke, find out why, and fix it - without calling the author at 2 a.m.

FAQ

Common questions

A production readiness review (PRR) is a structured check, run before a service ships, that confirms it is reliable, observable, secure, and operable. It walks a fixed checklist across dimensions like resilience, monitoring, deploys, security, capacity, and incident response, and surfaces the gaps that would otherwise turn into a 2 a.m. page. The goal is not to block the launch but to make sure the team knows what they are signing up to operate.

A good checklist covers reliability and resilience (health checks, timeouts, retries, graceful degradation), observability (metrics, logs, traces, and alerts that link to runbooks), CI/CD and deploys (automated, gated, and reversible), security and access (managed secrets, least privilege, dependency scanning), scalability and capacity (load tested, autoscaling, known limits), incident response (on-call, runbooks, postmortems), and data and backups (tested restores, retention, a disaster-recovery plan). Each item should be concrete and verifiable, not aspirational.
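
As an example of a concrete, verifiable item, "alerts that link to runbooks" can be checked mechanically. The sketch below assumes alert definitions are available as dictionaries with hypothetical `name` and `runbook_url` fields; real alerting tools store this differently, so treat the field names and the example URL as placeholders.

```python
def alerts_missing_runbook(alerts):
    """Return the names of alerts whose 'runbook_url' is absent or blank.

    Assumes each alert is a dict with a 'name' key; 'runbook_url' is a
    hypothetical field name chosen for this example.
    """
    return [
        alert["name"]
        for alert in alerts
        if not str(alert.get("runbook_url") or "").strip()
    ]


alerts = [
    {"name": "HighErrorRate", "runbook_url": "https://runbooks.example.com/high-error-rate"},
    {"name": "DiskAlmostFull", "runbook_url": ""},
    {"name": "QueueBacklog"},
]
print(alerts_missing_runbook(alerts))
# → ['DiskAlmostFull', 'QueueBacklog']
```

Run as a CI step, a check like this turns the item from something a reviewer has to remember into something that fails visibly.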

Typically the team that owns the service runs the review, often with a reviewer from a platform, SRE, or infrastructure group who has operated similar systems. The owning team fills in the checklist; the reviewer pushes on the answers and flags risks the team is too close to see. On smaller teams a single senior engineer or an external partner can play the reviewer role - the value is in a second set of eyes against a fixed standard.

A launch checklist is usually about the release event - feature flags, marketing, support coverage, the go/no-go call. Production readiness is about everything after the launch: whether the service can be observed, scaled, recovered, and operated for months without heroics. They overlap, but readiness is the durable one. A feature can launch and still fail readiness if nobody can tell when it breaks or restore it when it does.

Not every service needs to pass every item. The checklist is a prompt, not a gate to clear in full. A low-traffic internal tool does not need the same resilience or capacity rigor as a customer-facing payment path. Use the checklist to make the trade-offs explicit: decide which items matter for this service's blast radius and which you are consciously deferring, and write down why. Silent gaps are the problem, not deliberate ones.

