Production readiness checklist
What production readiness means
Most production incidents are not caused by exotic bugs. They are caused by the ordinary things that were never set up: no alert fired, no runbook existed, the rollback was untested, the backup had never been restored. A readiness review is the cheap, boring step that catches those gaps while they are still easy to close.
Run it as a checklist, not a vibe. Walk each dimension below, answer honestly, and treat every no as either work to do before launch or a risk you are accepting on purpose and writing down.
The checklist
Reliability & resilience - can the service take a hit and keep serving?
- Health and readiness checks exist and the orchestrator uses them to route traffic.
- Every outbound call has a timeout - no unbounded waits on a dependency.
- Retries use backoff and a cap, and only apply to idempotent operations.
- The service degrades gracefully when a non-critical dependency is down.
- Nothing that matters runs as a single instance with no replica behind it.
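The retry item above can be sketched in a few lines. This is a minimal illustration, not a recommended library: the function name call_with_retries and its default values are assumptions made for the example, and the wrapped operation is expected to carry its own timeout.

```python
import random
import time


def call_with_retries(operation, *, idempotent, max_attempts=4,
                      base_delay=0.2, max_delay=5.0, sleep=time.sleep):
    """Run operation(); retry failures only when the operation is idempotent.

    All names and defaults here are hypothetical. The operation should
    enforce its own timeout, for example
    urllib.request.urlopen(url, timeout=2.0).
    """
    # Non-idempotent operations get exactly one attempt.
    attempts = max_attempts if idempotent else 1
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # retries are capped: surface the last error
            # Exponential backoff, capped at max_delay, with full jitter
            # so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

In real code you would usually catch only the exception types that signal a transient failure (timeouts, connection resets) rather than every Exception.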
Observability - can you tell what it is doing without SSH-ing into a box?
- p95 latency and error-rate SLOs are defined and dashboarded.
- Logs are structured, centralized, and carry a request or trace ID.
- Requests are traced across service boundaries, not just within one service.
- Alerts fire on user-facing symptoms (errors, latency), not just on CPU.
- Every alert links to a runbook - no alert without an owner and a next step.
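As one way to picture "structured and carrying a request ID", here is a small sketch using only Python's standard logging module. The field names (level, logger, message, request_id) and the helper names are assumptions for illustration; a real setup would follow whatever schema your log pipeline expects.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via logger.info(..., extra={"request_id": ...}).
            "request_id": getattr(record, "request_id", None),
        })


def make_logger(name, stream=None):
    """Build a logger that writes JSON lines (stderr when stream is None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def new_request_id():
    """Hypothetical helper: generate an ID when the caller did not send one."""
    return uuid.uuid4().hex
```

A handler would call log.info("charge accepted", extra={"request_id": rid}) and pass the same rid on to downstream services, so one ID ties the whole request together.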
CI/CD & deploys - is shipping safe, repeatable, and reversible?
- Deploys are fully automated from a pipeline - no manual steps on a laptop.
- The pipeline is gated: tests, linting, and security checks must pass to ship.
- Rollback is one command or one click and has actually been tested.
- Database migrations are backward-compatible so a rollback does not corrupt data.
- Build artifacts are immutable and pinned by digest, not a floating tag.
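The "pinned by digest" item is easy to check mechanically. The sketch below assumes container image references and sha256 digests; the function name and the example registry are hypothetical, and the pattern is deliberately loose about the repository part.

```python
import re

# An image reference counts as pinned when it ends in "@sha256:" followed
# by 64 lowercase hex characters. A tag may also be present; the digest
# is what makes the reference immutable.
DIGEST_PINNED = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    """Return True if image_ref is pinned by digest, not a floating tag."""
    return DIGEST_PINNED.match(image_ref) is not None
```

A pipeline step could run this over every image reference in the deploy manifests and fail the build on the first floating tag such as registry.example.com/app:latest.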
Security & access - is the blast radius of a mistake or a breach contained?
- Secrets live in a secrets manager, never in code, images, or environment files in the repo.
- Service and human access follow least privilege - no shared admin credentials.
- Dependencies are scanned for known vulnerabilities in the pipeline.
- Traffic is encrypted in transit, and sensitive data is encrypted at rest.
- There is a clear path to rotate a leaked credential quickly.
Scalability & capacity - will it hold under the load you actually expect?
- The service has been load tested to a realistic multiple of peak traffic.
- Autoscaling is configured with sane minimums, maximums, and triggers.
- Known limits (connection pools, rate limits, quotas) are documented.
- No shared resource, such as a database or a queue, saturates before the service itself does at the expected load.
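Two of the checks above reduce to simple arithmetic: how many instances a load-test target implies, and whether the autoscaling maximum can exhaust a shared database's connection limit. This is a rough sketch; the function names and every number in the usage note are made-up assumptions, and per-instance throughput should come from your own load test.

```python
import math


def instances_needed(peak_rps, safety_multiple, rps_per_instance):
    """Instances required to serve peak traffic times a safety multiple."""
    return math.ceil(peak_rps * safety_multiple / rps_per_instance)


def pool_fits(max_instances, pool_size_per_instance, db_max_connections):
    """True if every instance can fill its connection pool at the
    autoscaling maximum without exceeding the database's limit."""
    return max_instances * pool_size_per_instance <= db_max_connections
```

For example, with an assumed peak of 500 requests per second, a 3x target, and 200 requests per second per instance, instances_needed(500, 3, 200) gives 8. If each instance holds a pool of 20 connections, those 8 instances open up to 160, which fits a 200-connection database; an autoscaling maximum of 12 would not.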
Incident response - when it breaks, does the team know what to do?
- There is an on-call rotation with a clear, reachable owner at all times.
- Runbooks exist for the most likely failures and are linked from the alerts.
- Severity levels and an escalation path are defined and known by the team.
- Incidents get a blameless postmortem, and the action items actually get done.
Data & backups - if the worst happens, can you get the data back?
- Backups run automatically on a schedule that matches the data's importance.
- A restore has actually been performed - a backup you have never restored is a guess.
- Retention and deletion policies are defined and meet any compliance needs.
- There is a written disaster-recovery plan with target recovery time and recovery point.
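A recovery point target only means something if the newest backup is actually younger than it. A minimal sketch of that check, assuming you can obtain the timestamp of the last successful backup (the function name and parameters are hypothetical):

```python
from datetime import datetime, timedelta, timezone


def backup_meets_rpo(last_backup_at, rpo, now=None):
    """True if the newest backup is no older than the recovery point target.

    last_backup_at: timezone-aware datetime of the last successful backup.
    rpo: timedelta, the maximum amount of data loss you have agreed to.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_backup_at <= rpo
```

Wired into monitoring, a False result is a user-impacting alert in waiting: the backup schedule has drifted past what the disaster-recovery plan promises. It says nothing about whether the backup restores, which still needs a real restore test.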
How to use this checklist
Do not treat this as a wall to clear in one sitting. Walk it once early, while there is still time to fix what it surfaces, and again just before launch. For each item, the answer is one of three: "yes"; "not yet, and here is the plan"; or "no, and we are accepting that risk because…". All three are fine. The only bad answer is a gap nobody named.
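If you track the review in code or a script rather than a document, the three acceptable answers map onto a small data structure. The sketch below is one hypothetical shape (the names Answer, Item, and unnamed_gaps are invented for this example); it flags exactly the bad case, an item with no answer or with a "not yet" or "no" that has no plan or reason written down.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Answer(Enum):
    YES = "yes"
    NOT_YET = "not yet, and here is the plan"
    ACCEPTED_RISK = "no, and we are accepting that risk"


@dataclass
class Item:
    text: str
    answer: Optional[Answer] = None  # None means nobody has answered yet
    note: str = ""                   # the plan, or the reason for the risk


def unnamed_gaps(items):
    """Return the items that are unanswered, or answered without a note."""
    gaps = []
    for item in items:
        if item.answer is None:
            gaps.append(item)
        elif item.answer is not Answer.YES and not item.note.strip():
            gaps.append(item)
    return gaps
```

A review is ready to sign off when unnamed_gaps returns an empty list, even if several items are still "not yet" or accepted risks.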
Scale the rigor to the blast radius. A customer-facing payment service earns every check; an internal dashboard does not. The checklist's job is to make those trade-offs deliberate and written down, not to impose the same bar on everything.
The one-line takeaway
Want this checked against your real setup?
We run a fixed-scope infrastructure audit against this exact checklist - reliability, observability, deploys, security, capacity, incidents, and backups - and hand you a prioritized list of what to fix and how. No retainer required to start.