Wikimedia Cloud Services team/EnhancementProposals/Production Readiness Checklist

A list of items that we should think about before deploying new services/components/changes in production.

This is a work in progress; feel free to contribute.

Support Lifecycle

Document the support lifecycle of the solution (roadmap, timeline, expected SLA/SLO, etc.).

Document all components that make up the solution and their supported lifecycle for improvements, bug fixes, security fixes, etc.

Document whether we're using LTS versions, how long they're supported, by whom, etc.

All resources consumed by users must have an upper limit to avoid unbounded resource usage.
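
As a rough illustration of the rule above, the sketch below refuses any request that has no configured limit or that would exceed it. The `Quota` class, the `QuotaExceeded` exception, and the resource names are hypothetical and not part of any existing Cloud Services tooling; real limits would come from project policy.

```python
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    """Raised when a request would push usage past its configured limit."""


@dataclass
class Quota:
    # Hypothetical per-user limits, e.g. {"cpu": 4, "ram_gb": 8}.
    limits: dict[str, int]
    usage: dict[str, int] = field(default_factory=dict)

    def reserve(self, resource: str, amount: int) -> int:
        """Reserve `amount` units of `resource` and return the new total."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        if resource not in self.limits:
            # A missing limit would mean unbounded usage, so refuse by default.
            raise QuotaExceeded(f"no limit configured for {resource!r}")
        new_total = self.usage.get(resource, 0) + amount
        if new_total > self.limits[resource]:
            raise QuotaExceeded(
                f"{resource!r}: {new_total} would exceed limit {self.limits[resource]}"
            )
        self.usage[resource] = new_total
        return new_total
```

The point of the design is the default: a resource without an explicit upper limit is rejected rather than allowed, so new resource types cannot silently become unbounded.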

It's necessary to understand what levels of performance the solution can provide.

Benchmark and document critical user-facing points in the solution to understand system behavior globally.

Benchmark and document low-level building blocks (servers, memory throughput, network-attached storage, local disks, network, etc.).
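
One possible way to collect numbers worth documenting is to time an operation repeatedly and record latency percentiles instead of a single average. The sketch below is a minimal example using only the Python standard library; the `benchmark` helper and the choice of percentiles are assumptions for illustration, and a real benchmark of storage or network would more likely use dedicated tools.

```python
import statistics
import time
from typing import Callable


def benchmark(operation: Callable[[], object], runs: int = 100) -> dict[str, float]:
    """Time `operation` `runs` times and summarize the latency in seconds."""
    if runs < 2:
        raise ValueError("need at least 2 runs to compute percentiles")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    # n=100 yields 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    # method="inclusive" interpolates between observed samples only.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "min": min(samples),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples),
    }
```

For example, `benchmark(lambda: sum(range(100_000)), runs=200)` returns a dictionary that can be pasted into the documentation alongside the hardware and software versions it was measured on.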

Infrastructure diagrams
High-level overview of data flow, request timeline, etc.
Runbooks
- How to deploy from scratch
- How to deploy new changes
- How to restart
- How to create, delete, and change resources (users, apps, projects, or any other manageable object)
Troubleshooting steps

[[[List here other companies' production readiness checklists for comparison. ]]]