Wikimedia Cloud Services team/EnhancementProposals/Production Readiness Checklist
This page is currently a draft.
More information and discussion about changes to this draft on the talk page.
A list of items that we should think about before deploying new services/components/changes in production.
This is a work in progress; feel free to contribute.
Document the support lifecycle of the solution (roadmap, timeline, expected SLA/SLO, etc.).
Document all components that make up the solution and their supported lifecycle for improvements, bug fixes, security fixes, etc.
State whether we are using LTS versions, how long they are supported, and by whom.
All resources consumed by users must have an upper limit to avoid unbounded resource usage.
- CPU/memory/disk usage should have clear limits per user/application
- API endpoints should be rate limited
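One common way to bound API usage per user is a token bucket. The following is a minimal sketch, not a description of any existing Cloud Services component: the class names `TokenBucket` and `PerUserLimiter`, the parameters `rate` and `capacity`, and the in-memory dictionary are all assumptions made for illustration. A real deployment would more likely enforce limits at the proxy or API gateway layer.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: allows bursts up to `capacity`,
    refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # upper limit on stored tokens (burst size)
        self.tokens = capacity    # start with a full bucket
        self.updated = now        # timestamp of the last refill

    def allow(self, now, cost=1.0):
        # Refill according to the elapsed time, never above capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerUserLimiter:
    """Hypothetical wrapper keeping one bucket per user (in memory only)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, user, now=None):
        # `now` can be injected for testing; defaults to a monotonic clock.
        now = time.monotonic() if now is None else now
        bucket = self.buckets.get(user)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[user] = bucket
        return bucket.allow(now)
```

With `PerUserLimiter(rate=1.0, capacity=2)`, each user may send a burst of two requests and then roughly one request per second; requests beyond that are rejected until the bucket refills, and one user's traffic does not consume another user's allowance.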
It's necessary to understand what levels of performance the solution can provide.
Benchmark and document critical user-facing points in the solution to understand system behavior globally.
Benchmark and document low-level building blocks (servers, memory throughput, network-attached storage, local disks, network, etc.).
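As an illustration of what a documented benchmark could record, here is a small sketch that times an operation repeatedly and summarizes latency percentiles. The function names `benchmark` and `summarize`, the choice of percentiles, and the millisecond unit are assumptions for this example; dedicated tools are usually a better fit for storage or network benchmarks.

```python
import statistics
import time


def summarize(samples_ms):
    """Summarize latency samples (milliseconds); needs at least two samples."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples_ms),
    }


def benchmark(operation, iterations=100):
    """Call `operation` repeatedly and return a latency summary."""
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        operation()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return summarize(samples_ms)
```

Recording tail percentiles (p95, p99) alongside the median makes it easier to compare runs over time, since averages tend to hide the slow requests that users notice.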
- Documentation for the most common use cases
- Frequently Asked Questions (FAQ) page
- Contact page
- Infrastructure diagrams
- High-level overview of data flow, request timeline, etc.
- How to deploy from scratch
- How to deploy new changes
- How to restart
- Creating/deleting/changing resources (users, apps, projects, or any other manageable object)
- Troubleshooting steps
Critical Service Components
- Portal:Cloud VPS/Admin/Deployment confidence checklist
- How to deploy code
- mw:Manual:Pre-commit checklist
- mw:Review queue#Checklist/Process
- mw:Best practices for extensions
- mw:API:Client code/Gold standard
- [[[List here other companies' production readiness checklists for comparison.]]]