Wikimedia Cloud Services team/EnhancementProposals/Production Readiness Checklist

From Wikitech

A list of items that we should think about before deploying new services/components/changes in production.

This is a work in progress; feel free to contribute.


Support Lifecycle

Service Level

Document the support lifecycle of the solution (roadmap, timeline, expected SLA/SLO, etc.).

Infrastructure components

Document all components that make up the solution and their supported lifecycle for improvements, bug fixes, security fixes, etc.

Note whether we are using LTS versions, how long they are supported, and by whom.

Resource Limits

All resources consumed by users must have an upper limit to avoid unbounded resource usage.

  • CPU/memory/disk usage should have clear limits per user/application
  • API endpoints should be rate limited
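As an illustration of the rate-limiting item above, here is a minimal sketch of a per-client token bucket. It is not the mechanism of any particular WMCS service; the names `TokenBucket`, `RateLimiter`, `rate`, `capacity`, and `client_id` are all hypothetical, and a real deployment would more likely rely on the limits offered by its API gateway or proxy.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second (assumed policy value)
        self.capacity = capacity    # maximum burst size (assumed policy value)
        self.clock = clock          # injectable so the behaviour can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and consume `cost` tokens if the request fits the budget."""
        now = self.clock()
        # Refill according to elapsed time, never exceeding the burst capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """Keep one bucket per client so a single user cannot exhaust the shared budget."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self._buckets = {}

    def allow(self, client_id):
        if client_id not in self._buckets:
            self._buckets[client_id] = TokenBucket(self.rate, self.capacity, self.clock)
        return self._buckets[client_id].allow()
```

A request handler would call `limiter.allow(client_id)` before doing any work and reject the request (for HTTP, typically with status 429) when it returns False. Note that this sketch keeps state in process memory and never evicts idle clients, so it only bounds request rate, not its own memory use.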

Performance Tests

It's necessary to understand what levels of performance the solution can provide.

User Facing

Benchmark and document critical user-facing points in the solution to understand system behavior globally.
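One possible way to produce numbers worth documenting is to time a user-facing operation repeatedly and record latency percentiles rather than only an average. The sketch below assumes the operation can be wrapped in a Python callable; `benchmark`, `summarize`, and `percentile` are hypothetical helper names, and the nearest-rank percentile is just one of several common definitions.

```python
import statistics
import time


def percentile(ordered, pct):
    """Nearest-rank percentile of an already sorted, non-empty list (pct is 1..100)."""
    if not ordered:
        raise ValueError("no samples")
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def summarize(samples):
    """Reduce raw latency samples (in seconds) to the figures worth documenting."""
    ordered = sorted(samples)
    return {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
    }


def benchmark(operation, iterations=100, clock=time.perf_counter):
    """Call `operation` repeatedly and summarize how long each call took."""
    samples = []
    for _ in range(iterations):
        start = clock()
        operation()
        samples.append(clock() - start)
    return summarize(samples)
```

For example, `benchmark(lambda: fetch_home_page(), iterations=500)` would summarize 500 sequential calls, where `fetch_home_page` stands in for whatever user-facing request is being measured. This measures sequential latency only; behavior under concurrent load needs a dedicated load-testing tool.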

Building Blocks

Benchmark and document low-level building blocks (servers, memory throughput, network-attached storage, local disks, network, etc.).

Documentation

User Documentation

  • Documentation for most common use cases
  • Frequently Asked Questions page
  • Contact page

Admin Documentation

  • Infrastructure diagrams
  • High-level overview of data flow, request timeline, etc
  • Runbooks
    • How to deploy from scratch
    • How to deploy new changes
    • How to restart
    • Creating/deleting/changing resources (users, apps, projects, or any other manageable object)
  • Troubleshooting steps

Monitoring

Servers

Critical Service Components

Blackbox Monitoring

Whitebox Monitoring

Backups

See also

External Resources

  • [[[List here other companies' production readiness checklists for comparison. ]]]