
User:BTullis (WMF)/SRE


Site Reliability Engineering (SRE)

The SRE team is responsible for developing and maintaining Wikimedia's production infrastructure. Previously known as Technical Operations, the team makes sure that all of Wikimedia's sites and services used by the general public (including MediaWiki and all associated services) run reliably, securely, and with high performance.

  • If you need help from SRE and it is an emergency, you can page us via https://klaxon.wikimedia.org.
  • If it is not an emergency, but you do not know which team is responsible for your question, just open a generic task on Phabricator in the SRE project and our Clinic Duty engineer of the week will route it.
  • If it is more urgent, or just a quick check, you can find us on IRC: #wikimedia-sre.

The Foundation has a number of sub-teams within SRE, each responsible for different areas:

SRE Data Center Operations

SRE Data Center Operations - all things related to Data Centers, hardware maintenance and purchases.

The Data Center Operations team is responsible for all of Wikimedia’s data center deployments and logistics as well as maintaining our presence in locations across the world. They perform on-site work and maintain the full 5-year life cycle (specs, purchasing, physical install, break/fix and decommissioning) for all hardware.

#wikimedia-dcops

SRE Data Persistence

SRE Data Persistence - Databases, Backups and Object storage (MariaDB, Bacula, Swift).

The Data Persistence team focuses on Wikimedia’s persistent data storage and retrieval systems, including RDBMS, backup systems and (distributed) object storage.

#wikimedia-data-persistence

SRE Infrastructure Foundations

SRE Infrastructure Foundations - Automation and Networking (cumin, netbox, puppet, spicerack).

The team focuses on building and maintaining our base platform (“metal cloud”) that forms the foundations which nearly everything else in our infrastructure builds upon. On top of our bare metal deployments, their responsibilities include (but are not limited to) configuration management systems, infrastructure automation, orchestration tooling, infrastructure security and network operations.

#wikimedia-sre-foundations

SRE Observability

SRE Observability - Monitoring and Logging (Prometheus/Grafana and ElasticSearch, plus some Kafka).

The Observability team, or "o11y" for short, works across SRE and Technology to provide teams with tools, platforms and insights into how systems and services are performing. It leverages technologies such as Grafana, Kibana/Logstash, Prometheus, AlertManager and more.

#wikimedia-observability

SRE Service Operations

SRE Service Operations - MediaWiki Operations and Supporting Services (Kubernetes, memcached, redis, Infrastructure for: Gitlab, OTRS, Phabricator).

The Service Operations team takes care of public and “user-visible” services alongside Technology and Product teams. This includes, for example, our MediaWiki platform as well as the newer (micro)services that comprise our stack. It also covers miscellaneous services and components that we rely upon (such as Phabricator, mail systems and OTRS). The team is also building our new SOA service infrastructure based on Kubernetes.

#wikimedia-serviceops

SRE Traffic

SRE Traffic - Caching and DNS (ATS, varnish, GeoDNS, wikidough).

The Traffic team is responsible for the critical first layer of high-traffic infrastructure which now spans much of the globe, including our TLS termination and caching layers (ATS, Varnish), load balancing, DNS and our own network.

#wikimedia-traffic


In addition to the Core SRE teams above, various other teams within the Foundation have Embedded SRE teams, each with its own delegated areas of responsibility.

Data Engineering

Data Engineering - Hadoop, Hive, Presto, Druid, Airflow, Superset, Jupyter, some Kafka, Cassandra, Kubernetes, and Ceph.

Our team provides a self-service, privacy-aware data platform that empowers people to gain data-driven insights and build better product experiences for Wikimedia communities. We maintain the big data platform including the data lake, ingestion and processing pipelines, as well as a number of systems to explore and visualize the data.

#wikimedia-analytics

Machine Learning

Machine Learning - Lift Wing Kubernetes cluster, ORES

Our team designs, builds, and maintains the Foundation's machine learning infrastructure. We plan, train, deploy, and manage production machine learning models created or requested by Wikimedia teams or wiki communities. We also develop best practices for applied ethical machine learning.

#wikimedia-ml

Cloud Services

Cloud Services - VPS, Toolforge

The Wikimedia Cloud Services team (WMCS) is a subteam of the Technical Engagement team responsible for maintaining and extending the existing Wikimedia Cloud Services computing infrastructure (virtual private server (VPS)), the Toolforge hosting environment (platform as a service (PaaS)), and many additional supporting technologies used in the Cloud Services environment.

#wikimedia-cloud

Platform Engineering

Platform Engineering - MediaWiki, Data Platform, API Platform, Platform Operations, WMF Dumps and the Wikimedia service infrastructure

We shepherd Wikimedia’s essential software and infrastructure technologies enabling our users and developers to unlock free knowledge.

Check SRE Team Requests to see the most common types of requests.
