Event Platform/SLO/Flink Kubernetes Operator/
This page is currently a draft. More information and discussion about changes to this draft can be found on the talk page.
Service
A Kubernetes Operator that allows users to manage Flink applications and their lifecycle through native k8s tooling like kubectl. Wikimedia maintains a Helm chart for the upstream Apache operator at https://github.com/wikimedia/operations-deployment-charts/tree/master/charts/flink-kubernetes-operator. The operator enables the use of the FlinkDeployment CRD to deploy native Flink clusters in single application mode. All Wikimedia Flink clusters should be deployed atop this service using the provided flink-app chart template.
Teams
The Data Platform Engineering team and the Search team share responsibility for this service.
Architectural
Environmental dependencies
The operator runs on k8s main (eqiad, codfw), staging and dse-k8s. Deploying this service requires admin privileges.
Service dependencies
Client-facing
Clients
A Helm chart template to deploy Flink applications using the operator is available at https://github.com/wikimedia/operations-deployment-charts/tree/master/charts/flink-app.
Clients are expected to implement their own deployments via a set of helmfiles.
Request Classes
What are the request classes, and how is a request classified? If your service has only one request class, delete this section.
Service Level Indicators (SLIs)
- Service uptime (availability): measured by the Kubernetes pod uptime metric. When the pods are down, the existing Flink cluster won't be operational and new clusters can't be deployed.
- Deployed clusters state: the percentage of Flink clusters in STABLE vs MISSING or ERROR state.
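As a rough illustration of how the two SLIs above could be computed from sampled data, here is a minimal Python sketch. The function names, the sampling scheme (one up/down sample per scrape interval, one state per deployed cluster), and the example values are assumptions made for illustration; they are not the operator's actual metric names or the queries used in Grafana.

```python
from collections import Counter


def availability(up_samples):
    """Fraction of samples in which the operator pod was up (1) rather than down (0)."""
    if not up_samples:
        raise ValueError("at least one sample is required")
    return sum(up_samples) / len(up_samples)


def stable_ratio(cluster_states):
    """Fraction of deployed Flink clusters reported as STABLE.

    Any other state (for example MISSING or ERROR) counts against the SLI.
    """
    if not cluster_states:
        raise ValueError("at least one cluster state is required")
    counts = Counter(cluster_states)
    return counts["STABLE"] / len(cluster_states)


if __name__ == "__main__":
    # Hypothetical samples: one pod up/down reading per scrape interval.
    up = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
    # Hypothetical snapshot: one state per deployed cluster.
    states = ["STABLE", "STABLE", "ERROR", "STABLE", "MISSING"]
    print(f"availability: {availability(up):.0%}")
    print(f"clusters STABLE: {stable_ratio(states):.0%}")
```

In practice these ratios would be evaluated over the SLO reporting window rather than over a handful of samples.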
Operational
Monitoring
The Flink Kubernetes operator emits timeseries metrics (counters, gauges) for all SLIs. They are available in Grafana.
Logs are available in Logstash (ECS index).
Troubleshooting
TBD. Needs input from SRE / admins.
Deployment
The service is deployed with deployment-charts. See https://github.com/wikimedia/operations-deployment-charts/tree/master/charts/flink-kubernetes-operator.
Deploying this operator implies a restart of the clusters that run atop it.
Service Level Objectives
Realistic targets
TBD: this is a relaxed target that would not be acceptable to dependent systems (e.g. WDQS has an SLA for serving pages with a lag < 10 min 95% of the time). Looking at metrics on main, this looks achievable; the question is whether the responsible teams can commit to it.
A realistic target for availability would be 80% operator pod uptime.
A realistic target for deployed clusters state would be that deployments are STABLE 80% of the time.
Ideal targets
An ideal target for availability would be 99% operator pod uptime.
An ideal target for deployed clusters state would be that deployments are STABLE 99% of the time.
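To make the gap between the realistic (80%) and ideal (99%) targets concrete, the sketch below converts each target into the downtime it tolerates. The 30-day window is an assumption chosen for illustration, since this draft does not yet fix a reporting period, and `allowed_downtime` is a hypothetical helper name.

```python
from datetime import timedelta


def allowed_downtime(target, window=timedelta(days=30)):
    """Downtime budget implied by an uptime target over a reporting window.

    `target` is a fraction between 0 and 1 (0.99 means 99% uptime).
    The 30-day default window is an assumption, not an agreed value.
    """
    if not 0 <= target <= 1:
        raise ValueError("target must be between 0 and 1")
    return window * (1 - target)


if __name__ == "__main__":
    for label, target in (("realistic", 0.80), ("ideal", 0.99)):
        budget = allowed_downtime(target)
        hours = budget.total_seconds() / 3600
        print(f"{label} target {target:.0%}: about {hours:.1f} hours of downtime allowed")
```

Under this assumed window, the 80% target tolerates roughly six days of operator downtime, while the 99% target tolerates only a few hours, which may help frame the reconciliation discussion.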
Reconciliation
Reconcile the realistic vs. ideal targets, documenting any decisions made along the way.
Once the SLO is final, consider collapsing the above three sections.
What are the agreed-upon SLOs, for each SLI and each request class?