Event Platform/SLO/Flink Kubernetes Operator/

Service

A Kubernetes operator that lets users manage Flink applications and their lifecycle through native k8s tooling such as kubectl. Wikimedia maintains a Helm chart for the upstream Apache project's operator at https://github.com/wikimedia/operations-deployment-charts/tree/master/charts/flink-kubernetes-operator. The operator enables the use of the FlinkDeployment CRD to deploy native Flink clusters in single application mode. All Wikimedia Flink clusters should be deployed atop this service using the provided flink-app chart template.

Teams

The Data Platform Engineering and Search teams share responsibility for this service.

Architectural

Environmental dependencies

The operator runs on k8s main (eqiad, codfw), staging and dse-k8s. Deploying this service requires admin privileges.

Service dependencies

Client-facing

Clients

A Helm chart template to deploy Flink applications using the operator is available at https://github.com/wikimedia/operations-deployment-charts/tree/master/charts/flink-app.

Clients are expected to implement their own deployments via a set of helmfiles.

Request Classes

What are the request classes, and how is a request classified? If your service has only one request class, delete this section.

Service Level Indicators (SLIs)

  • Service uptime (availability): measured by the Kubernetes pod uptime metric. When the operator pods are down, the existing Flink clusters won’t be operational and new clusters can’t be deployed.
  • Deployed clusters state: the percentage of Flink clusters in STABLE state, as opposed to MISSING or ERROR state.
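
As a rough illustration of how the two SLIs above reduce to ratios, here is a minimal Python sketch. The function names and the sample inputs are hypothetical; in practice the values would come from the operator's metrics rather than from hard-coded lists, and the exact metric names are not specified here.

```python
from collections import Counter
from typing import Iterable


def availability(up_samples: Iterable[bool]) -> float:
    """Fraction of samples in which the operator pods were up.

    Each sample is a hypothetical boolean reading of pod uptime
    taken at a regular interval over the reporting window.
    """
    samples = list(up_samples)
    if not samples:
        raise ValueError("no samples in window")
    return sum(samples) / len(samples)


def stable_ratio(states: Iterable[str]) -> float:
    """Fraction of observed Flink clusters that are in STABLE state.

    Any other state (for example MISSING or ERROR) counts against the SLI.
    """
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no clusters observed")
    return counts["STABLE"] / total


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    pod_up = [True] * 95 + [False] * 5
    cluster_states = ["STABLE", "STABLE", "STABLE", "MISSING", "ERROR"]
    print(f"availability: {availability(pod_up):.1%}")
    print(f"stable clusters: {stable_ratio(cluster_states):.1%}")
```

Treating every non-STABLE state as a failure is an assumption of this sketch; the SLI definition may choose to exclude transitional states.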

Operational

Monitoring

The Flink Kubernetes operator emits time series metrics (counters, gauges) for all SLIs. They are available in Grafana.

Logs are available in Logstash (ECS index).

Troubleshooting

TBD. Needs input from SRE / admins.

Deployment

The service is deployed with deployment-charts. See https://github.com/wikimedia/operations-deployment-charts/tree/master/charts/flink-kubernetes-operator.

Deploying this operator implies a restart of the clusters that run atop it.

Service Level Objectives

Realistic targets

TBD: this is a relaxed target that would not be acceptable to dependent systems (e.g. WDQS has an SLA for serving pages with a lag < 10 min, 95%). Looking at metrics on main, this looks achievable; the question is whether the responsible teams can commit to it.

A realistic target for availability would be 80% operator pod uptime.

A realistic target for deployed clusters state would be that deployments are STABLE 80% of the time.

Ideal targets

An ideal target for availability would be 99% operator pod uptime.

An ideal target for deployed clusters state would be that deployments are STABLE 99% of the time.
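
To make the gap between the realistic and ideal targets concrete, the following Python sketch converts an uptime target into the downtime it permits. The 30-day window and the function name are assumptions chosen for illustration; the actual reporting window for this SLO is not stated here.

```python
from datetime import timedelta


def downtime_budget(target: float, window: timedelta) -> timedelta:
    """Return the downtime allowed by an uptime target over a window.

    target is a fraction between 0 and 1 (for example 0.99 for 99%).
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return window * (1.0 - target)


if __name__ == "__main__":
    # Hypothetical 30-day reporting window.
    window = timedelta(days=30)
    for label, target in (("realistic", 0.80), ("ideal", 0.99)):
        print(f"{label} ({target:.0%}): {downtime_budget(target, window)}")
```

Under the assumed 30-day window, the 80% target allows 144 hours (6 days) of downtime, while the 99% target allows about 7.2 hours.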

Reconciliation

Reconcile the realistic vs. ideal targets, documenting any decisions made along the way.

Once the SLO is final, consider collapsing the above three sections.

What are the agreed-upon SLOs, for each SLI and each request class?