Event Platform/Onboarding
Stream Enrichment Processing with PyFlink
This section summarizes the steps necessary to onboard a new streaming application to Wikimedia's Event Platform infrastructure.
If you'd like to get started with developing Python streaming applications, a good starting point is the eventutilities-python tutorial and documentation.
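To give a flavor of what such an application does, the sketch below shows the enrichment pattern in plain Python: a function that takes an incoming event (a dict) and returns an enriched copy. The function and field names here are illustrative only, not part of the eventutilities-python API; see the tutorial for the real interface.

```python
import copy

def enrich(event: dict) -> dict:
    """Return an enriched copy of an incoming event (illustrative only).

    In a real eventutilities-python application, a function with this
    shape is what you register with the stream processing framework;
    here we only show the transformation pattern.
    """
    enriched = copy.deepcopy(event)
    # Hypothetical enrichment: derive a language code from the wiki
    # domain in meta.domain (e.g. "en.wikipedia.org" -> "en").
    domain = enriched.get("meta", {}).get("domain", "")
    enriched["language"] = domain.split(".")[0] if domain else None
    return enriched

event = {"meta": {"domain": "en.wikipedia.org", "stream": "example.stream"}}
print(enrich(event)["language"])  # -> en
```

Note that the input event is left untouched; enrichment functions should be pure transformations so the framework can retry them safely.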
To start the onboarding process, contact the Event Platform team (#event-platform on Slack) and create a new Phabricator task using the Event Platform onboarding template.
Steps required to onboard a new application to Wikimedia's internal infrastructure
A deployable application must be dockerized, and its image must be available in our internal Docker registry. To allow publishing images there, the application's Git repository must be added to GitLab's Trusted Runners: open an MR against https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/ with RelEng in CC (example: https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/14).
If you created an application using eventutilities-python's cookiecutter template, a Blubber file and Kokkuri GitLab CI pipeline templates are already provided. This should be all you need to publish images internally following Deployment Pipeline best practices.
Event schema
- Create a schema for the new event. Schemas are managed in a schema monorepo and exposed via https://schema.wikimedia.org. An application should bundle a local checkout of the schema repository (this happens automatically if the app is generated using our cookiecutter template).
- How to do it:
  - Open a merge request (GitLab) against:
    - https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-primary (for use cases at MediaWiki and Wikidata level of SLO)
    - https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary (for non user-facing features)
- Stakeholders:
  - Event Platform team
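Event Platform events conventionally carry a `$schema` field (a relative schema URI) and a `meta.stream` field naming the destination stream. The snippet below is a minimal, hand-rolled sanity check of just those envelope conventions; it is a sketch only, since real validation happens against the JSON Schema in the schema repository.

```python
def check_event_conventions(event: dict) -> list:
    """Return a list of problems with an event's Event Platform envelope.

    Only the conventional fields ($schema, meta.stream) are checked;
    full validation is done against the JSON Schema served by
    https://schema.wikimedia.org.
    """
    problems = []
    if not str(event.get("$schema", "")).startswith("/"):
        problems.append("$schema must be a relative schema URI, e.g. /example/1.0.0")
    if not event.get("meta", {}).get("stream"):
        problems.append("meta.stream must name the destination stream")
    return problems

ok = {"$schema": "/example/1.0.0", "meta": {"stream": "example.stream"}}
print(check_event_conventions(ok))  # -> []
```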
Stream configuration
- Declare a stream configuration in the EventStreamConfig MediaWiki extension.
- Information you will need:
  - Input stream name
  - Output stream name
  - Event schema
  - EventGate instance
- How to do it:
  - Open a change request (Gerrit) against:
    - mediawiki-config
- Documentation:
  - https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration
- Stakeholders:
  - Event Platform (code review, EventGate restarts)
  - MediaWiki deployers (deployment)
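For orientation, a stream configuration entry is roughly shaped like the following, shown here as an equivalent Python dict (in mediawiki-config it is declared in PHP). The stream and schema names are placeholders, and the exact set of settings should be checked against the Stream Configuration documentation.

```python
# Hypothetical stream configuration entry, expressed as a Python dict.
# In mediawiki-config the equivalent lives in PHP under $wgEventStreams.
stream_config = {
    "example.my_stream": {
        # Title of the JSON Schema the stream's events must validate against.
        "schema_title": "example/my_schema",
        # Which EventGate instance accepts events for this stream.
        "destination_event_service": "eventgate-main",
    }
}
print(sorted(stream_config["example.my_stream"]))
```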
Deployment to k8s
- Information you will need:
  - SLO: in particular, guidelines for troubleshooting and availability requirements
  - Resources (cores, memory, number of containers)
  - Traffic (throughput, payload size)
  - Dependencies
  - Which Kafka cluster will be used
- How to do it:
  - The application should be onboarded to k8s main. See https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments
  - Grafana dashboards for Event Platform applications are auto-generated with key metrics (Flink App)
  - The application will be assigned a new k8s namespace
  - Create helmfiles for staging / eqiad / codfw
  - A Swift bucket (for Flink HA) must be requested from Data Persistence SRE
  - Onboard on Alertmanager (liaise with Observability SRE). See https://wikitech.wikimedia.org/wiki/Alertmanager#Onboard
- Stakeholders:
  - Event Platform
  - SRE
  - RelEng
  - Data Persistence SRE
  - Observability
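When putting together the traffic figures above, a back-of-the-envelope calculation like the following is usually enough (all numbers here are made up for illustration):

```python
# Back-of-the-envelope Kafka traffic estimate (all numbers hypothetical).
events_per_second = 500      # expected peak throughput
avg_payload_bytes = 2_000    # average serialized event size
replication_factor = 3       # typical Kafka replication

bytes_per_second = events_per_second * avg_payload_bytes
total_with_replication = bytes_per_second * replication_factor

print(f"{bytes_per_second / 1e6:.1f} MB/s produced, "
      f"{total_with_replication / 1e6:.1f} MB/s including replication")
```

Estimates like this help size the Kafka cluster choice and the resource requests in your helmfiles.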
Make the stream public
- Information you will need:
  - The stream and event schema must go through a Security and Privacy review
  - The EventStreams service needs to declare the stream as public
- Stakeholders:
  - Security
  - Event Platform: update & deploy EventStreams
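Once public, the stream is served over Server-Sent Events (SSE) at https://stream.wikimedia.org/v2/stream/<stream_name>. The helper below parses SSE `data:` lines into event dicts; the stream name and sample payload are placeholders, and this is a sketch of client-side handling (real SSE payloads can span multiple `data:` lines), not part of the EventStreams deployment itself.

```python
import json

def parse_sse_events(lines):
    """Yield event dicts from Server-Sent Events 'data:' lines.

    EventStreams (https://stream.wikimedia.org) serves public streams as
    SSE; each event's JSON payload arrives on a 'data:' line. This
    simplified parser assumes one 'data:' line per event.
    """
    for line in lines:
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

# Sample SSE frames as they might arrive from /v2/stream/<stream_name>.
sample = [
    "event: message",
    'data: {"meta": {"stream": "example.stream"}, "value": 1}',
    "",
]
for event in parse_sse_events(sample):
    print(event["meta"]["stream"])  # -> example.stream
```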
General Workflow Overview
The following assumes you have decided to build an event-driven data pipeline using the Event Platform team's capabilities, and outlines what is needed at each step.
