Currently Working On
The Data Engineering team is currently partnering with the Platform Engineering, Machine Learning and Search teams to deliver on a number of important data platform goals. In mid-2022 we formed small, matrixed Value Stream groups that are working on the goals listed below.
|Data Platform: Events
- Conclude the work from the Event Stream Experiments
- Deliver a consolidated, enriched and ordered stream that is available to the community
- Deliver a way for internal teams to query the current state of MediaWiki with a delay of 3-4 hours, removing the reliance on the monthly dumps
- Deploy Flink to the new DSE k8s cluster as an experimental/development environment
- Deploy Flink to a production multi-DC environment
- Build tooling to support engineers who want to build event-driven services
- Build event-driven data integration services that allow teams to be agnostic of the underlying database architecture
- Build a current state store to allow bootstrapping of services and a view of the current state of MediaWiki
||Event Platform Roadmap on Miro
|Data Platform: Pipelines & Services
- Deliver a way for engineers, analysts and data users to create, deploy and test their own data pipelines.
- Make Airflow multi-tenant, with APIs.
- Deliver clear documentation on how to write, deploy and monitor pipelines.
- Deliver APIs that let users hook their own development environments into Airflow.
- Deliver a consistent and reliable Airflow experience to teams who need it.
- Allow for the creation of data pipelines that interact with our data, without Kerberos acting as a blocker.
- Deploy Airflow to Kubernetes (ideally the DSE cluster).
- Provide a CI/CD interface for deploying and monitoring data pipelines.
- Migrate existing ETL jobs to Airflow.
- Support the Structured Data team's implementation of SDAW grant work: Section Topics Data Pipeline (Q1) and Section Level Image Suggestions (Q2)
||Data Pipeline work for SDAW: SDAW Miro Board
Data Pipeline Roadmap
Analytics/Systems/Cluster/Spark/Migration to Spark 3
|Data Platform: Shared Data Infrastructure
- Deploy Data Science and Engineering Kubernetes Cluster
- Deploy a Kubernetes cluster using existing training wing hardware
- Deploy a high-performance Ceph cluster for persistent volume storage.
- Expand initial cluster with additional compute nodes.
- Deploy a stateless pilot (Kubeflow or Flink?)
- Deploy a stateful pilot (Data Warehouse)
- Migrate JupyterHub
||Shared Data Infrastructure Roadmap ppt
|Data Platform: Metrics & Experimentation
- Develop and drive adoption of client libraries to generate Metrics Platform (MP) events.
- Integrate feature flag functionality into the Metrics Platform libraries
- Deliver a mechanism to run A/B tests using the libraries.
||Metrics & Experimentation Roadmap
|Data Platform: Community Datasets (Dumps)
- Automate the generation of Dumps (using a data pipeline?)
- Leverage the Event Stream Experiments to make Dumps generation more incremental?
Other Collaborative Work
We are also working with a number of other data teams (Product Analytics, Research and GDI) to implement and drive data catalog adoption by providing a centralized location to discover and document datasets and metrics. For more information about our data catalog see Datahub.