Data Engineering/Currently working on

Currently Working On

The Data Engineering team is currently partnering with the Platform Engineering, Machine Learning and Search teams to deliver a number of important data platform goals. In mid-2022 we formed small, matrixed Value Stream groups that are working on the goals listed below.

Value Streams


Value Stream	Goals	PM	EM	Board	Roadmap Link
Data Platform: Events	Conclude the work from the Event Stream Experiments; Deliver a consolidated, enriched and ordered stream that is available to the community; Deliver a way for internal teams to query the current state of MediaWiki with a delay of 3-4 hours, removing the reliance on the monthly dumps; Deploy Flink to the new DSE k8s cluster as an experimental/development environment; Deploy Flink to a production multi-DC environment; Build tooling to support engineers who want to build event-driven services; Build event-driven data integration services that allow teams to be agnostic of the underlying database architecture; Build a current state store to allow bootstrapping of services and a view of the current state of MediaWiki	Luke	Will	#event-platform	Event Platform Roadmap on Miro
Data Platform: Pipelines & Services	Deliver a way for engineers, analysts and data users to create, deploy and test their own data pipelines; Make Airflow multi-tenant, with APIs; Deliver clear documentation on how to write, deploy and monitor pipelines; Deliver APIs that let users hook their own development environments into Airflow; Deliver a consistent and reliable Airflow experience to teams who need it; Allow for creation of data pipelines that interact with our data, without Kerberos acting as a blocker; Deploy Airflow to k8s (ideally DSE); Provide a CI/CD interface for deploying and monitoring data pipelines; Migrate existing ETL jobs to Airflow; Support the Structured Data team's implementation of SDAW grant work: Section Topics Data Pipeline (Q1) and Section Level Image Suggestions (Q2)	Luke	Olja	#Data-Pipelines	Data Pipeline work for SDAW: SDAW Miro Board; Data Pipeline Roadmap; Data Engineering/Systems/Airflow; Analytics/Systems/Cluster/Spark/Migration to Spark 3
Data Platform: Shared Data Infrastructure	Deploy Data Science and Engineering Kubernetes Cluster: Deploy a k8s cluster using existing training wing hardware; Deploy a high-performance Ceph cluster for persistent volume storage; Expand initial cluster with additional compute nodes; Deploy a stateless pilot (Kubeflow or Flink?); Deploy a stateful pilot (Data Warehouse); Migrate JupyterHub	Luke	Olja	#DSE(A)-Cluster	Shared Data Infrastructure Roadmap ppt
Data Platform: Metrics & Experimentation	Development and adoption of client libraries to generate MP Events; Integrate feature flag functionality into the Metrics Platform libraries; Deliver a mechanism to run A/B tests		Will	#Metrics_Platform	Metrics & Experimentation Roadmap
Data Platform: Community Datasets (Dumps)	Automate the generation of Dumps (using a data pipeline?); Leverage the events experiments to make it more incremental?	Luke	Will

Other Collaborative Work

We are also working with a number of other data teams (Product Analytics, Research and GDI) to implement a data catalog and drive its adoption, providing a centralized location to discover and document datasets and metrics. For more information about our data catalog, see DataHub.