Data Engineering/Currently working on


Currently Working On

The Data Engineering team is currently partnering with the Platform Engineering, Machine Learning and Search teams to deliver a number of important data platform goals. In mid-2022 we formed small, matrixed Value Stream groups that are working on the goals listed below.

Value Streams

Value Stream | Goals | PM | EM | Board | Roadmap Link
Data Platform: Events
  • Conclude the work from the Event Stream Experiments
  • Deliver a consolidated, enriched and ordered stream that is available to the community
  • Deliver a way for internal teams to query the current state of MediaWiki with a delay of 3-4 hours, removing the reliance on the monthly dumps
  • Deploy Flink to the new DSE k8s cluster as an experimental/development environment
  • Deploy Flink to a production multi-dc environment
  • Build tooling to support engineers who want to build event-driven services
  • Build event-driven data integration services that allow teams to be agnostic of the underlying database architecture
  • Build a current state store to allow bootstrapping of services and a view of the current state of MediaWiki
PM: Luke; EM: Will; Board: #event-platform; Roadmap: Event Platform Roadmap on Miro
Data Platform: Pipelines & Services
  • Deliver a way for engineers, analysts and data users to create, deploy and test their own data pipelines.
    • Make Airflow multi-tenant, with APIs.
    • Deliver clear documentation on how to write, deploy and monitor pipelines.
    • Deliver APIs that let users hook their own development environments into Airflow.
  • Deliver a consistent and reliable Airflow experience to teams who need it.
    • Allow the creation of data pipelines that interact with our data, without Kerberos acting as a blocker.
    • Deploy Airflow to Kubernetes (ideally the DSE cluster).
    • Provide a CI/CD interface for deploying and monitoring data pipelines.
  • Migrate existing ETL jobs to Airflow.
  • Support the Structured Data team's implementation of SDAW grant work: Section Topics Data Pipeline (Q1) and Section Level Image Suggestions (Q2)
PM: Luke; EM: Olja; Board: #Data-Pipelines; Roadmap links:
  • Data Pipeline work for SDAW: SDAW Miro Board
  • Data Pipeline Roadmap
  • Data Engineering/Systems/Airflow
  • Analytics/Systems/Cluster/Spark/Migration to Spark 3

Data Platform: Shared Data Infrastructure
  • Deploy the Data Science and Engineering (DSE) Kubernetes cluster
    • Deploy a Kubernetes cluster using existing training wing hardware.
    • Deploy a high-performance Ceph cluster for persistent volume storage.
    • Expand the initial cluster with additional compute nodes.
  • Deploy a stateless pilot (Kubeflow or Flink?)
  • Deploy a stateful pilot (Data Warehouse)
  • Migrate JupyterHub
PM: Luke; EM: Olja; Board: #DSE(A)-Cluster; Roadmap: Shared Data Infrastructure Roadmap (ppt)
Data Platform: Metrics & Experimentation
  • Develop and drive adoption of client libraries to generate Metrics Platform (MP) events.
  • Integrate feature flag functionality into the Metrics Platform libraries.
  • Deliver a mechanism to run A/B tests using the libraries.
Will #Metrics_Platform Metrics & Experimentation Roadmap
Data Platform: Community Datasets (Dumps)
  • Automate the generation of Dumps (using a data pipeline?)
  • Leverage the event stream experiments to make Dumps generation more incremental?
PM: Luke; EM: Will

Other Collaborative Work

We are also working with a number of other data teams (Product Analytics, Research and GDI) to implement and drive data catalog adoption by providing a centralized location to discover and document datasets and metrics. For more information about our data catalog see Datahub.