Data Platform/Systems/DataHub

From Wikitech
[Screenshot: the DataHub user interface]

We run an instance of DataHub which acts as a centralized data catalog, intended to facilitate the following:

  • Discovery by potential users of the various data stores operated by WMF.
  • Documentation of the data structures, formats, access rights, and other associated details.
  • Governance of these data stores, including details of retention, sanitization, recording changes over time.

Accessing DataHub

Frontend

The URL for the web interface for DataHub is: https://datahub.wikimedia.org

Access to this service requires a Wikimedia developer account and access is currently limited to members of the wmf or nda LDAP groups. Authentication will be performed by the CAS-SSO single-sign-on system.

Generalized Metadata Service

The DataHub Generalized Metadata Service (GMS) is: https://datahub-gms.svc.eqiad.wmnet:30443

The GMS service is not public facing and it is only available from our private networks. Currently we have not enabled authentication on this interface, although it is planned.
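From a host on the private network you can confirm the service is up; a minimal sketch (the /config endpoint is part of stock DataHub GMS, and the trailing guard keeps the snippet harmless when run elsewhere):

```shell
# The GMS endpoint from this page; only resolvable on WMF private networks.
GMS_URL="https://datahub-gms.svc.eqiad.wmnet:30443"

# Ask GMS for its configuration; /config is a standard DataHub GMS endpoint.
curl -s "${GMS_URL}/config" \
  --cacert /etc/ssl/certs/ca-certificates.crt \
  || echo "GMS is only reachable from the private network"
```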

Via the CLI

The datahub CLI can be used to interact with the DataHub API from one of the stat hosts. To do this, ssh onto one of these hosts, say stat1004.eqiad.wmnet, and run the following commands to install datahub (skip them if you already have it installed):

# Point the CLI at our GMS endpoint
cat << EOF > ~/.datahubenv
gms:
  server: https://datahub-gms.svc.eqiad.wmnet:30443
  token: ''
EOF
# Configure the HTTP proxy so pip can reach PyPI
set_proxy
# Create and activate a dedicated conda environment, then install the CLI
source /opt/conda-analytics/etc/profile.d/conda.sh
conda-analytics-clone datahub-env
conda activate datahub-env
pip install acryl-datahub

Once you have acryl-datahub installed in your activated conda environment, run the following commands to use it:

export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
datahub get --urn 'urn:li:dataset:(urn:li:dataPlatform:kafka,MetadataChangeEvent_v4,PROD)' # should work!
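The CLI can also fetch a single aspect of an entity rather than the whole record; a hedged sketch (the --aspect flag is available in recent acryl-datahub releases, the aspect name here is illustrative, and the guard makes the snippet a no-op where the CLI is absent):

```shell
# The same example dataset URN used above.
DATASET_URN='urn:li:dataset:(urn:li:dataPlatform:kafka,MetadataChangeEvent_v4,PROD)'

if command -v datahub >/dev/null 2>&1; then
  export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
  # Fetch only the ownership aspect of the dataset.
  datahub get --urn "$DATASET_URN" --aspect ownership
else
  echo "datahub CLI not installed"
fi
```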

Accessing the Staging Instance

The staging instance is accessible via https://datahub.wikimedia.org

Service Overview

DataHub Components

The DataHub instance is composed of several components, each built from the same codebase:

  • a metadata server (GMS)
  • a frontend web application
  • an MCE (metadata change event) consumer
  • an MAE (metadata audit event) consumer

All of these components are stateless and currently run on the Wikikube Kubernetes clusters.

Their containers are built using the Deployment pipeline; the configuration for this is in the wmf branch of our fork of the datahub repository.

Backend Data Tiers

The stateful components of the system are:

  • a MariaDB database on the analytics-meta database instance
  • an OpenSearch cluster running on three VMs named datahubsearch100[1-3].eqiad.wmnet
  • an instance of Karapace, which acts as a schema registry
  • a number of Kafka topics

Our OpenSearch cluster fulfils two roles:

  • a search index
  • a graph database
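The health of the cluster serving both roles can be checked from the private network with a standard OpenSearch API call; a sketch (port 9200 is the OpenSearch default and is an assumption here, and the guard keeps the example harmless off-network):

```shell
# One of the three OpenSearch VMs named on this page.
OS_HOST="datahubsearch1001.eqiad.wmnet"

# _cluster/health is a standard OpenSearch API endpoint.
curl -s "https://${OS_HOST}:9200/_cluster/health?pretty" \
  || echo "cluster only reachable from WMF private networks"
```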

The design document for the DataHub service (restricted to WMF staff).

We previously carried out a Data Catalog Application Evaluation; the subsequent decision was to adopt DataHub and implement an MVP deployment.

Metadata Sources

We have several key sources of metadata.

Ingestion

Currently ingestion can be performed by any machine on our private networks, including the stats servers.

Automated Ingestion

We are moving to automated and regularly scheduled metadata ingestion using Airflow. Please check back soon for updated documentation on this topic.

Manual Ingestion Example

The following procedure should help to get started with manual ingestion.

  1. Select a stats server for your use.
  2. Activate a conda environment.
  3. Configure the HTTP proxy servers in your shell (run set_proxy)
  4. Install the necessary python modules
# It's very important to install the same CLI version as the server is running;
# otherwise ingestion will not work.
pip install acryl-datahub==0.10.4
datahub version
datahub init
# when prompted by datahub init, enter the GMS server URL:
server: https://datahub-gms.svc.eqiad.wmnet:30443

Then create a recipe file, installing additional source plugins if required. Run the ingestion with the CA bundle set in the environment:

REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt datahub ingest -c recipe.yaml
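As an illustration of what a recipe looks like, here is a minimal hypothetical Kafka recipe; the broker hostname is a placeholder, not our actual configuration, while the source/sink layout is standard DataHub recipe structure:

```yaml
# recipe.yaml -- hypothetical example; the bootstrap hostname is a placeholder
source:
  type: kafka
  config:
    connection:
      bootstrap: "broker.example.wmnet:9092"

sink:
  type: datahub-rest
  config:
    server: "https://datahub-gms.svc.eqiad.wmnet:30443"
```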

Some example recipes, including Hive, Kafka, and Druid, are available on this ticket.

Operations

See OpenSearch#Operations

Administration

https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/DataHub/Administration