We run an instance of DataHub which acts as a centralized data catalog, intended to facilitate the following:
- Discovery by potential users of the various data stores operated by WMF.
- Documentation of the data structures, formats, access rights, and other associated details.
- Governance of these data stores, including details of retention, sanitization, and the recording of changes over time.
The URL for the web interface for DataHub is: https://datahub.wikimedia.org
Access to this service requires a Wikimedia developer account, and access is currently limited to members of the nda LDAP group. Authentication is performed by the CAS-SSO single sign-on system.
Generalized Metadata Service
The URL for the DataHub Generalized Metadata Service (GMS) is: https://datahub-gms.discovery.wmnet:30443
The GMS service is not public-facing and is only available from our private networks. Authentication has not yet been enabled on this interface, although it is planned.
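As a quick sketch (assuming the standard DataHub GMS /config endpoint, and run from a host on our private networks), you can confirm that the GMS is reachable and see which version it is running:

# Run from a private-network host such as a stat server. The system CA bundle
# is passed explicitly so that the certificate issued by our internal PKI is trusted.
curl --cacert /etc/ssl/certs/ca-certificates.crt https://datahub-gms.discovery.wmnet:30443/config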
Via the CLI
The datahub CLI can be used to interact with the DataHub API from one of the stat hosts. To do this, SSH onto one of these hosts, say stat1004.eqiad.wmnet, and run the following commands to install datahub (skip them if you already have it installed):
cat << EOF > ~/.datahubenv
gms:
  server: https://datahub-gms.discovery.wmnet:30443
  token: ''
EOF

set_proxy
source /opt/conda-analytics/etc/profile.d/conda.sh
conda-analytics-clone datahub-env
conda activate datahub-env
pip install acryl-datahub
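To confirm the installation worked, you can check the CLI version; as noted in the manual ingestion section further down, the CLI version should match the version of the server:

# Prints the installed CLI version.
datahub version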
Once you have acryl-datahub installed in your activated conda environment, run the following commands to use it:
export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
datahub get --urn 'urn:li:dataset:(urn:li:dataPlatform:kafka,MetadataChangeEvent_v4,PROD)'  # should work!
Accessing the Staging Instance
We have a deployment of DataHub on the staging cluster, which is used as a path to production for testing version upgrades and new features.
This is not public-facing and at present requires an SSH tunnel in order to reach it. Also, since we switched to using OIDC authentication, the tunnel needs to listen on port 443 on the local host. Therefore, in order to get to the DataHub frontend in staging you can do the following on your workstation:
- Add an entry to your local /etc/hosts file like this, pointing the staging frontend hostname at the loopback address:

127.0.0.1 datahub-frontend.k8s-staging.discovery.wmnet
- Allow the SSH client to bind privileged ports:
sudo setcap 'cap_net_bind_service=+ep' /usr/bin/ssh
- Open an SSH tunnel through a deployment server:
ssh -N -L 443:k8s-ingress-staging.svc.eqiad.wmnet:443 deploy1002.eqiad.wmnet
- Browse to https://datahub-frontend.k8s-staging.discovery.wmnet/ and accept the security warning about an unknown issuer. The TLS certificate is issued by our PKI, specifically the discovery intermediate CA, which will likely be an unknown issuer as far as your browser is concerned.
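Once the tunnel is up, you can sanity-check the staging frontend from the command line before opening it in a browser. This is only a sketch and assumes the /etc/hosts entry and SSH tunnel from the steps above are in place:

# -k skips certificate verification, since the discovery intermediate CA is
# unlikely to be in your local trust store.
curl -kI https://datahub-frontend.k8s-staging.discovery.wmnet/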
The DataHub instance is composed of several components, each built from the same codebase:
- a metadata server (or GMS)
- a frontend web application
- an mce consumer (metadata change event)
- an mae consumer (metadata audit event)
All of these components are stateless and currently run on the Wikikube Kubernetes clusters.
Backend Data Tiers
The stateful components of the system are:
- a MariaDB database on the analytics-meta database instance
- an Opensearch cluster running on three VMs named
- an instance of Karapace, which acts as a schema registry
- a number of Kafka topics
Our Opensearch cluster fulfils two roles:
- a search index
- a graph database
There is a design document for the DataHub service (restricted to WMF staff).
We had previously carried out a Data Catalog Application Evaluation and subsequently the decision was taken to use DataHub and to implement an MVP deployment.
We have several key sources of metadata.
Currently ingestion can be performed by any machine on our private networks, including the stats servers.
We are moving to automated and regularly scheduled metadata ingestion using Airflow. Please check back soon for updated documentation on this topic.
Manual Ingestion Example
The following procedure should help you get started with manual ingestion; a sketch of the first three steps follows the list.
- Select a stats server for your use.
- Activate a conda environment.
- Configure the HTTP proxy servers in your shell.
- Install the necessary Python modules.
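Steps 1 to 3 might look like the following, mirroring the installation commands earlier on this page; the host and environment names are only examples:

ssh stat1004.eqiad.wmnet             # step 1: pick a stats server

# step 2: create and activate a conda environment (on the stat host)
source /opt/conda-analytics/etc/profile.d/conda.sh
conda-analytics-clone datahub-env
conda activate datahub-env

# step 3: configure the HTTP proxy servers in your shell
set_proxy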
# It's very important to install the same version of the CLI as the server
# that's running, otherwise ingestion will not work.
pip install acryl-datahub==0.10.4
datahub version
datahub init
# When prompted by 'datahub init', enter the GMS server:
# server: https://datahub-gms.discovery.wmnet:30443
Then create a recipe file, installing more plugins if required.
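As an illustration, a minimal recipe might look like the following. The source section here is only a placeholder (the broker address is hypothetical); the sink points at our GMS as configured above:

cat << EOF > recipe.yaml
source:
  type: kafka
  config:
    connection:
      bootstrap: "localhost:9092"   # placeholder; use the real broker address
sink:
  type: datahub-rest
  config:
    server: https://datahub-gms.discovery.wmnet:30443
EOF

With the recipe in place, run the ingestion: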
datahub ingest -c recipe.yaml
Some example recipes, including for Hive, Kafka, and Druid, are available on this ticket.