Data Engineering/Evaluations/2021 data catalog selection/Rubric/DataHub

Core Service and Dependency Setup

DataHub was downloaded from https://github.com/linkedin/datahub/ onto stat1008 and tag 0.8.24 was checked out.

The build process required internet access, so the web proxy settings had to be supplied in several places. The build is driven by Gradle and was invoked along these lines:

./gradlew -Dhttp.proxyHost=webproxy -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy -Dhttps.proxyPort=8080 "-Dhttp.nonProxyHosts=127.0.0.1|localhost|*.wmnet" build

Any problems with the build were worked around by carrying out the build on a workstation instead.

DataHub does not have a supported deployment method that doesn't use containers (i.e. Docker), so in order to complete the setup of each of the required components the steps from each Dockerfile were carried out manually.

Core DataHub Services

All DataHub components have an option to enable a Prometheus JMX exporter, but this was not configured as part of the evaluation.
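
If it were enabled, the JMX_OPTS variable referenced in the service units below would carry the standard Prometheus JMX exporter Java agent flag, set in the relevant env file. A minimal sketch, assuming a locally downloaded copy of the agent jar and an exporter configuration file (both paths are hypothetical):

# Hypothetical: expose JMX metrics for Prometheus on port 9102.
JMX_OPTS=-javaagent:/home/btullis/src/datahub/jmx_prometheus_javaagent.jar=9102:/home/btullis/src/datahub/jmx_exporter.yaml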

Metadata Service (GMS)

This runs as a Jetty web application. The daemon is managed by a systemd user service that uses the following key configuration.

EnvironmentFile=/home/btullis/src/datahub/datahub/docker/datahub-gms/env/local.env
ExecStart=/usr/bin/java $JAVA_OPTS $JMX_OPTS -jar jetty-runner.jar --jar jetty-util.jar --jar jetty-jmx.jar ./war.war
WorkingDirectory=/home/btullis/src/datahub/datahub/docker/datahub-gms

This service listens on port 8080.

Frontend Service

This is a combination of a Play Framework application with a React frontend. Similarly to the GMS service, it is controlled by a systemd user service with the following configuration.

EnvironmentFile=/home/btullis/src/datahub/datahub/docker/datahub-frontend/env/local.env
ExecStart=/home/btullis/src/datahub/datahub/docker/datahub-frontend/datahub-frontend/bin/playBinary
WorkingDirectory=/home/btullis/src/datahub/datahub/docker/datahub-frontend

This service listens on port 9000.

It uses JAAS for authentication. Initially it uses a flat file with a fixed datahub/datahub username and password, but we could use LDAP for this, or possibly CAS.
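
As a rough illustration of what an LDAP-based setup might look like (we did not configure this during the evaluation), the frontend's JAAS realm could point at the JDK's LdapLoginModule instead of the flat file. The realm name below is the frontend's default; the LDAP URL and DN pattern are placeholders:

WHZ-Authentication {
  com.sun.security.auth.module.LdapLoginModule sufficient
    userProvider="ldaps://ldap.example.org:636/ou=people,dc=example,dc=org"
    authIdentity="uid={USERNAME},ou=people,dc=example,dc=org"
    useSSL=true;
};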

Metadata Change Event (MCE) Consumer Job

(Figure: DataHub ingestion architecture)

This is a Kafka consumer that works on the ingestion side of DataHub. It reads jobs from a Kafka topic and applies each change to the persistent storage back-end, then emits a corresponding audit event for the MAE consumer job to pick up.

Metadata Audit Event (MAE) Consumer Job

This is a Kafka consumer that is more related to the serving side of DataHub. It picks up MAE jobs from the Kafka topic and updates the search indices and the graph database.

(Figure: DataHub serving architecture)

Confluent Platform Services

A binary release of the Confluent Platform 5.4 was extracted to /home/btullis/src/datahub/confluent on stat1008 and this was used to run the following components with a default configuration.

Zookeeper

We ran a local Zookeeper instance on stat1008 using a systemd user service with the following configuration:

Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
ExecStart=/home/btullis/src/datahub/confluent/bin/zookeeper-server-start etc/kafka/zookeeper.properties
ExecStop=/home/btullis/src/datahub/confluent/bin/zookeeper-server-stop
WorkingDirectory=/home/btullis/src/datahub/confluent/

Kafka

We ran a standalone broker on stat1008 using a systemd user service with the following configuration:

Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
ExecStart=/home/btullis/src/datahub/confluent/bin/kafka-server-start etc/kafka/server.properties
ExecStop=/home/btullis/src/datahub/confluent/bin/kafka-server-stop
WorkingDirectory=/home/btullis/src/datahub/confluent/

The required topics were created using the steps recorded at T299703.
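
Broadly, each topic was created with the kafka-topics tool from the Confluent distribution, along these lines (a single partition and replica, since this is a standalone broker; the full topic list is in T299703):

bin/kafka-topics --create --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1 --topic MetadataChangeEvent_v4
bin/kafka-topics --create --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1 --topic MetadataAuditEvent_v4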

Schema Registry

A Schema Registry is a required component of DataHub, but the Confluent implementation is released under the Confluent Community License, which is cost-free but not sufficiently permissive for us to use. We have been discussing with the DataHub maintainers whether there is a workaround for this requirement. They have suggested one workaround, which is to use Karapace, and they have also created a feature request to obviate the requirement.

Karapace was initially researched, but for the purposes of this evaluation we proceeded with the Confluent Schema Registry. This was set up as a systemd user unit in the same way as the other Confluent components.
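
The unit followed the same pattern as the other Confluent components and likely looked something like this, with the registry listening on its default port of 8081 (the port that the Kafka ingestion recipe below points at):

Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
ExecStart=/home/btullis/src/datahub/confluent/bin/schema-registry-start etc/schema-registry/schema-registry.properties
ExecStop=/home/btullis/src/datahub/confluent/bin/schema-registry-stop
WorkingDirectory=/home/btullis/src/datahub/confluent/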

Search Services

A binary distribution of OpenSearch 1.2.4 was extracted to /home/btullis/src/datahub/opensearch on stat1008.

This was configured to run as a systemd user service, which simply executed bin/opensearch.

The only configuration required was to disable security.
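
For reference, disabling security in OpenSearch 1.x comes down to a couple of lines in config/opensearch.yml; a minimal sketch, assuming a single-node setup:

# Standalone node; disable the bundled security plugin so that DataHub
# can connect over plain HTTP without authentication.
discovery.type: single-node
plugins.security.disabled: true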

Indices were pre-created according to the steps recorded at T299703.

The only issue was that the index lifecycle policy could not be applied, but this would not necessarily pose a significant problem for a prototype. We would be able to work out similar settings for a production version.

Graph Database Services

A binary distribution of Neo4J community edition version 4.0.6 was extracted to /home/btullis/src/datahub/neo4j on stat1008.

This was configured to run as a systemd user service, which simply executed bin/neo4j console.

The only configuration required was to set the default username/password to neo4j/datahub and to enable the bolt authentication mechanism.
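
A sketch of the corresponding steps, assuming the neo4j-admin tool and conf/neo4j.conf from the same distribution:

# Set the initial password for the default neo4j user before first start.
bin/neo4j-admin set-initial-password datahub

# In conf/neo4j.conf: keep authentication enabled and expose the Bolt
# connector, which is what the GMS service uses to reach the graph database.
dbms.security.auth_enabled=true
dbms.connector.bolt.enabled=true
dbms.connector.bolt.listen_address=:7687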

Ingestion Configuration

Once all of the services were running, we could move on to the ingestion side. The recipes for ingestion are clearly explained and appear well-polished in comparison with the other systems evaluated.

All ingestion components use Python, so a simple conda environment was created and the plugins were installed using pip, for example:

pip install 'acryl-datahub[datahub-rest]'
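
Each metadata source also needs its own plugin; for the sources covered below this meant something along the lines of:

pip install 'acryl-datahub[hive]' 'acryl-datahub[kafka]' 'acryl-datahub[druid]'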

Hive Ingestion

This method used a connection to the HiveServer2 service, as opposed to the Metastore (as used by Atlas) or Hive's MySQL database (as used by Amundsen).

It required a working pyhive configuration and used a user's own Kerberos ticket.

The following recipe ingested all of our Hive tables, with the exception of one which caused an error and had to be omitted.

source:
  type: "hive"
  config:
    host_port: analytics-hive.eqiad.wmnet:10000
    options:
      connect_args:
        auth: 'KERBEROS'
        kerberos_service_name: hive
    table_pattern:
      deny:
        - 'gage.webrequest_bad_json'
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'

This pipeline was then executed with the command datahub ingest -c hive.yml.

Kafka Ingestion

We ingested Kafka topic names from the kafka-jumbo cluster. No schema information was associated with the topic names, although automatic schema association might be possible if we make more effective use of the Schema Registry component. The following recipe ingested the Kafka topics.

source:
  type: "kafka"
  config:
    connection:
      bootstrap: "kafka-jumbo1001.eqiad.wmnet:9092"
      schema_registry_url: http://localhost:8081
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'

Druid Ingestion

Druid was particularly simple to ingest, given that the data sources currently have no authentication. We ingested data from both the analytics cluster and the public cluster.
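
A minimal sketch of the kind of recipe used, with a hypothetical broker address standing in for the real analytics and public cluster endpoints:

source:
  type: "druid"
  config:
    # Placeholder address; one recipe per cluster, pointing at its broker.
    host_port: "druid-broker.example:8082"
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'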

The result was 27 datasets with full schema information.

Progress Status

Progress with DataHub was good. It would have been nice to spend a little more time looking at the Airflow-based ingestion and the support for lineage, but we moved on to other evaluation candidates after successfully ingesting Hive, Kafka, and Druid.

Perceptions

DataHub seems like a well-managed project with a vibrant community and solid backing from a commercial entity (LinkedIn) with a proven track record of open-source project support.

It's true that it's a pre-1.0 product and that some features such as data quality reporting and granular authorization are not yet finished, but the project's roadmap shows that they are high on the agenda.

The community has been responsive to all of our questions and has offered to make their engineering staff available to us in a private Slack channel in support of our MVP.

Outcome

DataHub has been proposed as the primary candidate to be taken forward to a full MVP phase and, hopefully, a subsequent production deployment.