Data Engineering/Evaluations/2021 data catalog selection/Rubric

This page evaluates potential data catalog solutions. It relates to the Data-as-a-Service objective of the Medium Term Plan.


Requirements

Since this is a central and critical part of a healthy data culture, a multitude of requirements with complex inter-relationships factor into our decision. This section attempts to highlight the requirements that we think would most impact us over the first year that the data catalog is available to users.

  • Easily ingest metadata from various parts of our Shared Infrastructure
    • lineage: Airflow, Gobblin, custom ingestion (Flink, Spark)
    • tables and columns: Hive metastore, Druid, Cassandra, Elasticsearch, and custom jobs writing to HDFS via Spark (also relevant to our Airflow DAG strategy); see the sketch after this list
    • lots of other metadata: ownership, location, quality, automatic classifications. All tools have places to store these, but automation here is key, as it saves the souls of Data Stewards.
  • Authentication. Which of our many sign-on mechanisms are we going to use, and does the tool support it?
  • Authorization. Will we need fine-grained control over data, and are we going to deploy Apache Ranger? Do we need to reflect the way we treat data in the metadata layer? Can we allow anyone to edit any metadata?
  • UX. We are not yet a mature, data-savvy organization. This will be many people's first experience with our data landscape, and it needs to be pleasant if we are to deliver on our goal of making data a first-class citizen here at WMF.
  • Search. This is part of UX, but it is a major component of most of the candidates, and the candidates differ: Atlas, for example, is reported to have comparatively weak search. This makes it an important stand-alone consideration.
  • Speed of ingestion. Do we need real-time updates to our metadata? Are we planning automated responses to certain changes in a way that would be easily centralized on top of the metadata catalog?
  • Privacy (data retention, transformations, compliance). One of the things we pride ourselves on is privacy: retaining only what we need, for as long as we need it. An overall picture of how compliant we are could differ from our self-assessment, so this seems like an important consideration. More generally, we need a high-level overview of the data we keep and how it meets, or fails to meet, a privacy budget.
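
To make the "tables and columns" ingestion requirement concrete, here is a minimal sketch of the kind of custom job a catalog needs where no ready-made connector exists: it walks Hive databases over HiveServer2 with pyhive and collects each table's columns. The hostname is a placeholder, and error handling, partition metadata, and incremental updates are left out.

  # Minimal sketch: enumerate Hive tables and columns, the raw material a
  # data catalog ingests. The host is a placeholder for illustration only.
  from pyhive import hive

  conn = hive.connect(host="hive-server2.example.net", port=10000)
  cur = conn.cursor()

  cur.execute("SHOW DATABASES")
  for (db,) in cur.fetchall():
      cur.execute(f"SHOW TABLES IN {db}")
      for (table,) in cur.fetchall():
          # DESCRIBE returns (column name, type, comment) rows.
          cur.execute(f"DESCRIBE {db}.{table}")
          columns = [(name, dtype) for name, dtype, _ in cur.fetchall()]
          print(db, table, columns)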

This list means we're not focusing right now on some of the other aspects of data governance, such as spelling out policies, tracking compliance, and following strict processes. These seem possible to build on top of most of the solutions we're looking at.

General Considerations

  • Ingestion is going to be a big deal. For the parts of our data platform that do not expose metadata in a convenient way, we need to build custom metadata ingestion. Some of the candidate tools make this easier than others.
  • We should carefully survey each tool's list of connectors. Everyone says "we have a flexible connector architecture" and "the community builds lots of high-quality connectors", but on closer inspection you may find poor support for something we need, such as the lack of Spark support in Atlas.
  • Give extra points for tight integrations, like using DataHub as a lineage backend for Airflow; a sketch of declaring such lineage follows this list.
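
To illustrate that last bullet: with the acryl-datahub Airflow plugin configured as Airflow's lineage backend, task-level lineage can be declared directly on operators, roughly as below. The DAG and the Hive table names are hypothetical, and the exact wiring depends on the Airflow and plugin versions in use.

  # Sketch: declaring task lineage that a DataHub lineage backend can emit.
  # Assumes the acryl-datahub Airflow plugin is installed and configured as
  # the lineage backend in airflow.cfg; table names are hypothetical.
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash import BashOperator
  from datahub_provider.entities import Dataset

  with DAG(
      dag_id="webrequest_rollup",
      start_date=datetime(2022, 3, 1),
      schedule_interval="@daily",
  ) as dag:
      rollup = BashOperator(
          task_id="rollup",
          bash_command="echo 'run the real Spark job here'",
          inlets=[Dataset("hive", "wmf.webrequest")],       # upstream dataset
          outlets=[Dataset("hive", "wmf.webrequest_agg")],  # downstream dataset
      )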

Candidates

Click on each candidate's name below to see its in-depth evaluation as a separate article.

Atlas

  • Tagline: Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.
  • Release date: 2015
  • Website: https://atlas.apache.org
  • Repository: https://github.com/apache/atlas
  • Author: Authored by Hortonworks, managed by Apache
  • License: Apache 2.0
  • UX: Java application via Jetty. The new UI is the default in version 2.2.0; the legacy UI is still available.
  • Robustness (criteria TBD): Community support is lacking.
  • Comment: Has a variety of back-end storage options, including BerkeleyDB, HBase, and Cassandra.

Amundsen

  • Tagline: Open source data discovery and metadata engine.
  • Release date: 2019
  • Website: https://www.amundsen.io
  • Repository: https://github.com/amundsen-io/amundsen
  • Author: Lyft
  • License: Apache 2.0
  • UX: Flask application with a React frontend.
  • Robustness (criteria TBD): Ingestion components seem unfinished.
  • Comment: Has a variety of back-end storage options, including RDBMS and Neo4j. Can also make use of Atlas, but it is not a requirement.

DataHub

  • Tagline: The Metadata Platform for the Modern Data Stack.
  • Release date: 2019
  • Website: https://datahubproject.io
  • Repository: https://github.com/linkedin/datahub
  • Author: LinkedIn
  • License: Apache 2.0
  • UX: Java application via Jetty with a React frontend.
  • Robustness (criteria TBD): Difficult to ascertain. No significant issues detected so far.
  • Risks:
    • Dependency on other Kafka ecosystem tools like Schema Registry, Kafka Streams, etc. These may not be tightly coupled, but LinkedIn has no incentive to stay away from the Confluent licenses that we can't use, so at any point we could run into a problem here.
    • Authorization seems to be just in the RFC phase, with no LDAP support in the first planned phase.

OpenMetadata

  • Tagline: Open Standard for Metadata. A single place to discover, collaborate, and get your data right.
  • Release date: 2021
  • Website: https://open-metadata.org/
  • Repository: https://github.com/open-metadata/openmetadata
  • Author: Suresh Srinivas (formerly of Hortonworks and Uber)
  • License: Apache 2.0
  • UX: React frontend.
  • Robustness (criteria TBD): Fairly nascent project.
  • Comment: Requires MySQL 8.0.

Egeria

  • Tagline: Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor.
  • Release date: 2018
  • Website: https://odpi.github.io/egeria-docs/
  • Repository: https://github.com/odpi/egeria
  • Author: LFAI
  • License: Apache 2.0
  • UX: Initially not considered; basic exploration is now available.
  • Comment:
    • Very flexible distributed deployment.
    • Big names involved (IBM, ING, etc.).
    • Good but almost too extensive documentation.
    • Great community, but definitely more corporate, more in the Microsoft / IBM open source style.
    • Really solid candidate, with lots of stuff like: "The OMAG Server Platform is a multi-tenant platform that supports horizontal scale-out in Kubernetes and yet is light enough to run as an edge server on a Raspberry Pi. This platform is used to host the actual metadata integration and automation capabilities."

Marquez

  • Tagline: An open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more.
  • Release date: 2018
  • Website: https://marquezproject.github.io/marquez/
  • Repository: https://github.com/MarquezProject/marquez
  • Author: WeWork
  • License: Apache 2.0
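
To make the custom-ingestion consideration concrete for a candidate like DataHub, the sketch below pushes a dataset description into a DataHub instance using the Python REST emitter from recent versions of the acryl-datahub package. The GMS URL, table name, and properties are hypothetical.

  # Sketch: pushing custom metadata to DataHub with its Python REST emitter.
  # Server URL and dataset name are hypothetical.
  from datahub.emitter.mce_builder import make_dataset_urn
  from datahub.emitter.mcp import MetadataChangeProposalWrapper
  from datahub.emitter.rest_emitter import DatahubRestEmitter
  from datahub.metadata.schema_classes import DatasetPropertiesClass

  emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

  # URN identifying a Hive table in the PROD environment.
  urn = make_dataset_urn(platform="hive", name="wmf.webrequest_sampled", env="PROD")

  emitter.emit(
      MetadataChangeProposalWrapper(
          entityUrn=urn,
          aspect=DatasetPropertiesClass(
              description="Sampled webrequest data (hypothetical example).",
              customProperties={"steward": "Data Engineering"},
          ),
      )
  )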

Other Candidates

With reasons they were not more seriously considered:

Metacat

  • Tagline: Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3 and Cassandra. Metacat provides you information about what data you have, where it resides and how to process it. Metadata in the end is really data about the data. So the primary purpose of Metacat is to give a place to describe the data so that we could do more useful things with it.
  • Link: https://github.com/Netflix/metacat
  • Disqualifying reasons: Documentation is still in the "TODO" phase, there are no references to a community or to the kind of organization that Apache projects enjoy, and the scope is somewhat limited.

Select Star

  • Tagline: Beyond a data catalog, Select Star is an intelligent data discovery platform that helps you understand your data.
  • Link: https://www.selectstar.com/
  • Disqualifying reasons: Closed source; useful for comparisons only.

Dataverse

  • Tagline: Open source research data repository software.
  • Link: https://dataverse.org/
  • Disqualifying reasons: This is more of a research-sharing tool, not used for generic data governance. See the code at https://github.com/IQSS/dataverse and the related documentation.

CKAN

  • Tagline: The world’s leading open source data management system.
  • Link: https://ckan.org/ (licensed GNU AGPL 3.0)
  • Disqualifying reasons: CKAN is meant to work at a very large scale: governments with multiple branches collaborating on data hubs. As such, most of the integrations are meant to be done manually, with only minimal automation support. Details can be found in their docs, but it doesn't seem to meet our requirements. It may be worth considering for something bigger, like an Open Knowledge Data Portal shared with our other open knowledge partners.

MediaWiki

  • Tagline: MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.
  • Link: https://mediawiki.org
  • Disqualifying reasons: While building a metadata ingestion and storage layer on top of MediaWiki would be a fun side project, the complexity of the competitors' UIs makes it clear that this would only work if a lightweight, manual process were viable. Still, high-level catalog information should be documented on Wikitech.

Evaluation Deployments

We decided to create four evaluation deployments of different candidates in order to assess their suitability against the requirements above.

Atlas

  • Ticket: task T296670
  • Version(s): 2.2.0, HEAD
  • Deployed location: an-test-coord1001
  • Backend components:
    • BerkeleyDB
    • Solr
  • Hive ingested: No; the Hive version is incompatible.
  • Druid ingested: Not attempted.
  • Kafka ingested: Not attempted.

DataHub

  • Ticket: task T299703
  • Version(s): 0.8.24
  • Deployed location: stat1008
  • Backend components:
    • MariaDB 10.4 (an-test-coord1001)
    • Neo4j 4.0.6 community edition
    • OpenSearch 1.2.4 community edition
    • Kafka 5.4
    • Zookeeper
    • Schema Registry
  • Hive ingested: Yes, production. Using a Kerberos-secured connection to the hive-server2 service via pyhive (see the connection sketch after this section).
  • Druid ingested: Yes, both public and analytics clusters.
  • Kafka ingested: Yes, topics only; not associated with schemas yet.

OpenMetadata

  • Ticket: task T300540
  • Version(s): 0.8.0
  • Deployed location: an-test-client1001
  • Backend components:
    • MySQL 8.0
  • Hive ingested: Yes, test. Using a Kerberos-secured connection to the hive-server2 service via pyhive.
  • Druid ingested: Not attempted.
  • Kafka ingested: Not attempted.

Amundsen

  • Ticket: task T300756
  • Version(s): HEAD (search 3.0, frontend 4.0, metadata 3.10)
  • Deployed location: stat1008
  • Backend components:
    • Neo4j 3.5.30 community edition
    • Elasticsearch 7.13.3
  • Hive ingested: Yes, production. Using a MySQL connection to an-coord1001.
  • Druid ingested: Yes, analytics cluster.
  • Kafka ingested: Not attempted.
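
For reference, the Kerberos-secured pyhive connections used for the DataHub and OpenMetadata Hive ingestions look roughly like the sketch below. The hostname is a placeholder rather than our real hive-server2 address, and a valid Kerberos ticket (obtained via kinit) is assumed.

  # Rough sketch of a Kerberos-secured HiveServer2 connection via pyhive.
  # The hostname is a placeholder; a valid Kerberos ticket (kinit) is assumed.
  from pyhive import hive

  conn = hive.connect(
      host="hive-server2.example.net",  # placeholder, not our real host
      port=10000,
      auth="KERBEROS",                  # SASL/GSSAPI (Kerberos) mechanism
      kerberos_service_name="hive",     # matches the hive/<host>@REALM principal
  )
  cursor = conn.cursor()
  cursor.execute("SHOW DATABASES")
  print(cursor.fetchall())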