Data Engineering/Evaluations/2021 data catalog selection/Rubric

This page evaluates potential data catalog solutions. It relates to the Data-as-a-Service objective of the Medium Term Plan.


Requirements

Since this is a central and critical part of a healthy data culture, a multitude of requirements with complex inter-relationships factor into our decision. This section attempts to highlight the requirements that we think would most impact us over the first year that the data catalog is available to users.

  • Easily ingest metadata from various parts of our Shared Infrastructure
    • lineage: Airflow, Gobblin, custom ingestion (Flink, Spark)
    • tables and columns: Hive metastore, Druid, Cassandra, Elasticsearch, and custom jobs writing to HDFS via Spark (also relevant to our Airflow DAG strategy); see the sketch after this list
    • lots of other metadata: ownership, location, quality, automatic classifications. All tools have places to store these, but automation here is key, as it saves the souls of Data Stewards.
  • Authentication. Which of our many sign-on mechanisms are we going to use, and does the tool support it?
  • Authorization. Will we need fine-grained control over data, and are we going to deploy Apache Ranger? Do we need to reflect the way we treat data in the metadata layer? Can we allow anyone to edit any metadata?
  • UX. We are not yet a mature, data-savvy organization. This will be many people's first experience with our data landscape, and it needs to be pleasant if we are to deliver on our goal of making data a first-class citizen here at WMF.
  • Search. This is part of UX, but it is a major component of most of the candidates, and the candidates differ: Atlas, for example, is reported to have comparatively weak search. This makes it an important stand-alone consideration.
  • Speed of ingestion. Do we need real-time updates to our metadata? Are we planning automated responses to certain changes in a way that would be easily centralized on top of the metadata catalog?
  • Privacy (data retention, transformations, compliance). One of the things we pride ourselves on is privacy: retaining only what we need, for as long as we need it. An overall picture of how compliant we are could differ from our self-assessment, so this seems like an important consideration. More generally, we need a high-level overview of the data we keep and how it meets, or fails to meet, a privacy budget.
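
To make the "tables and columns" ingestion requirement concrete, here is a minimal sketch of the kind of custom job a catalog needs where no ready-made connector exists: it walks Hive databases over HiveServer2 with pyhive and collects each table's columns. The hostname is a placeholder, and error handling, partition metadata, and incremental updates are left out.

  # Minimal sketch: enumerate Hive tables and columns, the raw material a
  # data catalog ingests. The host is a placeholder for illustration only.
  from pyhive import hive

  conn = hive.connect(host="hive-server2.example.net", port=10000)
  cur = conn.cursor()

  cur.execute("SHOW DATABASES")
  for (db,) in cur.fetchall():
      cur.execute(f"SHOW TABLES IN {db}")
      for (table,) in cur.fetchall():
          # DESCRIBE returns (column name, type, comment) rows.
          cur.execute(f"DESCRIBE {db}.{table}")
          columns = [(name, dtype) for name, dtype, _ in cur.fetchall()]
          print(db, table, columns)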

This list means we're not focusing right now on some of the other aspects of data governance, such as spelling out policies, tracking compliance, and following strict processes. These seem possible to build on top of most of the solutions we're looking at.

General Considerations

  • Ingestion is going to be a big deal. For the parts of our data platform that do not expose metadata in a convenient way, we need to build custom metadata ingestion. Some of the candidate tools make this easier than others.
  • We should carefully survey each tool's list of connectors. Everyone says "we have a flexible connector architecture" and "the community builds lots of high-quality connectors", but on closer inspection you may find poor support for something we need, such as the lack of Spark support in Atlas.
  • Give extra points for tight integrations, like using DataHub as a lineage backend for Airflow; a sketch of declaring such lineage follows this list.
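
To illustrate that last bullet: with the acryl-datahub Airflow plugin configured as Airflow's lineage backend, task-level lineage can be declared directly on operators, roughly as below. The DAG and the Hive table names are hypothetical, and the exact wiring depends on the Airflow and plugin versions in use.

  # Sketch: declaring task lineage that a DataHub lineage backend can emit.
  # Assumes the acryl-datahub Airflow plugin is installed and configured as
  # the lineage backend in airflow.cfg; table names are hypothetical.
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash import BashOperator
  from datahub_provider.entities import Dataset

  with DAG(
      dag_id="webrequest_rollup",
      start_date=datetime(2022, 3, 1),
      schedule_interval="@daily",
  ) as dag:
      rollup = BashOperator(
          task_id="rollup",
          bash_command="echo 'run the real Spark job here'",
          inlets=[Dataset("hive", "wmf.webrequest")],       # upstream dataset
          outlets=[Dataset("hive", "wmf.webrequest_agg")],  # downstream dataset
      )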

Candidates

Click on each candidate's name below to see its in-depth evaluation as a separate article.

Atlas

  • Tagline: Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.
  • Release date: 2015
  • Website: https://atlas.apache.org
  • Repository: https://github.com/apache/atlas
  • Author: Authored by Hortonworks, managed by Apache
  • License: Apache 2.0
  • UX: Java application via Jetty. The new UI is the default in version 2.2.0; the legacy UI is still available.
  • Robustness (criteria TBD): Community support is lacking.
  • Comment: Has a variety of back-end storage options, including BerkeleyDB, HBase, and Cassandra.

Amundsen

  • Tagline: Open source data discovery and metadata engine.
  • Release date: 2019
  • Website: https://www.amundsen.io
  • Repository: https://github.com/amundsen-io/amundsen
  • Author: Lyft
  • License: Apache 2.0
  • UX: Flask application with a React frontend.
  • Robustness (criteria TBD): Ingestion components seem unfinished.
  • Comment: Has a variety of back-end storage options, including RDBMS and Neo4j. Can also make use of Atlas, but it is not a requirement.

DataHub

  • Tagline: The Metadata Platform for the Modern Data Stack.
  • Release date: 2019
  • Website: https://datahubproject.io
  • Repository: https://github.com/linkedin/datahub
  • Author: LinkedIn
  • License: Apache 2.0
  • UX: Java application via Jetty with a React frontend.
  • Robustness (criteria TBD): Difficult to ascertain. No significant issues detected so far.
  • Risks:
    • Dependency on other Kafka ecosystem tools like Schema Registry, Kafka Streams, etc. These may not be tightly coupled, but LinkedIn has no incentive to stay away from the Confluent licenses that we can't use, so at any point we could run into a problem here.
    • Authorization seems to be just in the RFC phase, with no LDAP support in the first planned phase.

OpenMetadata

  • Tagline: Open Standard for Metadata. A single place to discover, collaborate, and get your data right.
  • Release date: 2021
  • Website: https://open-metadata.org/
  • Repository: https://github.com/open-metadata/openmetadata
  • Author: Suresh Srinivas (formerly of Hortonworks and Uber)
  • License: Apache 2.0
  • UX: React frontend.
  • Robustness (criteria TBD): Fairly nascent project.
  • Comment: Requires MySQL 8.0.

Egeria

  • Tagline: Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor.
  • Release date: 2018
  • Website: https://odpi.github.io/egeria-docs/
  • Repository: https://github.com/odpi/egeria
  • Author: LFAI
  • License: Apache 2.0
  • UX: Initially not considered; basic exploration is now available.
  • Comment:
    • Very flexible distributed deployment.
    • Big names involved (IBM, ING, etc.).
    • Good but almost too extensive documentation.
    • Great community, but definitely more corporate, more in the Microsoft / IBM open source style.
    • Really solid candidate, with lots of stuff like: "The OMAG Server Platform is a multi-tenant platform that supports horizontal scale-out in Kubernetes and yet is light enough to run as an edge server on a Raspberry Pi. This platform is used to host the actual metadata integration and automation capabilities."

Marquez

  • Tagline: An open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more.
  • Release date: 2018
  • Website: https://marquezproject.github.io/marquez/
  • Repository: https://github.com/MarquezProject/marquez
  • Author: WeWork
  • License: Apache 2.0
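
To make the custom-ingestion consideration concrete for a candidate like DataHub, the sketch below pushes a dataset description into a DataHub instance using the Python REST emitter from recent versions of the acryl-datahub package. The GMS URL, table name, and properties are hypothetical.

  # Sketch: pushing custom metadata to DataHub with its Python REST emitter.
  # Server URL and dataset name are hypothetical.
  from datahub.emitter.mce_builder import make_dataset_urn
  from datahub.emitter.mcp import MetadataChangeProposalWrapper
  from datahub.emitter.rest_emitter import DatahubRestEmitter
  from datahub.metadata.schema_classes import DatasetPropertiesClass

  emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

  # URN identifying a Hive table in the PROD environment.
  urn = make_dataset_urn(platform="hive", name="wmf.webrequest_sampled", env="PROD")

  emitter.emit(
      MetadataChangeProposalWrapper(
          entityUrn=urn,
          aspect=DatasetPropertiesClass(
              description="Sampled webrequest data (hypothetical example).",
              customProperties={"steward": "Data Engineering"},
          ),
      )
  )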

Other Candidates

With reasons they were not more seriously considered:

Metacat

  • Tagline: Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3 and Cassandra. Metacat provides you information about what data you have, where it resides and how to process it. Metadata in the end is really data about the data. So the primary purpose of Metacat is to give a place to describe the data so that we could do more useful things with it.
  • Link: https://github.com/Netflix/metacat
  • Disqualifying reasons: Documentation is still in the "TODO" phase, there are no references to a community or to the kind of organization that Apache projects enjoy, and the scope is somewhat limited.

Select Star

  • Tagline: Beyond a data catalog, Select Star is an intelligent data discovery platform that helps you understand your data.
  • Link: https://www.selectstar.com/
  • Disqualifying reasons: Closed source; useful for comparisons only.

Dataverse

  • Tagline: Open source research data repository software.
  • Link: https://dataverse.org/
  • Disqualifying reasons: This is more of a research-sharing tool, not used for generic data governance. See the code at https://github.com/IQSS/dataverse and the related documentation.

CKAN

  • Tagline: The world’s leading open source data management system.
  • Link: https://ckan.org/ (licensed GNU AGPL 3.0)
  • Disqualifying reasons: CKAN is meant to work at a very large scale: governments with multiple branches collaborating on data hubs. As such, most of the integrations are meant to be done manually, with only minimal automation support. Details can be found in their docs, but it doesn't seem to meet our requirements. It may be worth considering for something bigger, like an Open Knowledge Data Portal shared with our other open knowledge partners.

MediaWiki

  • Tagline: MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.
  • Link: https://mediawiki.org
  • Disqualifying reasons: While building a metadata ingestion and storage layer on top of MediaWiki would be a fun side project, the complexity of the competitors' UIs makes it clear that this would only work if a lightweight, manual process were viable. Still, high-level catalog information should be documented on Wikitech.

Evaluation Deployments

We decided to create four evaluation deployments of different candidates in order to assess their suitability against the requirements above.

Atlas

  • Ticket: task T296670
  • Version(s): 2.2.0, HEAD
  • Deployed location: an-test-coord1001
  • Backend components:
    • BerkeleyDB
    • Solr
  • Hive ingested: No; the Hive version is incompatible.
  • Druid ingested: Not attempted.
  • Kafka ingested: Not attempted.

DataHub

  • Ticket: task T299703
  • Version(s): 0.8.24
  • Deployed location: stat1008
  • Backend components:
    • MariaDB 10.4 (an-test-coord1001)
    • Neo4j 4.0.6 community edition
    • OpenSearch 1.2.4 community edition
    • Kafka 5.4
    • Zookeeper
    • Schema Registry
  • Hive ingested: Yes, production. Using a Kerberos-secured connection to the hive-server2 service via pyhive (see the connection sketch after this section).
  • Druid ingested: Yes, both public and analytics clusters.
  • Kafka ingested: Yes, topics only; not associated with schemas yet.

OpenMetadata

  • Ticket: task T300540
  • Version(s): 0.8.0
  • Deployed location: an-test-client1001
  • Backend components:
    • MySQL 8.0
  • Hive ingested: Yes, test. Using a Kerberos-secured connection to the hive-server2 service via pyhive.
  • Druid ingested: Not attempted.
  • Kafka ingested: Not attempted.

Amundsen

  • Ticket: task T300756
  • Version(s): HEAD (search 3.0, frontend 4.0, metadata 3.10)
  • Deployed location: stat1008
  • Backend components:
    • Neo4j 3.5.30 community edition
    • Elasticsearch 7.13.3
  • Hive ingested: Yes, production. Using a MySQL connection to an-coord1001.
  • Druid ingested: Yes, analytics cluster.
  • Kafka ingested: Not attempted.
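
For reference, the Kerberos-secured pyhive connections used for the DataHub and OpenMetadata Hive ingestions look roughly like the sketch below. The hostname is a placeholder rather than our real hive-server2 address, and a valid Kerberos ticket (obtained via kinit) is assumed.

  # Rough sketch of a Kerberos-secured HiveServer2 connection via pyhive.
  # The hostname is a placeholder; a valid Kerberos ticket (kinit) is assumed.
  from pyhive import hive

  conn = hive.connect(
      host="hive-server2.example.net",  # placeholder, not our real host
      port=10000,
      auth="KERBEROS",                  # SASL/GSSAPI (Kerberos) mechanism
      kerberos_service_name="hive",     # matches the hive/<host>@REALM principal
  )
  cursor = conn.cursor()
  cursor.execute("SHOW DATABASES")
  print(cursor.fetchall())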