Data Catalog Evaluation
What is a Data Catalog?
A data catalog is an inventory of data asset metadata that allows data consumers to discover and evaluate data for analytical and product uses. Data catalogs focus on findability, accessibility, interoperability, and reuse – the four FAIR data principles – which become critical bottlenecks in data management if left unaddressed. A catalog is also a valuable tool for data governance and data management, providing an interface for data definitions, provenance, and access control.
Problem Statement
Our data lake has served as the Foundation's primary repository for data in both raw and processed forms. It has enabled us to store and analyze the vast amounts of data generated by users interacting with our projects. However, simply centralizing data in the lake has not solved our critical data management challenges: findability, accessibility, interoperability, and reuse. Currently we try to meet these needs with dataset documentation on Wikitech and metadata descriptions in schemas and in Hive tables. This has become costly to maintain and is ultimately insufficient for upholding the FAIR data principles as we scale our data practices.
Interest in the data collected by our systems has grown dramatically in the past few years with the introduction of new features and an increasing focus on backing our decisions with data. This has increased the urgency for a better set of data management tools. One such tool we are investigating is a data catalog.
Impact Hypothesis
By successfully implementing and integrating a catalog solution as part of our data management strategy, we would bring our data ecosystem more in line with the FAIR data principles, enabling more of the organization to work with data without depending on our analytics teams.
Evaluation Candidates
Atlas | Amundsen | DataHub | OpenMetadata
WMF Functional Requirements
For a solution to be considered complete, it must support us in achieving the following functional requirements, now or in the future. These requirements form the basis of our evaluation. To carry it out, we plan to run a timeboxed MVP that tests a solution and establishes how many of the requirements it can meet.
| Area | Key Functionality Requirement | Description |
| --- | --- | --- |
| Ingestion | Integration with underlying data stores to import metadata through data connectors | The data catalog should be able to ingest structured or semi-structured metadata, and must support ingesting metadata from the entire organization. |
| Ingestion | Ability to connect to the catalog via API for integration with automated processes and applications (a sketch follows this table) | The data catalog should support automated discovery and ingestion of datasets, both for the initial catalog build and for ongoing discovery of new datasets. |
| Ingestion | Track data lineage | Ability to trace data from the original source through analysis and reporting processes. |
| Ingestion | Track data usage | Should collect information about each dataset, including: who has used it, for what use cases, how frequently, and with what other datasets it is typically combined. |
| Ingestion | Track metadata changes across dataset versions | Track changes and provide a version history for any dataset included in the catalog. |
| Usability | Dataset evaluation | Add annotations, create custom metadata fields, add search terms and tags, identify stewards and SMEs, and tag security- and compliance-sensitive data fields. |
| Usability | Dataset visibility | The ability to manually add, hide, or remove datasets. |
| Usability | Capture user feedback | Enable social capabilities such as org-sourcing of metadata, sharing features, posting of user ratings and reviews, and capture of user feedback. |
| Usability | Ability to search for datasets | Robust search capabilities, including search by facets, keywords, and business terms. Natural language search is especially valuable for non-technical users; ranking results by relevance and by frequency of use is particularly beneficial. |
| Usability | Interface usability | Capabilities to preview a dataset, view data profiles, see user ratings, read user reviews and curator annotations, and view data quality information. |
| Security | Dataset access management | Access controls should be enforceable at the dataset level, record/row level, column/field level, and by value. |
| Security | Fine-grained ACLs (access control lists) for catalog metadata access, including data masking | User security should at minimum distinguish between administrative users, analytic users, and data stewards, each with their own security profile. |
| Security | Ability to provide public access to discover all datasets | Public datasets are useful to our technical community members who build critical infrastructure, and to the broader research network outside WMF; they are a critical part of the free knowledge ecosystem. Public information about private datasets helps delineate which projects can happen without formal WMF collaboration. |
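The API-integration requirement above is easiest to picture with a concrete example. Below is a minimal sketch of registering a dataset programmatically using DataHub's Python emitter (the solution we ultimately chose for the MVP); the server URL, dataset name, and properties are illustrative placeholders, not our production values.

```python
# Minimal sketch: push dataset metadata to a DataHub instance over its REST API.
# Assumes the acryl-datahub Python package; the URL and names are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Point the emitter at the catalog's metadata service (placeholder URL).
emitter = DatahubRestEmitter(gms_server="http://datahub-gms.example.org:8080")

# Build a URN identifying a Hive table (hypothetical dataset name).
dataset_urn = make_dataset_urn(platform="hive", name="wmf.webrequest", env="PROD")

# Attach human-readable documentation as a dataset-properties aspect.
properties = DatasetPropertiesClass(
    description="Raw webrequest logs ingested from Kafka (example description).",
    customProperties={"steward": "data-platform"},
)

# Wrap the aspect in a metadata change proposal and send it to the catalog.
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties))
```

The same pattern works for any automated process that creates or modifies datasets, which is what makes ongoing discovery, rather than a one-time catalog build, feasible.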
MVP Goals
Scope: Deploy a data catalog solution cataloging Hive datasets and Kafka datasets streamed through the Event Platform
Functional Requirements
- [Primary] Searching and filtering options that allow users to quickly find relevant datasets for analytics or data engineering requirements.
- [Extended] Provide a way for subject matter experts to contribute business knowledge, e.g. glossary entries, tags, associations, user-defined annotations, classifications, ratings, etc.
Technical Requirements
- [Primary] Have the complete Hive Metastore imported into the data catalog (a sketch of an ingestion recipe follows this list).
- [Extended] Event Platform schemas and streams imported into the data catalog.
- [Stretch] Airflow lineage included.
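As one way to picture the primary technical requirement, here is a minimal sketch of a DataHub ingestion recipe that pulls metadata from a Hive metastore and pushes it into the catalog, run through DataHub's Python ingestion framework. Hostnames are placeholders and the available source options depend on the connector version, so treat this as an illustration rather than our deployed configuration.

```python
# Minimal sketch: run a DataHub ingestion recipe from Python.
# Hostnames/ports are placeholders; in practice recipes are usually YAML files
# executed with the `datahub ingest` CLI on a schedule.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        # Source: the Hive connector, which reads table and column metadata.
        "source": {
            "type": "hive",
            "config": {"host_port": "hive.example.org:10000"},
        },
        # Sink: push the extracted metadata to the DataHub metadata service.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms.example.org:8080"},
        },
    }
)

pipeline.run()                # execute the ingestion
pipeline.raise_from_status()  # fail loudly if anything went wrong
```

Swapping the source for DataHub's Kafka connector would cover the extended goal of importing Event Platform schemas and streams, and scheduling recipes like this (e.g. from Airflow) is what keeps the catalog in sync on an ongoing basis.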
| Milestone | Details |
| --- | --- |
| Complete feature matrix | https://phabricator.wikimedia.org/T299887 |
| Plan for productionising complete | https://phabricator.wikimedia.org/T299888 |
| Have the selected solution deployed and connect one dataset to it | https://phabricator.wikimedia.org/T299897 |
| Connect remaining data stores and test required functionality | https://phabricator.wikimedia.org/T299899 |
| Demo solution | https://phabricator.wikimedia.org/T299910 |
Technical MVP Evaluation
Implementation Considerations

| Requirement | Details | Atlas | DataHub | OpenMetadata | Amundsen |
| --- | --- | --- | --- | --- | --- |
| Sync Hive | How often changes can be synced | Continuous | Continuous | Almost daily | Every 2 hours |
| Sync Airflow | How often changes can be pushed | Continuous | Continuous | Continuous | Unknown |
| Automated classifier | Ingestion and changes to the state are automatically synced | Yes* | Limited | No | Yes* |
| Months to productionize | Time until ready to run as a T2 service | 9 to 12 | 4 to 6 | 4 to 6 | 6 to 8 |
| Community activity | How quickly the community responds today | Inactive | Active | Active | Uncertain |
Search Capabilities

| Search by | Atlas | DataHub | OpenMetadata | Amundsen |
| --- | --- | --- | --- | --- |
| Imported metadata fields | Yes* | Yes | Yes | Yes |
| System fields (e.g. classifiers) | Yes* | Yes | Yes | Yes |
| Description text | Yes* | Yes | Yes | Yes |
| Popularity, rating, etc. | Yes* | Yes | Yes | Yes |
Possible from a GUI

| Requirement | Atlas | DataHub | OpenMetadata | Amundsen |
| --- | --- | --- | --- | --- |
| Manage stewardship | Yes* | Yes | Yes | Yes |
| Report quality issue | Yes* | No*** | Yes | No? |
| See quality in lineage | Yes* | Limited | Planned | Yes |
| See classifiers in lineage | Yes* | Limited | Yes | Yes |
| See dashboards in lineage | Yes* | Yes | Yes | Yes |
| Glossary: use and update | Yes* | Yes | Planned | Via Atlas |
| Superset integration | Limited* | Yes | Yes | Yes |
MVP Stretch Goals Features

| Requirement | Atlas | DataHub | OpenMetadata | Amundsen |
| --- | --- | --- | --- | --- |
| Metadata ingestion: MySQL | No | Yes | Yes | Yes |
| Metadata ingestion: Hive metastore, Kafka topic metadata, Druid, Cassandra, dashboard metadata | Limited | Yes | Yes | Yes |
| Column-level lineage | Planned | Planned | Yes | |
| Any access-related requirement | Yes* | Yes** | Planned | Via Atlas |
*Not supported with our current stack | **Supports LDAP; fine-grained access control is on the roadmap | ***Coming soon
Notes on the Candidates
We chose DataHub for our MVP because it fit best in our current environment: we already run OpenSearch deployments, a MariaDB cluster it is compatible with, Kafka, and so on. Ingestion for the metadata we care about was easy and flexible. We also like that Kafka holds everything together, since we have to allow public access to the catalog in some way. The main hesitation with DataHub is around the pieces of the LinkedIn / Confluent ecosystem that we are not using. Pegasus is used internally for schemas, and we shouldn't have to interface with it, but JSON Schema would have been easier. Confluent Schema Registry or Karapace is now a dependency, and ideally we wouldn't have to set those up; there is an open question with the DataHub community as to whether they can eliminate the dependency.
Amundsen is a great candidate: a simple set of Python services provides all the functionality and makes deployment straightforward. Ultimately, the sources we want to ingest metadata from were just a bit harder to configure. But there is a lot of great UX in Amundsen that is worth revisiting, from social features to features connected with good data governance practice. There is a sense in Amundsen that people experienced with data governance are steering the product in the right direction. This evaluation taught us a lot, and we are thankful for the valuable content we found, like the great write-up on snapshot extraction vs. data extraction, and on how data catalogs tend to fail or succeed.
OpenMetadata is a really good solution, very easy to deploy with or without Docker. Ultimately it just didn't fit as well in our environment: it needs MySQL 8+ and uses features that MariaDB 10.4 does not have, so we would have to set up a separate cluster and support it ourselves (we are a very small team doing much more than the data catalog). Hive ingestion is getting better, but not quite ready for our use case. There are lots of good things to revisit here: reliance on simpler open standards like JSON Schema, an amazing and responsive community that fixed issues as fast as we reported them, a great UI with a great user experience, and clearly an eye towards a simpler data governance solution.
Honestly, we had high hopes for Atlas, but the community seems mostly unresponsive and Atlas is not backwards compatible with the version of Hive we use, so the hurdles were too big.
Some more notes from this evaluation process, and other candidates we looked at, are available at Data_Catalog_Application_Evaluation/Rubric.