Data Engineering/Evaluations/2021 data catalog selection


Data Catalog Evaluation

What is a Data Catalog?

A data catalog is an inventory of data asset metadata that allows data consumers to discover and evaluate data for analytics and product uses. Data catalogs address findability, accessibility, interoperability, and reusability – the four FAIR data principles – which become critical bottlenecks in data management if left unaddressed. A catalog is also a valuable tool for data governance and data management, providing an interface for data definitions, provenance, and access control.

Problem Statement

Our data lake has served as the Foundation's primary repository of data in its raw and processed forms. It has enabled us to store and analyze the vast amounts of data generated by users interacting with our projects. However, simply centralizing and storing our data in the data lake has not solved our critical data management challenges: findability, accessibility, interoperability, and reuse. Currently we try to meet these needs with dataset documentation on Wikitech and metadata descriptions in schemas and in Hive tables. This has become costly to maintain and is ultimately insufficient for upholding the FAIR data principles as our data practices scale.

Interest in the data collected by our systems has grown dramatically in the past few years, driven by the introduction of new features and an increasing focus on grounding our decisions in data. This has increased the urgency for an enhanced set of data management tools to address these challenges. One such tool that we are investigating is a data catalog.

Impact Hypothesis

By successfully implementing and integrating a catalog solution as part of our data management strategy, we would bring our data ecosystem more in line with the FAIR data principles, enabling more of the organization to be less reliant on our analytics teams.

Evaluation Candidates

Atlas
Amundsen
DataHub
OpenMetadata


WMF Functional Requirements

For a solution to be considered complete, it needs to support the following functional requirements, now or in the future. Solutions will be evaluated against these requirements, which form the basis of our evaluation. To carry it out, we plan to run a timeboxed MVP that tests a solution and establishes how many of the requirements it can meet.


| Category | Requirement | Description |
|---|---|---|
| Ingestion | Integration with underlying data stores to import metadata through data connectors | The data catalog should be able to ingest structured or semi-structured metadata, and must support ingesting metadata from across the entire organization. |
| Ingestion | Ability to connect to the catalog via API for integration with automated processes and applications | The data catalog should support automated discovery and ingestion of datasets, both for the initial catalog build and for ongoing discovery of new datasets. |
| Ingestion | Track data lineage | Ability to trace data from its original source through analysis and reporting processes. |
| Ingestion | Track data usage | Should support collecting information about each dataset, including: who has used it, for what use cases, how frequently, and with which other datasets it is typically combined. |
| Ingestion | Track metadata changes across dataset versions | Track changes and provide a version history for any dataset included in the catalog. |
| Usability | Dataset evaluation | Add annotations, create custom metadata fields, add search terms and tags, identify stewards and SMEs, and tag security- and compliance-sensitive data fields. |
| Usability | Dataset visibility | The ability to manually add, hide, or remove datasets. |
| Usability | Capture user feedback | Enable social capabilities such as org-sourcing of metadata, sharing features, posting of user ratings and reviews, and capture of user feedback. |
| Usability | Ability to search for datasets | Robust search capabilities, including search by facets, keywords, and business terms. Natural language search is especially valuable for non-technical users; ranking results by relevance and by frequency of use is particularly beneficial. |
| Usability | Interface usability | Capabilities to preview a dataset, view data profiles, see user ratings, read user reviews and curator annotations, and view data quality information. |
| Security | Dataset access management | Data access controls should be enforceable at the dataset, record/row, and column/field level, and by value. |
| Security | Fine-grained ACLs (access control lists) for catalog metadata access, including data masking | User security should at minimum distinguish between administrative users, analytic users, and data stewards, each with its own security profile. |
| Security | Ability to provide public access to discover all datasets | Public datasets are useful to our technical community members who build critical infrastructure, and to the broader research network outside WMF; they are a critical part of the free knowledge ecosystem. Public information about private datasets helps delineate which projects can happen without formal WMF collaboration. |
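
To make the API requirement above concrete: with DataHub (the solution we ultimately trialed for the MVP, see below), pushing metadata into the catalog programmatically looks roughly like the following. This is a minimal sketch assuming the acryl-datahub Python package; the server URL and table name are hypothetical.

```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Hypothetical catalog endpoint; in production this would be the DataHub GMS service.
emitter = DatahubRestEmitter(gms_server="http://datahub-gms.example.org:8080")

# Build a URN identifying a Hive table, then attach a human-readable description.
urn = make_dataset_urn(platform="hive", name="wmf.webrequest", env="PROD")
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=urn,
        aspect=DatasetPropertiesClass(description="Refined webrequest data, partitioned hourly."),
    )
)
```

The same emitter pattern covers automated discovery: any scheduled process that knows about new datasets can emit metadata change proposals as they appear.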

MVP Goals

Scope: Deploy a data catalog solution cataloging Hive datasets and Kafka datasets streamed through the Event Platform


Functional Requirements

[Primary] Searching and filtering options that let users quickly find relevant datasets for analytics or data engineering work (see the search sketch after this list).

[Extended] Provide a way for subject matter experts to contribute business knowledge, e.g. glossary entries, tags, associations, user-defined annotations, classifications, ratings, etc.
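
For the primary search requirement, here is a hedged sketch of a programmatic dataset search against DataHub's GraphQL API. The endpoint and query shape follow the DataHub documentation of this era; field names may differ by version, and the host is hypothetical.

```python
import requests

GRAPHQL_URL = "http://datahub.example.org:8080/api/graphql"  # hypothetical host

# Full-text search for datasets matching "webrequest".
query = """
query searchDatasets($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""
variables = {"input": {"type": "DATASET", "query": "webrequest", "start": 0, "count": 10}}

resp = requests.post(GRAPHQL_URL, json={"query": query, "variables": variables})
resp.raise_for_status()
for hit in resp.json()["data"]["search"]["searchResults"]:
    print(hit["entity"]["urn"])
```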

Technical Requirements:

[Primary] Have the complete Hive Metastore imported into the Data Catalog (see the ingestion sketch after this list)

[Extended] Event Platform Schemas and Streams imported into the Data Catalog

[Stretch] Airflow lineage included
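
As a sketch of what the primary requirement involves in practice, DataHub's ingestion framework can run a Hive-to-catalog sync from a small recipe, either as YAML or programmatically. This assumes the acryl-datahub package with the hive source installed; the host names are hypothetical.

```python
from datahub.ingestion.run.pipeline import Pipeline

# Recipe equivalent to a YAML ingestion config: read table metadata from Hive
# (via HiveServer2) and push it to the catalog's REST endpoint.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {"host_port": "analytics-hive.example.org:10000"},  # hypothetical
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms.example.org:8080"},  # hypothetical
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if any source/sink errors occurred
```

Scheduling this recipe (e.g. from Airflow) is what the "Sync Hive" rows in the evaluation tables below measure.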

Milestones:


| Milestone | Details |
|---|---|
| Complete feature matrix | https://phabricator.wikimedia.org/T299887 |
| Plan for productionising complete | https://phabricator.wikimedia.org/T299888 |
| Have the selected solution deployed and connect one dataset to it | https://phabricator.wikimedia.org/T299897 |
| Connect remaining data stores and test required functionality | https://phabricator.wikimedia.org/T299899 |
| Demo solution | https://phabricator.wikimedia.org/T299910 |



Technical MVP Evaluation

Implementation Considerations

| Requirement | Description | Atlas | DataHub | OpenMetadata | Amundsen |
|---|---|---|---|---|---|
| Sync Hive | How often changes can be synced | Continuous | Continuous | Almost daily | Every 2 hours |
| Sync Airflow | How often changes can be pushed | Continuous | Continuous | Continuous | Unknown |
| Automated classifiers | Ingestion and changes to the state are automatically synced | Yes* | Limited | No | Yes* |
| Months to productionize | Ready to run as a T2 service | 9 to 12 | 4 to 6 | 4 to 6 | 6 to 8 |
| Community | How quickly the community responds | Inactive | Active | Active | Uncertain |
Search Capabilities

| Requirement | Atlas | DataHub | OpenMetadata | Amundsen |
|---|---|---|---|---|
| Imported metadata fields | Yes* | Yes | Yes | Yes |
| System (e.g. classifiers) | Yes* | Yes | Yes | Yes |
| Description text | Yes* | Yes | Yes | Yes |
| Popularity, rating, etc. | Yes* | Yes | Yes | Yes |
Possible from a GUI

| Requirement | Atlas | DataHub | OpenMetadata | Amundsen |
|---|---|---|---|---|
| Manage stewardship | Yes* | Yes | Yes | Yes |
| Report quality issue | Yes* | No*** | Yes | No? |
| See quality in lineage | Yes* | Limited | Planned | Yes |
| See classifiers in lineage | Yes* | Limited | Yes | Yes |
| See dashboards in lineage | Yes* | Yes | Yes | Yes |
| Glossary: use and update | Yes* | Yes | Planned | Atlas |
| Superset integration | Limited* | Yes | Yes | Yes |
MVP Stretch Goal Features

| Requirement | Atlas | DataHub | OpenMetadata | Amundsen |
|---|---|---|---|---|
| Metadata ingestion: MySQL | No | Yes | Yes | Yes |
| Metadata ingestion: Hive metastore, Kafka topic metadata, Druid, Cassandra, dashboard metadata | Limited | Yes | Yes | Yes |
| Column-level lineage | Planned | Planned | Yes | |
| Any access-related requirement | Yes* | Yes** | Planned | Atlas |


* Not supported with our current stack
** Supports LDAP; fine-grained access control is on the roadmap
*** Coming soon
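
For the Airflow lineage stretch goal, the approach we tested with DataHub declares inlets and outlets on a task and lets a lineage backend emit them to the catalog. A hedged sketch, assuming the acryl-datahub[airflow] integration of this era with the lineage backend enabled in airflow.cfg; the DAG and dataset names are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_provider.entities import Dataset

# With DataHub's Airflow lineage backend configured, the inlets/outlets
# declared here are emitted to the catalog as lineage edges between datasets.
with DAG(
    dag_id="refine_webrequest",  # hypothetical DAG
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    refine = BashOperator(
        task_id="refine",
        bash_command="echo refine",
        inlets=[Dataset("hive", "wmf_raw.webrequest")],
        outlets=[Dataset("hive", "wmf.webrequest")],
    )
```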

Notes on the Candidates

DataHub

We chose DataHub for our MVP because it fit best in our current environment: we have OpenSearch deployments, a MariaDB cluster it is compatible with, Kafka already deployed, and so on. Ingestion for the metadata we care about was easy and flexible. We like that Kafka holds everything together, since we will have to allow public access to the catalog in some way. The main hesitation with DataHub is around the pieces of the LinkedIn / Confluent ecosystem that we are not otherwise using. Pegasus is used internally for schemas, and while we shouldn't have to interface with it directly, JSON Schema would have been easier. Confluent Schema Registry or Karapace is a dependency for now, and ideally we wouldn't have to set one up; there is an open question with the DataHub community as to whether the dependency can be eliminated.
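
To illustrate the Kafka-based wiring discussed above, here is a hedged sketch of emitting metadata through Kafka instead of REST, which is exactly where the schema registry dependency (Confluent Schema Registry or Karapace) comes in. It assumes the acryl-datahub package; the broker and registry hosts are hypothetical.

```python
from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import StatusClass

config = KafkaEmitterConfig.parse_obj(
    {
        "connection": {
            "bootstrap": "kafka.example.org:9092",  # hypothetical broker
            # The registry dependency discussed above (Karapace or Confluent).
            "schema_registry_url": "http://karapace.example.org:8081",
        }
    }
)
emitter = DatahubKafkaEmitter(config)

# Mark a dataset as present (not soft-deleted); any aspect can be sent this way.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="hive", name="wmf.webrequest", env="PROD"),
    aspect=StatusClass(removed=False),
)
emitter.emit(mcp, callback=lambda err, msg: print(err or "delivered"))
emitter.flush()  # block until the producer has delivered to Kafka
```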


Amundsen

Amundsen is a great candidate: a simple set of Python services provides all the functionality and makes deployment simple. Ultimately, the sources we want to ingest metadata from were just a bit harder to configure. But there is a lot of great UX in Amundsen that is worth revisiting, from social features to functionality connected with good data governance practice; there is a sense that folks experienced with data governance are steering the product in the right direction. This evaluation taught us a great deal, and we are thankful for the valuable content we found along the way, such as a great write-up on snapshot extraction vs. data extraction and on how data catalogs tend to fail or succeed.


OpenMetadata

A really good solution, and very easy to deploy with or without Docker. Ultimately it just didn't fit as well in our environment: it needs MySQL 8+ and uses features that are not available in MariaDB 10.4, so we would have to set up a separate cluster and support it ourselves (we are a very small team doing much more than the data catalog). Hive ingestion is getting better, but is not quite ready for our use case. There is a lot here worth revisiting: reliance on simple open standards like JSON Schema, an amazing and responsive community that fixed issues as fast as we reported them, a great UI with a great user experience, and a clear eye towards a simpler data governance solution.


Atlas

Honestly, we had high hopes for Atlas, but the community seems mostly unresponsive, and it has no backwards compatibility with the version of Hive we use, so the hurdles were too big.


More notes from this evaluation process, including other candidates we looked at, are available at Data_Catalog_Application_Evaluation/Rubric.