2021 data catalog selection
Data Catalog Evaluation
What is a Data Catalog?
A data catalog is an inventory of data asset metadata that allows data consumers to discover and evaluate data for analytical and product uses. Data catalogs address findability, accessibility, interoperability, and reusability (the four FAIR data principles), which become critical bottlenecks in data management if left unaddressed. A catalog is also a valuable tool for data governance and data management because it provides an interface for data definitions, provenance, and access control.
Our data lake has served as the primary repository of data stored in its raw and processed formats at the foundation. It has enabled us to store and analyze the vast amounts of data that result from our users interacting with our projects. However, simply centralizing and storing our data in the data lake has not solved our critical data management challenges: findability, accessibility, interoperability, and re-use. Currently we try to meet these needs with dataset documentation on Wikitech and metadata descriptions in schemas and Hive tables. This has become costly to maintain and is ultimately insufficient for enabling the FAIR data principles as we scale our data practices.
Interest in the data collected by our systems has been growing dramatically in the past few years with the introduction of new features and an increasing focus on evidencing our decisions using data. This has increased the urgency for an enhanced set of data management tools to address these challenges. One such tool that we are investigating is a Data Catalog.
By successfully implementing and integrating a catalog solution as part of our data management strategy, we would bring our data ecosystem more in line with the FAIR data principles, which would let more of the organization work with data without relying on our analytics teams.
WMF Functional Requirements
For a solution to be considered complete, it needs to support the following functional requirements, now or in the future. These requirements form the basis of our evaluation: each solution is assessed on how well it meets them. To do this, we plan to run a timeboxed MVP that tests a solution and establishes how many of the requirements it can meet.
|Functional Area||Key Functionality Requirement||Description|
|Ingestion||Integration with underlying data stores to import metadata through data connectors||The Data Catalog should be able to ingest structured or semi-structured metadata, and must support ingesting metadata from the entire organization.|
|Ingestion||Ability to connect to the catalog via API for integration with automated processes and applications||The Data Catalog should support automated discovery and ingestion of datasets, both for the initial catalog build and for ongoing discovery of new datasets.|
|Ingestion||Track data lineage||Ability to trace data from the original source through analysis and reporting processes.|
|Ingestion||Track data usage||Should support the ability to collect information about each dataset, including: Who has used the dataset? For what use cases has it been used? How frequently is it used? With what other datasets is it typically used or combined?|
|Ingestion||Track metadata changes across dataset versions||Track the changes and provide a version history of any dataset included in the catalog.|
|Usability||Dataset Evaluation||Add annotations, create custom metadata fields, add search terms and tags, identify stewards and SMEs, tag security and compliance sensitive data fields.|
|Usability||Dataset Visibility||The ability to manually add, hide or remove datasets.|
|Usability||Capture User Feedback||Enable social capabilities such as Org-Sourcing of metadata, sharing features, posting of user ratings and reviews, and capture of user feedback.|
|Usability||Ability to search for datasets||Robust search capabilities include search by facets, keywords, and business terms. Natural language search is especially valuable for non-technical users. Ranking search results by relevance and by frequency of use is particularly useful.|
|Usability||Interface Usability||Include capabilities to preview a dataset, view data profiles, see user ratings, read user reviews and curator annotations, and view data quality information.|
|Security||Dataset access management||Access control should be enforced at the dataset level, record/row level, column/field level, and by value.|
|Security||Fine grained ACL (access control lists) for catalog metadata access including data masking||User security should, at minimum, distinguish between administrative users, analytic users, and data stewards, each of which should have its own security profile.|
|Security||Ability to provide public access to discover all datasets||Public datasets are useful to our technical community members who build critical infrastructure. Public datasets are also useful to the broader research network outside WMF, and a critical part of the free knowledge ecosystem. Public information about private datasets is useful in delineating what projects can happen without formal WMF collaboration.|
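As an illustration of the lineage requirement in the table above (tracing data from its original source through analysis and reporting), here is a minimal Python sketch. It assumes lineage is stored as a mapping from each dataset to the datasets it is derived from. The dataset names and the `trace_to_sources` helper are hypothetical and do not describe any particular catalog's API.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets it is derived from.
UPSTREAM = {
    "dashboard_pageviews": ["pageviews_daily"],
    "pageviews_daily": ["pageviews_hourly"],
    "pageviews_hourly": ["webrequest_raw"],
    "webrequest_raw": [],
}


def trace_to_sources(dataset, upstream):
    """Return every upstream dataset reachable from `dataset`, nearest first."""
    seen = set()
    order = []
    queue = deque(upstream.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        order.append(current)
        queue.extend(upstream.get(current, []))
    return order


lineage = trace_to_sources("dashboard_pageviews", UPSTREAM)
print(lineage)
```

A catalog that tracks lineage answers the same kind of question for real datasets: given a dashboard or report, which tables and raw sources does it ultimately depend on?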
Scope: Deploy a data catalog solution cataloging Hive datasets and Kafka datasets streamed through the Event Platform
[Primary] Searching and filtering options to allow users to quickly find relevant sets of data for analytics or data engineering requirements.
[Extended] Provide a way for subject matter experts to contribute business knowledge, e.g., glossary, tags, associations, user-defined annotations, classifications, ratings, etc.
[Primary] Have the complete Hive Metastore imported into the Data Catalog
[Extended] Event Platform Schemas and Streams imported into the Data Catalog
[Stretch] Airflow lineage included
|Complete feature matrix||https://phabricator.wikimedia.org/T299887|
|Plan for Productionising Complete.||https://phabricator.wikimedia.org/T299888|
|Have the selected solution deployed and connect one dataset to it.||https://phabricator.wikimedia.org/T299897|
|Connect remaining data stores and test required functionality||https://phabricator.wikimedia.org/T299899|
Technical MVP Evaluation
|Sync Hive||How often can changes get synced||Continuous||Continuous||Almost daily||Every 2 hours|
|Sync Airflow||How often can changes get pushed||Continuous||Continuous||Continuous||Unknown|
|Automated Classifier||Ingestion and changes to the state are automatically synced||Yes*||Limited||No||Yes*|
|Months to productionize||Ready to T2 Service||9 to 12||4 to 6||4 to 6||6 to 8|
|Community now||How quickly the community responds||Inactive||Active||Active||Uncertain|
|Imported metadata fields||Yes*||Yes||Yes||Yes|
|System (eg: classifiers)||Yes*||Yes||Yes||Yes|
|Popularity, rating, etc.||Yes*||Yes||Yes||Yes|
|Possible from a GUI|
|Report quality issue||Yes*||No***||Yes||No?|
|See Quality in Lineage||Yes*||Limited||Planned||Yes|
|See Classifiers in Lineage||Yes*||Limited||Yes||Yes|
|See Dashboards in Lineage||Yes*||Yes||Yes||Yes|
|Glossary: use and update||Yes*||Yes||Planned||Atlas|
|MVP Stretch Goals Features|
|Metadata Ingestion: MySQL||No||Yes||Yes||Yes|
|Metadata Ingestion: Hive metastore, Kafka topic metadata, Druid, Cassandra, dashboard metadata||Limited||Yes||Yes||Yes|
|Any access-related requirement||Yes*||Yes**||Planned||Atlas|
|*Not Supported with our current stack||**Supports LDAP, fine grained on roadmap||***Coming Soon|
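One possible way to compare the columns of a feature matrix like the one above is to turn each cell into a number and sum per column. The sketch below is only an illustration and is not the method used in this evaluation. The weights are assumptions, and the columns are referred to by position because the table does not name them. The two sample rows are copied from the matrix.

```python
# Assumed weights for the cell values used in the matrix (illustrative only).
WEIGHTS = {"Yes": 1.0, "Limited": 0.5, "Planned": 0.25, "No": 0.0}


def cell_score(value):
    """Score one cell, taking the footnote markers into account."""
    stars = len(value) - len(value.rstrip("*"))
    if stars == 1:
        # A single "*" means "not supported with our current stack".
        return 0.0
    base = value.rstrip("*?")
    # Values outside the assumed vocabulary (e.g. "Atlas") score zero here.
    return WEIGHTS.get(base, 0.0)


def column_totals(rows):
    """Sum the scores per column position across all rows."""
    totals = [0.0] * len(next(iter(rows.values())))
    for cells in rows.values():
        for i, value in enumerate(cells):
            totals[i] += cell_score(value)
    return totals


# Two rows taken from the matrix; columns are in table order.
ROWS = {
    "Report quality issue": ["Yes*", "No***", "Yes", "No?"],
    "See Quality in Lineage": ["Yes*", "Limited", "Planned", "Yes"],
}

totals = column_totals(ROWS)
print(totals)
```

A tally like this is only a rough aid. The weights encode a judgment about how much "Planned" or "Limited" support is worth, and qualitative factors such as community responsiveness or months to productionize do not fit a simple sum.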
Notes on the Candidates
We chose DataHub for our MVP because it fit best in our current environment. We already have OpenSearch deployments, a MariaDB cluster it’s compatible with, a Kafka deployment, and so on. Ingestion for the metadata we care about was easy and flexible. We like that Kafka holds everything together because we have to allow public access to the catalog in some way. The main hesitation with DataHub is around the pieces of the LinkedIn / Confluent ecosystem that we are not using. Pegasus is used internally for schemas, and we shouldn’t have to interface with it, but JSON Schema would’ve been easier. Confluent Schema Registry or Karapace are dependencies now, and ideally we wouldn’t have to set those up, but there’s an open question with the DataHub community as to whether they can eliminate the dependency.
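To give a sense of what metadata ingestion configuration looks like in DataHub, here is a minimal sketch of an ingestion recipe for a Hive source, written as a plain Python dict. The host names and ports are placeholders, `build_hive_recipe` is a hypothetical helper, and this is not the configuration used in the MVP. The `hive` source type, its `host_port` option, and the `datahub-rest` sink with its `server` option follow DataHub's documented recipe format.

```python
import json


def build_hive_recipe(hive_host_port, datahub_server):
    """Build a DataHub-style ingestion recipe as a plain dict (hypothetical helper)."""
    return {
        # Where the metadata comes from: a HiveServer2 endpoint.
        "source": {
            "type": "hive",
            "config": {"host_port": hive_host_port},
        },
        # Where the metadata is sent: the DataHub REST endpoint.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": datahub_server},
        },
    }


# Placeholder endpoints, not real hosts.
recipe = build_hive_recipe(
    "hive.example.org:10000",
    "http://datahub-gms.example.org:8080",
)
print(json.dumps(recipe, indent=2))
```

The same structure is normally written as a YAML file and run with the `datahub ingest` command. Other sources, such as Kafka, follow the same source/sink shape with a different `type` and `config`.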
Amundsen is a great candidate: a simple set of Python services provides all the functionality and makes deployment simple. Ultimately, the sources we want to ingest metadata from were just a bit harder to configure. But there’s a lot of great UX in Amundsen that’s worth revisiting, from social features to features connected with good data governance practice. There’s a sense in Amundsen that folks experienced with data governance are steering the product in the right direction. This evaluation taught us so much, and we’re thankful for all the valuable content we found, like this great write-up on snapshot extraction vs. data extraction, and how data catalogs tend to fail or succeed.
A really good solution, very easy to deploy with or without Docker. Ultimately it just didn’t fit as well in our environment. It needs MySQL 8+ and uses features that we don’t have in MariaDB 10.4, so we would have to set up a separate cluster and support it ourselves (we’re a very small team doing much more than the data catalog). Hive ingestion is getting better, but it is not quite ready for our use case. There are lots of good things to revisit here: reliance on simpler open standards like JSON Schema, an amazing and responsive community that fixed issues as fast as we reported them, a great UI with a great user experience, and clearly an eye towards a simpler data governance solution.
Honestly we had high hopes for Atlas, but the community seems mostly unresponsive and they have no backwards compatibility with the version of Hive we use, so the hurdles were too big.
Some more notes from this evaluation process, and on other candidates we looked at, are available here: Data_Catalog_Application_Evaluation/Rubric