Data Engineering/Systems/DataHub/Data Catalog Documentation Guide
This page is currently a draft.
The DataHub data catalog is the central repository of metadata about datasets stored in the systems operated by WMF. It is currently available behind a login to all users in the wmf or nda LDAP groups. These users have permission to search the datasets and contribute to the documentation.
The guidelines outlined here define standards that promote high-quality, consistent documentation. They are an extension of the official MediaWiki Documentation Style Guide; the section about language in particular applies.
Data Catalog Core Entities (Data Assets)
- Datasets - we currently integrate only with Druid, Kafka and Hive
- Charts - a single visualization derived from a dataset. Imported from Superset
- Dashboards - a collection of charts for visualization. Imported from Superset
- Data Task - an executable job that processes data assets, where "processing" implies consuming data, producing data, or both.
- Data Pipeline - an executable collection of Data Tasks with dependencies among them
Please note: we encourage users to only work with datasets for the time being. We are still in the process of determining how best to import and use the dashboard and chart metadata from Superset, as well as data pipeline information.
Types of Metadata
- Dataset level documentation
- Data schema/field descriptions
- Tags - informal, loosely controlled labels that help in search and discovery. They can be added to datasets, dataset schemas, or containers as an easy way to label or categorize entities, without having to associate them with a broader business glossary or vocabulary
- Business Glossary
- Terms: words or phrases with a specific Wikimedia definition assigned to them
- Term Groups: act like folders, containing Terms and even other Term Groups to allow for a nested structure
Ownership
- Use team assignments where possible
- Assign Business and Technical Owners, leaving the Data Steward field open until we formally establish this role as part of the data governance program.
- When deciding ownership, use the following guidelines:
- Business Owners
- A person or group who is responsible for logical, or business related, aspects of the asset.
- Until we officially define the data stewardship role, this role will cover the responsibilities usually associated with Data Stewards:
- Provides subject matter expertise
- Manages the metadata
- Defines data quality expectations
- Communicates about data-related issues and provides guidance and support as needed
- Makes decisions about the dataset, including data lifecycle decisions
- Ensures that the data adheres to the data policies and meets the SLOs
- Technical Owner
- A person or group who is responsible for technical aspects of the asset.
- Data Steward - as stated above, leave unassigned until the role is formally established
Domains
- The list of domains is still TBD
Tags
- Use all lowercase letters
- Use a single word where possible, and keep every tag to fewer than three words
- Use spaces to separate words in multi-word tags
- Use tags to define broad data categories
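The tag naming rules above are simple enough to check mechanically before a tag is created. The following is a minimal sketch, not part of DataHub or any WMF tooling: the function name `tag_problems` and the `MAX_WORDS` constant are hypothetical, and treating hyphens and underscores as disallowed word separators is an assumption drawn from the "use spaces" rule.

```python
import re

# "Fewer than three words" means at most two.
MAX_WORDS = 2


def tag_problems(tag: str) -> list[str]:
    """Return the naming rules that a proposed tag breaks (empty if none)."""
    problems = []
    if tag != tag.lower():
        problems.append("use all lowercase letters")
    # Assumption: hyphens and underscores count as word separators,
    # which the guide asks to replace with spaces.
    if re.search(r"[-_]", tag):
        problems.append("separate words with spaces")
    if len(tag.split()) > MAX_WORDS:
        problems.append("keep the tag to fewer than three words")
    return problems


if __name__ == "__main__":
    for candidate in ["traffic", "Traffic", "user_agent", "daily active editors"]:
        print(candidate, "->", tag_problems(candidate) or "ok")
```

A check like this could run in a review script or notebook. It only covers the mechanical rules; whether a tag describes a broad data category still needs human judgment.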
Glossary Terms and Groupings
- Use title capitalization as per WMF guidelines
- Start the description of the term with one or two sentences that are clear enough for any reader to understand the defined term
- Avoid using jargon and do not assume any prior knowledge
- Link to other related glossary terms when appropriate
- Provide links to additional information that is available on one of the wikis
Dataset Schema Documentation (Fields)
- Provide a description of the field that defines it in simplified terms
- Use a sentence fragment. If more than one sentence is needed to fully define a field, add a full stop after the first phrase and continue with full sentences. E.g. “Number of editors who had 5 or more content edits”, “MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target.”
- Start with a noun that describes the data stored in the field. Do not begin with the article “the” or superfluous wording such as “This data field”.
- Start the description with a capital letter
- Do not use a full stop at the end of the description
- Provide example field values where applicable
- Provide information about units, data format (e.g. date and time, spatial) or international standards used (TBD whether tags should be used to designate standards)
- Use the “Add glossary” feature to attach a glossary term to any field that corresponds to one. When a suitable glossary term is not available, please consider adding one.
- Use tags to categorize a field, e.g. to designate it as PII (personally identifiable information)
- Add non-trivial information about the field (e.g. computation, evolution and known issues) to the dataset documentation, not to the field description
Dataset Documentation
- Use full sentences for the dataset documentation
- Start the description of the dataset with a high-level summary of one to three sentences. The summary should be clear enough for any reader to understand.
- Add more information in additional sections
- Use sentence capitalization for section headings
- Provide information about the dataset evolution and known issues
- Provide links to external references - e.g. code, sample queries, list of known quality issues.
Terminology
- Aggregated vs pre-aggregated data - use the term “aggregated”
- Pageview vs page view - use “pageview”
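Several of the field description rules in this guide, along with the two terminology preferences, can be checked automatically. The sketch below is illustrative only: `description_problems` and `PREFERRED_TERMS` are hypothetical names, not part of DataHub. It also makes one assumption: the "no full stop at the end" rule is applied only to single-fragment descriptions, because the guide's own multi-sentence example ends with a full stop.

```python
import re

# Discouraged spelling (regex) -> preferred term from this guide.
PREFERRED_TERMS = {
    r"\bpre-?aggregated\b": "aggregated",
    r"\bpage views?\b": "pageview",
}


def description_problems(description: str) -> list[str]:
    """Return the style rules that a field description breaks (empty if none)."""
    text = description.strip()
    if not text:
        return ["description is empty"]
    problems = []
    if text[0].isalpha() and not text[0].isupper():
        problems.append("start with a capital letter")
    if re.match(r"(the|this data field|this field)\b", text, re.IGNORECASE):
        problems.append('do not start with "the" or "This data field"')
    # Assumption: only single-fragment descriptions must omit the final full stop.
    if text.endswith(".") and ". " not in text:
        problems.append("do not end with a full stop")
    for pattern, preferred in PREFERRED_TERMS.items():
        if re.search(pattern, text, re.IGNORECASE):
            problems.append(f'use "{preferred}"')
    return problems


if __name__ == "__main__":
    samples = [
        "Number of editors who had 5 or more content edits",
        "the number of page views.",
    ]
    for sample in samples:
        print(repr(sample), "->", description_problems(sample) or "ok")
```

Rules that need judgment, such as starting with a noun, giving example values, or stating units, are left to the author and reviewer.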