Data Engineering/Systems/DataHub/Data Catalog Documentation Guide

Tracked in Phabricator
Task T349103

The DataHub data catalog is the central repository of metadata about datasets stored in the systems operated by WMF. It is currently available behind a login to all users in the wmf or nda LDAP groups. These users have permission to search the datasets and contribute to the documentation.

The guidelines outlined here define standards to facilitate a high level of quality and consistency in the documentation. They are an extension of the official MediaWiki Documentation Style Guide. The section about language in particular applies.

Data Catalog Core Entities (Data Assets)

Datasets - we currently integrate only with Druid, Kafka and Hive
Charts - a single visualization derived from a dataset. Imported from Superset
Dashboards - a collection of charts for visualization. Imported from Superset
Data Task - an executable job that processes data assets, where "processing" implies consuming data, producing data, or both.
Data Pipeline - an executable collection of Data Tasks with dependencies among them

Please note: we encourage users to only work with datasets for the time being. We are still in the process of determining how best to import and use the dashboard and chart metadata from Superset, as well as data pipeline information.

Types of Metadata

Dataset level documentation
Data schema/field descriptions
Tags - Tags are informal, loosely controlled labels that help in search & discovery. They can be added to datasets, dataset schemas, or containers, for an easy way to label or categorize entities – without having to associate them to a broader business glossary or vocabulary (phab:T307710)
Business Glossary
- Terms: words or phrases with a specific Wikimedia definition assigned to them
- Term Groups: act like folders, containing Terms and even other Term Groups to allow for a nested structure
Owners
- Use team assignments where possible
- Assign Business and Technical Owners, leaving the Data Steward field open until we formally establish this role as part of the data governance program.
- When deciding ownership, use the following guidelines
  - Business Owners
    - A person or group who is responsible for logical, or business related, aspects of the asset.
    - Until we officially define the data stewardship role, this role will cover the responsibilities usually associated with Data Stewards:
      - Subject matter expertise
      - Responsible for managing the metadata
      - Defines data quality expectations
      - Communicates about data-related issues and provides guidance and support as needed
      - Responsible for making decisions about the dataset, including data lifecycle decisions
      - Ensures that the data is adhering to the data policies and is meeting the SLOs
  - Technical Owner
    - A person or group who is responsible for technical aspects of the asset.
  - Data Steward - as stated above leave unassigned until the role is formally established
Domain
- The list of domains is still TBD
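
The ownership guidance above can be illustrated with a minimal Python sketch. It is not part of DataHub or its API: the `DatasetOwnership` class, the `check_ownership` function and the team names are hypothetical and only model the three owner roles described in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetOwnership:
    """Hypothetical model of the owner roles described in this guide."""
    business_owners: list[str] = field(default_factory=list)
    technical_owners: list[str] = field(default_factory=list)
    data_stewards: list[str] = field(default_factory=list)


def check_ownership(ownership: DatasetOwnership) -> list[str]:
    """Return a list of problems; an empty list means the guidance is met."""
    problems = []
    if not ownership.business_owners:
        problems.append("Missing Business Owner")
    if not ownership.technical_owners:
        problems.append("Missing Technical Owner")
    # The Data Steward role stays unassigned until it is formally established.
    if ownership.data_stewards:
        problems.append("Data Steward should stay unassigned for now")
    return problems


# Team names below are placeholders, not real assignments.
ok = DatasetOwnership(
    business_owners=["Example Product Team"],
    technical_owners=["Example Engineering Team"],
)
print(check_ownership(ok))
```

Teams rather than individuals are used in the example, in line with the recommendation to use team assignments where possible.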

Documentation Style Guidelines

Tracked in Phabricator
Task T310229

Glossary Terms and Groupings

Use title capitalization as per WMF guidelines
Start the description of the term with one or two sentences that are clear enough for everyone reading them to understand the defined term
Avoid using jargon and do not assume any prior knowledge
Link to other related glossary terms when appropriate
Provide links to additional information that is available on one of the wikis

Dataset Schema Documentation (Fields)

Provide a description of the field that defines it in simplified terms
Use a sentence fragment. If more than one sentence is needed to fully define a field, add a full stop after the first phrase and continue in full sentences. E.g. “Number of editors who had 5 or more content edits”, “MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target.”
Start with a noun that describes the data stored in the field. Do not prefix with the article “the” or superfluous wording such as “This data field”.
Description should start with a capital letter
Do not use a full stop at the end of the description
Provide example field values where applicable
Provide information about units, data format (e.g. date & time, spatial) or international standards used (TBD whether tags should be used to designate standards)
Use the “Add glossary” feature to attach a glossary term to fields that can be linked to one. When a glossary term is not available, please consider adding one.
Use tags to categorize a field, e.g. designate as PII (personally identifiable information)
Add non-trivial information about the field (e.g. computation, evolution and known issues) to the dataset documentation, not to the field description
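
The mechanical rules for field descriptions can be sketched as a small checker. This is a minimal illustration, not an existing DataHub feature: the function name `check_field_description` and the list of filler prefixes are hypothetical. It also assumes that the rule against a trailing full stop applies only to single-fragment descriptions, since the multi-sentence example above ends with one.

```python
# Hypothetical list of openings that the guidelines discourage.
SUPERFLUOUS_PREFIXES = ("the ", "this data field", "this field")


def check_field_description(description: str) -> list[str]:
    """Return style problems found in a field description (empty if none)."""
    text = description.strip()
    if not text:
        return ["Description is empty"]

    problems = []
    if text[0].islower():
        problems.append("Should start with a capital letter")
    if text.lower().startswith(SUPERFLUOUS_PREFIXES):
        problems.append("Should start with a noun, not an article or filler wording")

    # Assumption: a description containing ". " is multi-sentence and may
    # end with a full stop; a single fragment may not.
    is_single_fragment = ". " not in text.rstrip(".")
    if is_single_fragment and text.endswith("."):
        problems.append("Single-fragment description should not end with a full stop")
    return problems


print(check_field_description("Number of editors who had 5 or more content edits"))
print(check_field_description("the number of edits."))
```

The first call reports no problems; the second reports all three. Rules that need judgment, such as whether the description is in simplified terms or whether example values are needed, are outside what a check like this can cover.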

Dataset Documentation

Use full sentences for the dataset documentation
Start the description of the dataset with a high level description, 1-3 sentences at most. The description should be clear enough so that everyone reading it can understand.
Add more information in additional sections
Use sentence capitalization for section headings
Provide information about the dataset evolution and known issues
Provide links to external references - e.g. code, sample queries, list of known quality issues.

Special Terms

Aggregated vs pre-aggregated data - use the term “aggregated”

Pageview vs page view - use “pageview”
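
The two term preferences above can be checked with a short Python sketch. The function name `find_discouraged_terms` and the patterns are hypothetical; matching the unhyphenated “preaggregated” and the plural “page views” is an assumption that goes slightly beyond the two rules as written.

```python
import re

# Each entry pairs a pattern for a discouraged spelling with the preferred term.
DISCOURAGED_TERMS = [
    (re.compile(r"\bpre-?aggregated\b", re.IGNORECASE), "aggregated"),
    (re.compile(r"\bpage\s+views?\b", re.IGNORECASE), "pageview"),
]


def find_discouraged_terms(text: str) -> list[tuple[str, str]]:
    """Return (found, preferred) pairs for each discouraged term in the text."""
    findings = []
    for pattern, preferred in DISCOURAGED_TERMS:
        for match in pattern.finditer(text):
            findings.append((match.group(0), preferred))
    return findings


print(find_discouraged_terms("Pre-aggregated page views per project"))
```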