Data Platform/Dataset archiving and deletion
Motivation
Deprecating, archiving or deleting datasets helps maintain system efficiency, security and compliance as well as improves data relevance and accuracy.
Process
Identification and Review
- Candidates for deprecation can be nominated at any time. Whenever a team works in an area where a legacy dataset qualifies for deprecation, it should be flagged.
- As part of quarterly planning, the Data Engineering (DE) team, with input from data stewards and other stakeholders, identifies candidates for dataset deprecation.
- Criteria for candidates
- Expired TTL
- Stale data (no updates in the last x months)
- No downstream dependencies
- Expired ownership
- The review may also be triggered by deprecation of a legacy system, data migration needs, disk space constraints, or other factors such as usage or dependency metrics.
Pre-deprecation Preparations
- If a data steward exists for the candidate dataset, the steward confirms that it can be deprecated without adverse effects; otherwise the Data Engineering team performs the mandatory investigations.
- This involves:
- Impact Analysis
- Identifying and verifying downstream datasets, data pipelines, and dashboards derived from the dataset. Checks against dependencies are primarily made through Datahub, Airflow, GitHub and Notebook repositories.
- Phabricator ticket
- Creating a Phabricator ticket to document the reasons for deprecation, affected downstream assets, and tagging relevant stakeholders.
- Communication (2x)
- Posting the intent to deprecate on relevant Slack channels (#data-engineering-collab, #working-with-data) and data-related email lists to provide a grace period for feedback. This should be repeated at least once before the deprecation deadline.
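Part of the impact analysis, namely checking GitHub and notebook repositories for references, amounts to searching checked-out code for the dataset name. A minimal sketch, assuming the relevant repositories are already cloned under a local directory (the function name and layout are hypothetical):

```python
import os
import re

def find_references(root: str, dataset: str) -> list[str]:
    """Walk checked-out repositories under `root` and return the paths
    of files that mention `dataset` (a fully qualified table name)."""
    pattern = re.compile(re.escape(dataset))
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    if pattern.search(f.read()):
                        hits.append(path)
            except OSError:
                continue  # skip unreadable files
    return sorted(hits)
```

Lineage recorded in DataHub and DAG definitions in Airflow should still be checked through those systems themselves; a text search only catches references that live in code.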
Feedback Grace Period
- The grace period for feedback is 30 days. Multiple reminders should be sent before the deadline.
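The 30-day grace period and its reminders can be laid out as a small date calculation. The reminder offsets below are assumptions for illustration; the policy only requires that multiple reminders be sent.

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=30)
# Assumed reminder cadence; the policy does not prescribe specific offsets.
REMINDER_OFFSETS = (timedelta(days=14), timedelta(days=7), timedelta(days=1))

def deprecation_schedule(announced: date) -> dict:
    """Return the feedback deadline and reminder dates for an announcement."""
    deadline = announced + GRACE_PERIOD
    return {
        "deadline": deadline,
        "reminders": [deadline - off for off in REMINDER_OFFSETS],
    }

print(deprecation_schedule(date(2024, 3, 1)))
```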
Deprecation Execution
- Establish child tasks for deprecating associated dashboards, pipelines, and downstream datasets, and set a timeline for completion.
- After confirming all affected data usages have migrated or been deprecated, the data steward officially deprecates or archives the dataset.
Archiving vs. Deletion Decision
- Post-deprecation, the decision to archive or delete the dataset is made based on:
- The impact of keeping the dataset around (e.g., risks, disk space).
- The presence of subscribers or dependents.
- Core vs. non-core dataset considerations.
- Official channels (#data-engineering-collab and #working-with-data) are used for communications.
Archiving
- If archiving is chosen, the dataset is renamed by adding an "_archived" suffix, indicating it is no longer active but still accessible for historical reference or legal compliance.
- If technically possible, the dataset should be marked as read-only.
- The archived status is documented in DataHub.
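For a Hive table, the rename step corresponds to an `ALTER TABLE ... RENAME TO` statement. A sketch that builds the statement with the agreed suffix (the database and table names are hypothetical examples, and identifiers are assumed to be pre-validated):

```python
ARCHIVE_SUFFIX = "_archived"

def archive_statement(db: str, table: str) -> str:
    """Build the HiveQL statement that renames a table with the
    agreed "_archived" suffix."""
    if table.endswith(ARCHIVE_SUFFIX):
        raise ValueError(f"{table} already carries the archive suffix")
    return f"ALTER TABLE {db}.{table} RENAME TO {db}.{table}{ARCHIVE_SUFFIX}"

print(archive_statement("wmf", "legacy_pageviews"))
# ALTER TABLE wmf.legacy_pageviews RENAME TO wmf.legacy_pageviews_archived
```

The guard against double-suffixing keeps a repeated archiving run from producing names like `..._archived_archived`.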
Deletion
- All data, data definitions, and associated code are removed.
- The deletion is documented in Phabricator and communicated through established channels.