Data lifecycle management

From Wikitech

The data lifecycle refers to the sequence of stages that data goes through, from its initial creation or capture to its eventual archiving and possible deletion. These stages include:

  • Data Creation. Data is generated or collected from various sources.
  • Data Processing. Data is cleaned, sanitized, and organized for use.
  • Data Maintenance. This includes:
    • Data Versioning due to schema, semantic or calculation changes
    • Data Quality Monitoring
    • Addressing Data Quality Incidents
    • Storage: Data and metadata are stored in Hive or other datastores, and documented in the data glossary (i.e. DataHub)
  • Data Usage. Data is used for analysis, decision-making, reporting, research, etc.
  • Data Sharing and Distribution. Data is shared with or distributed to data consumers outside of WMF.
  • Data Deprecation. Data is marked for archiving, and should no longer be actively used.
  • Data Archiving. Where applicable, data that is no longer actively used is moved to storage where it can still be accessed if needed in future.
  • Data Deletion. Data that is no longer needed or has become obsolete is deleted. Data can also be partially deleted in compliance with the data retention policy, which includes deletion to redact and suppress sensitive data.
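
As an illustration only, the stages above can be modelled as a small state machine. The sketch below is not an existing WMF tool: the names `Stage`, `ALLOWED_TRANSITIONS`, and `advance` are hypothetical, and the transition table is an assumption, since this page lists the stages but does not prescribe which moves between them are permitted. It reflects only that archiving applies "where applicable", so a deprecated dataset may go straight to deletion.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages as listed on this page."""
    CREATION = "creation"
    PROCESSING = "processing"
    MAINTENANCE = "maintenance"
    USAGE = "usage"
    SHARING = "sharing"
    DEPRECATION = "deprecation"
    ARCHIVING = "archiving"
    DELETION = "deletion"


# Assumed transition table (hypothetical, for illustration only).
ALLOWED_TRANSITIONS = {
    Stage.CREATION: {Stage.PROCESSING},
    Stage.PROCESSING: {Stage.MAINTENANCE},
    Stage.MAINTENANCE: {Stage.USAGE, Stage.DEPRECATION},
    Stage.USAGE: {Stage.MAINTENANCE, Stage.SHARING, Stage.DEPRECATION},
    Stage.SHARING: {Stage.MAINTENANCE, Stage.USAGE, Stage.DEPRECATION},
    # Archiving is optional, so deprecated data may be deleted directly.
    Stage.DEPRECATION: {Stage.ARCHIVING, Stage.DELETION},
    Stage.ARCHIVING: {Stage.DELETION},
    Stage.DELETION: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage, or raise if the move is not in the table."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, `advance(Stage.DEPRECATION, Stage.DELETION)` succeeds, while `advance(Stage.DELETION, Stage.USAGE)` raises `ValueError`, because deleted data cannot return to active use under this assumed table.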

Understanding the data lifecycle and having a well-defined management process bring benefits in terms of:

  • Ensuring that data is created and managed according to FAIR principles
  • Quality control
  • Compliance, security and privacy
  • Efficiency in standardizing and streamlining data management activities
  • Ensuring that the business and technical data stewards are engaged to facilitate the right decisions, and that data lifecycle events are communicated to all affected parties to prevent adverse downstream effects

The following pages define the dataset lifecycle processes: