Data incident management

The process outlined below covers data quality issues. If the data incident is a data breach (e.g., it involves unauthorized access to, disclosure of, or exfiltration of confidential data or personal information, as defined by the Privacy Policy), follow the Privacy and Security breach protocols and file a protected Phabricator security report instead.

Data quality issues are problems that result from data processing, such as missing, delayed, or incorrect data.

The data incident process progresses through the following stages:

  • Detect
  • Alert
  • Triage
  • Resolve
  • Review

Detect

A data issue is usually identified in the following ways:

  • Via an automated alerting mechanism. These alerts are triggered by data pipeline failures or automated data quality checks (see the sketch after this list).
  • A data consumer or end user may observe that the data is not available, contains unexpected values (including PII), is inconsistent with other findings, or deviates from the usual trends in a way that indicates a possible data issue.
  • A data steward or user may also detect an issue during a periodic review.
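
To make the first path concrete, an automated data quality check typically compares a fresh metric for a dataset against its recent history and raises an alert when the deviation is too large. The following is a minimal sketch in Python; the trailing-mean approach, the threshold, and all names are illustrative assumptions, not the actual production checks.

```python
from statistics import mean

# Hypothetical threshold; real checks are tuned per dataset.
MAX_RELATIVE_DEVIATION = 0.25  # alert if today's count is >25% off the trailing mean

def check_row_count(history: list[int], today: int) -> str | None:
    """Compare today's row count to the trailing mean of recent days.

    Returns an alert message if the deviation exceeds the threshold,
    or None if the data looks normal.
    """
    if not history or mean(history) == 0:
        return "No usable history; cannot validate today's count."
    baseline = mean(history)
    deviation = abs(today - baseline) / baseline
    if deviation > MAX_RELATIVE_DEVIATION:
        return (f"Row count {today} deviates {deviation:.0%} from the "
                f"trailing mean {baseline:.0f}: possible data issue.")
    return None

# Example: a sudden drop in daily rows triggers an alert.
alert = check_row_count(history=[1_020_000, 995_000, 1_010_000], today=620_000)
if alert:
    print(alert)  # in production this would feed the alert email / paging system
```

A check like this would typically run on a schedule per dataset, with the resulting alert feeding the email and Slack workflow described in the next section.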

Alert

  • If the alert was part of the automated data quality (DQ) alerting process, follow the OpsWeek protocol:
    • The OpsWeek engineer will review the issue within 2–12 hours of the alert, during regular business hours.
    • The OpsWeek engineer will respond to the alert email with next steps.
    • If the issue is related to a data pipeline and remediation steps are known, they will be applied and the resolution communicated via the alert email, and via the Slack #data-engineering-collab channel if the issue has affected end users.
    • If the issue cannot be handled by routine remediation steps, proceed to the steps below.
  • If there is a suspected issue but some ambiguity:
    • Post your findings and your question in the #data-engineering-collab Slack channel.
    • If the dataset has business and technical stewards, tag them in your message.
    • Tag Will Doran and Andreas Hoelzl.
    • If it is established that there is a likely data issue, follow the steps below to file a Phabricator bug report.
    • The Data Platform team will acknowledge the post and provide any pointers within the day, during regular business hours.
  • If there is a confirmed data issue, follow the process below:
    • File a Phabricator bug report. Provide as much information as possible, including the physical location of the data, initial findings, SQL statements used, and screenshots, as well as the impact of the issue, deadlines, and its severity. However, do not share the data itself, and be mindful of data publication guidelines when posting a screenshot of the data.
    • If the affected datasets have designated business and technical stewards, tag them in the Phabricator ticket. You can find this information in DataHub by searching for the specific dataset. If no stewards are assigned, reach out to Will Doran and/or Andreas Hoelzl.
    • If it is a critical data issue, post a note in the #data-engineering-collab Slack channel, link the Phabricator ticket, and tag the data stewards (if assigned) as well as Will Doran and Andreas Hoelzl to escalate prioritization.

Triage

  • If the issue was automatically reported and handled as part of the OpsWeek protocol with routine remediation steps, resolve it as prescribed and communicate the resolution by responding to the original alert email. Post a note in the Slack #data-engineering-collab channel if the issue has affected end users.
  • Once the incident is reported in Phabricator (and Slack), if it is marked as high priority it will be reviewed through an escalated prioritization process within the business day by the EMs and PM; otherwise it will be assessed as part of regular sprint planning.
  • If it is clear that the data issue relates to a specific system, it will be assigned to the corresponding engineering team.
  • If the issue is related to a pipeline, involve the team that owns that pipeline.
  • If the issue requires further investigation, it often involves collaboration between the RDS and DPE teams, and possibly others. In that case, ensure that all the team managers are involved in designating the right people from their teams to support the triage and resolution.
  • If the data incident is not a transient, routine issue (such as a delay in data processing) and has non-trivial downstream impact, the business data steward is responsible for opening a data report in DQ Reports using the DQ Report Template, in collaboration with the technical data steward where additional incident details are needed. At this point the root cause and the remediation steps may not yet be known.
  • The data steward is responsible for providing an initial report and sharing it with the affected parties (Data Engineering, Product Analytics, Research, Community Growth’s Data, Evaluation and Learning) to establish recommendations and any remediation steps, by posting on #data-engineering-collab. Reshare the post on #working-with-data.
  • Define clear accountable parties and the hand-off protocol between the individuals doing the research (e.g., between data engineers and analysts).
  • To facilitate a quick resolution, first create a closed Slack conversation or, depending on the complexity of the issue, a dedicated private Slack channel.
  • Schedule coworking sessions to problem-solve, followed by regular check-ins. At each handoff, update the ticket with relevant information, reassigning it when applicable. Post an update in the Slack channel, tagging the business data steward and the new assignee.
  • Assess whether the data retention period needs to be temporarily extended to allow for data correction. Notify Legal to amend the policy and notify the corresponding parties.

Resolve

Review

  • Depending on the severity of the incident, the data steward will schedule a post-mortem with the goals of:
    • Reviewing how the incident process went and whether there are any improvements to be made
    • Establishing the root causes of the problem
    • Filing longer-term remediation tasks
    • Assessing additional data quality checks that should be added to prevent this or similar issues
    • Adding any troubleshooting queries to the query section associated with that dataset in DataHub (see the example after this list)
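
As an illustration of the kind of troubleshooting query worth saving to DataHub, the sketch below scans the last two weeks of a daily-partitioned table for row-count gaps and NULL-rate spikes. It is a hypothetical example: the table name, columns, and Hive-style date functions are assumptions, not a real production query.

```python
# Hypothetical troubleshooting query for a daily-partitioned Hive table.
# The table and column names below are placeholders, not real dataset names.
TABLE = "events.page_interactions"   # hypothetical dataset
KEY_COLUMN = "user_id"               # column whose NULL rate we want to watch

TROUBLESHOOTING_QUERY = f"""
SELECT
  dt,
  COUNT(*) AS row_count,
  SUM(CASE WHEN {KEY_COLUMN} IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_rate
FROM {TABLE}
WHERE dt >= DATE_SUB(CURRENT_DATE, 14)   -- last two weeks of partitions
GROUP BY dt
ORDER BY dt
"""

# Run this through your usual SQL client; days with a missing dt or an
# unusual null_rate are good starting points for the investigation.
print(TROUBLESHOOTING_QUERY)
```

Saving a query like this alongside the dataset in DataHub gives the next responder a head start on similar incidents.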