Data Platform/Data Incident management
The process outlined below relates to handling data quality issues. If the data incident is a data breach (e.g. it involves unauthorized access to, disclosure of, or exfiltration of confidential data or personal information [as defined by the Privacy Policy]), follow the Privacy and Security breach protocols and file a protected Phabricator security report.
Data quality issues are problems that result from data processing.
The data incident process progresses through the following stages:
- Detect
- Alert
- Triage
- Resolve
- Review
Detect
A data issue is usually identified in the following ways:
- Via an automated data alerting mechanism. These alerts relate to data pipeline failures or automated data quality checks (see the sketch after this list).
- A data consumer or end user may observe that the data is not available, contains unexpected values (including PII), is inconsistent with other findings, or deviates from the usual trends in a way that indicates a possible data issue.
- A data steward or user may also detect an issue during a periodic review.
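To make the first bullet concrete, here is a minimal sketch of one common kind of automated data quality check: flagging the latest daily row count when it deviates too far from the trailing mean. Everything in it is an assumption for illustration; the table name, the fetch_daily_row_counts() helper, and the 25% threshold are hypothetical stand-ins, not the actual alerting infrastructure.

```python
"""Minimal sketch of a trend-based data quality check (hypothetical)."""

from statistics import mean


def fetch_daily_row_counts(table: str) -> list[int]:
    # Hypothetical stand-in for a query against the data lake,
    # returning one row count per day, oldest first.
    return [1_000_000, 1_020_000, 990_000, 1_010_000, 1_005_000, 998_000, 512_000]


def check_row_count_trend(table: str, threshold: float = 0.25) -> bool:
    """Return False (and print an alert) if the latest count deviates from
    the trailing mean of the preceding days by more than `threshold`."""
    *history, latest = fetch_daily_row_counts(table)
    baseline = mean(history)
    deviation = abs(latest - baseline) / baseline
    if deviation > threshold:
        print(f"ALERT: {table}: latest count {latest} deviates "
              f"{deviation:.0%} from trailing mean {baseline:,.0f}")
        return False
    return True


if __name__ == "__main__":
    check_row_count_trend("wmf.example_events")  # hypothetical table name
```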
Alert
- If the alert was part of the automated DQ alerting process, follow the OpsWeek protocol
- The OpsWeek engineer will review the issue within 2-12 hours of the alert, during regular business hours
- The OpsWeek engineer will respond to the alert email with next steps
- If the issue is related to a data pipeline and remediation steps are known, they will be applied and the resolution communicated via the alert email, as well as the Slack #data-engineering-collab channel if the issue has affected end users
- If the issue cannot be handled by routine remediation steps, proceed to the steps below.
- If there is a suspicion of a data issue but some ambiguity remains:
- Post your findings in the #data-engineering-collab Slack channel with your question
- If the dataset has a business and technical steward, tag them in your message
- Tag Will Doran (Continuity Page) and Andreas Hoelzl
- If it is established that there is a likely data issue, follow the steps below to file a Phabricator bug report
- The Data Platform team will acknowledge the post and provide any pointers within the day, during regular business hours
- If there is a data issue, follow the process below:
- File a Phabricator bug report (a hedged sketch of doing this programmatically follows this list). Provide as much information as possible, including the physical location of the data, initial findings, SQL statements used, and screenshots, as well as the impact of the issue, deadlines, and its severity. However, do not share the data itself, and be mindful of data publication guidelines if posting a screenshot of the data.
- If the affected datasets have designated business and technical stewards, tag them in the Phabricator ticket. You can find this information in DataHub by searching for the specific dataset. If there are no stewards assigned, reach out to Will Doran and/or Andreas Hoelzl.
- If it is a critical data issue, add a note in the #data-engineering-collab Slack channel, link the Phabricator ticket, and tag the data stewards (if assigned) as well as Will Doran and Andreas Hoelzl to escalate the prioritization
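For the filing step above, a report can also be created programmatically through Phabricator's Conduit API. The sketch below is a hedged illustration: maniphest.createtask is a real Conduit method, but the token, title, and description are placeholders, and the exact shape of the result payload should be verified against the Conduit documentation on the instance before relying on it.

```python
"""Hedged sketch: filing a data incident task via Phabricator's Conduit API."""

import requests

CONDUIT_URL = "https://phabricator.wikimedia.org/api/maniphest.createtask"


def file_data_incident(token: str, title: str, description: str) -> str:
    """Create a task and return a task reference such as 'T123456'."""
    response = requests.post(
        CONDUIT_URL,
        data={
            "api.token": token,
            "title": title,
            "description": description,
        },
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    # Conduit reports errors in the response body, not via HTTP status codes.
    if payload.get("error_code"):
        raise RuntimeError(payload["error_info"])
    # Assumption: the result carries the numeric task id.
    return f"T{payload['result']['id']}"


if __name__ == "__main__":
    task = file_data_incident(
        token="api-XXXXXXXXXXXXXXXXXXXX",  # placeholder; never hardcode real tokens
        title="Data incident: unexpected values in example dataset",
        description="Physical data location, initial findings, SQL used, impact ...",
    )
    print(f"Filed {task}")
```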
Triage
- If the issue was automatically reported and handled as part of the OpsWeek protocol with routine remediation steps, resolve as prescribed and communicate the resolution with a response to the original alert email. A note in the Slack #data-engineering-collab channel should be added if the issue has affected end users.
- Once the incident is reported in Phabricator (and Slack): if it is marked as high priority, it will be reviewed through an escalated prioritization process within the business day by the EMs and PM; otherwise, it will be assessed as part of regular sprint planning
- If it is clear that the data issue is related to a specific system, it will be assigned to the corresponding engineering team
- If the issue is related to a pipeline, involve the team that owns that pipeline
- If the issue requires further investigation, it often involves collaboration between the RDS, DPE, and possibly other teams. In that case, ensure that all the team managers are involved in designating the right parties from their teams to support the triage and resolution
- If the data incident is not a transient, routine issue (such as a delay in data processing) and has non-trivial downstream impact, the business data steward is responsible for opening a data report in DQ Reports using the DQ Report Template, in collaboration with the technical data steward where additional incident details are needed. At this point the root cause or the remediation steps may not be known. The data steward is responsible for providing an initial report and sharing it with the affected parties (Data Engineering, Product Analytics, Research, Community Growth’s Data, Evaluation and Learning) to establish recommendations and any remediation steps by posting on #data-engineering-collab. Reshare the post on #working-with-data.
- Clearly identify accountable parties and the hand-off protocol between individuals who are doing research (e.g. between data engineers and analysts), and specifically:
- Incident coordinator - this is usually the data steward, but can be anyone else in the team
- Incident response team, and in particular the lead analyst and lead data engineer
- To facilitate a quick resolution, first create a closed Slack conversation or, depending on the complexity of the issue, a dedicated private Slack channel.
- Schedule coworking sessions to problem-solve, followed by regular check-ins. At each handoff, update the ticket with relevant information, reassigning the ticket when applicable. Post an update in the Slack channel tagging the Business Data Steward and the new assignee.
- Assess whether the data retention period needs to be temporarily extended to allow for data correction. Notify Legal to amend the policy and notify corresponding parties.
Resolve
- Assess whether there is an expedited interim solution that can be applied to address the issue, and implement it accordingly
- Address the cause of the issue
- Correct the data when possible, including affected downstream datasets
- Once the issue is resolved, update the incident status and notify all parties, explaining any follow-up actions
- Apply any date filters and annotations as documented in https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Data_Issues
- Add testing (unit or data quality tests; see the sketch after this list)
- Update the DQ report and publish the report on wikitech under https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Data_Issues
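As an illustration of the testing step above, here is a minimal sketch of the kind of regression test worth adding once an incident is resolved. The transformation under test (normalize_country) and the failure mode it guards against (empty strings leaking through) are hypothetical.

```python
"""Hypothetical regression test added after resolving a data incident."""

import pytest


def normalize_country(code: str | None) -> str:
    # Hypothetical pipeline transformation; an incident of this shape
    # might have shown empty strings bypassing the 'Unknown' sentinel.
    if code is None or not code.strip():
        return "Unknown"
    return code.strip().upper()


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("de", "DE"),
        (" us ", "US"),
        ("", "Unknown"),    # the case the hypothetical incident exposed
        ("  ", "Unknown"),
        (None, "Unknown"),
    ],
)
def test_normalize_country_never_returns_empty(raw, expected):
    assert normalize_country(raw) == expected
```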
Review
- Depending on the severity of the incident, the data steward will schedule a post-mortem with the goal of establishing:
- How the incident process went and if there are any improvements to be made
- The root causes of the problem
- File longer-term remediation tasks
- Assess additional data quality checks that should be added to prevent this or similar issues
- Add any troubleshooting queries to the query section associated with that dataset in DataHub