Incidents/20181104-s8-replication

Summary

Replication stopped on several s8 replicas (db1101, db2045, db1092, db1116, db1104, db1099, db1109) with the following error:


Error Duplicate entry 745452474-1295751 for key PRIMARY on query. Default database: wikidatawiki.

Manual intervention was needed: the extra row had to be deleted from the database on each affected replica.

The duplicate entry showed up in the table wikidatawiki.revision_comment_temp. The row can be found with:

select * from revision_comment_temp where revcomment_rev=745452474 and revcomment_comment_id=1295751;
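The manual fix described above could look roughly like the following on each broken replica. This is only a sketch: the exact statements that were run are not recorded in this report, and disabling binary logging for the session is an assumption made here so that the delete stays local to the replica.

```sql
-- Sketch only; run on an affected replica, never on the master.
-- Assumption: keep the manual delete out of this host's binary log.
SET SESSION sql_log_bin = 0;

-- Remove the extra row that collides with the incoming replicated insert.
DELETE FROM revision_comment_temp
 WHERE revcomment_rev = 745452474
   AND revcomment_comment_id = 1295751;

-- Resume replication and check that both replication threads are running.
START SLAVE;
SHOW SLAVE STATUS;
```

Once the row is gone, the replicated insert that previously failed with the duplicate-key error can be applied normally.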

The next day a table-level comparison was run between the s8 master (db1071) and one of the previously broken replicas (db1104) to see whether any more extra rows existed on the replicas that were not on the master. None were found.
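The report does not say which tool was used for the comparison. As an illustration only, a chunked comparison can be approximated by running the same aggregate query on both hosts and comparing the results. The range bounds below are hypothetical, and the checksum covers only the two key columns named in the query above.

```sql
-- Sketch only; run the identical statement on the master and on the replica.
-- A different row_count or chunk_checksum means the chunk differs between hosts.
SELECT COUNT(*) AS row_count,
       BIT_XOR(CRC32(CONCAT_WS('#', revcomment_rev, revcomment_comment_id))) AS chunk_checksum
  FROM revision_comment_temp
 WHERE revcomment_rev BETWEEN 745000000 AND 745999999;  -- hypothetical chunk bounds
```

A mismatching chunk can then be narrowed down by splitting its range and repeating the query. This only gives a meaningful answer if both hosts are at the same replication position or the chunk is not being written to.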

Timeline

(The times are in UTC)

  • 23:25: The errors start to show up in #wikimedia-operations (pages were probably sent out as well, but I did not find them in the Icinga event logs)
  • 23:38: Jynus removes the extra row on db1104 and restarts replication
  • 23:39: db1104 reports that replication is working
  • 23:42: Jynus removes the extra row on the other servers
  • 23:44: The rest of the servers report working replication

Conclusions

There may be hidden data inconsistencies in the rarely written parts of the database clusters. Our working theory is that the maintenance scripts running on s8 at the time (the ongoing comment migration) caused the issue.

Links to relevant documentation

https://phabricator.wikimedia.org/T208695

Actionables

There is an ongoing project for continuous monitoring of possible data drift across the databases (https://phabricator.wikimedia.org/T207253). If that service gets implemented, this kind of error could be detected before it causes an outage.