Help:Toolforge/Database/Replica drift

From Wikitech


This page documents an issue known as Replica drift.

This problem was solved for the Wiki Replicas by introducing row-based replication from production, so most issues, if not all, should have already disappeared; only anecdotal cases may reappear.

If you detect what you think is drift, file a ticket on Wikimedia's Phabricator with the SQL query, the expected results, and the obtained results, tagged with #data-services and #dba.


Replica drift was a recurring problem for the Wiki Replicas prior to the introduction of row-based replication (RBR) between the sanitarium server(s) and their upstream sources. The RBR replication used to populate the *.{analytics,web} servers does not allow arbitrary differences in data to be synchronized. If there is a replication failure, all replication from the master server halts, which in turn raises an alert that is noticed and corrected.

Prior to the switch to RBR, the replicated databases were not exact copies of the production database, which caused the database to slowly drift from the production contents. This was visible in various queries, but queries that involved recently deleted/restored pages seemed to be affected the most. The impact of this was kept as small as possible by regular database re-imports.

Why did this happen?

The cause of the drift was that certain data-altering MediaWiki queries, under certain circumstances, produced different results on the Wiki Replicas than on production. With statement-based replication, replicas repeat the queries sent to the master server rather than copying the resulting rows, so any query with a non-deterministic result caused the databases to drift apart.

For example, when a revision is undeleted by MediaWiki, it is done with a query something like:

INSERT INTO revision SELECT * FROM archive WHERE ...

That query can produce different results when executed on different servers: the new id comes from an auto_increment counter, which can differ between servers if that id was held by another connection. If the locks are different, the ids are different, and the replicas drift.

Based on reports, the main offenders were probably deleting/undeleting pages and auto_increment ids. In the long term, this should be solved on the MediaWiki side. (See phab:T108255, phab:T112637)
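The auto_increment mechanism above can be sketched with a small simulation. This is an illustration only, using SQLite in place of MariaDB and a hypothetical two-column schema rather than the real MediaWiki archive/revision tables: a concurrent connection on the master consumes an id, so replaying the same INSERT ... SELECT statement on both servers yields different ids.

```python
import sqlite3

def make_db():
    # Hypothetical minimal stand-ins for the archive and revision tables.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE archive (ar_id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)")
    db.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)")
    return db

master = make_db()
replica = make_db()

# On the master only, a concurrent connection briefly claims a revision id
# (insert then delete), advancing the auto_increment counter past 1.
master.execute("INSERT INTO revision (title) VALUES ('temp')")
master.execute("DELETE FROM revision WHERE title = 'temp'")

# Statement-based replication replays the same statements on both servers.
for db in (master, replica):
    db.execute("INSERT INTO archive (title) VALUES ('Deleted_page')")
    db.execute("INSERT INTO revision (title) SELECT title FROM archive")

master_id = master.execute("SELECT rev_id FROM revision").fetchone()[0]
replica_id = replica.execute("SELECT rev_id FROM revision").fetchone()[0]
# master_id and replica_id now differ: the replica has drifted.
```

Row-based replication avoids this by shipping the resulting rows, including the assigned id, instead of replaying the statement.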

Why doesn't this happen on production?

The solution in production is the nuclear option: if a server is detected to have a difference, we nuke it and clone it, which takes 1 hour. This is not possible in Cloud Services due to several differences between production and the Wiki Replicas:

  • We cannot simply copy from production because the table contents have to be sanitized.
  • The copy cannot be a binary copy because the Wiki Replica servers use extra compression.

How did things get better?

  • Three new database servers were ordered. With these servers, we migrated back to InnoDB tables, which reduced clone time dramatically.
  • We switched to row-based replication between the sanitarium servers and their upstream sources.
  • A full reimport was done to bring the servers to a stable starting state.

Communication and support

We communicate and provide support through several primary channels. Please reach out with questions and to join the conversation.

Communicate with us
Way                       Connect            Best for
Phabricator Workboard     #Cloud-Services    Task tracking and bug reporting
IRC Channel               #wikimedia-cloud   General discussion and support
                          (Telegram and Mattermost bridges available)
Mailing List              cloud@             Information about ongoing initiatives, general discussion and support
Announcement emails       cloud-announce@    Information about critical changes (all messages mirrored to cloud@)
News wiki page            News               Information about major near-term plans
Cloud Services Blog       Clouds & Unicorns  Learning more details about some of our work
Wikimedia Technical Blog                     News and stories from the Wikimedia technical movement