Obsolete:Toolforge/Database/Replica drift

This page is historical and documents a problem which has been addressed.

Overview

This page documents an issue known as Replica drift.

This problem was solved for the Wiki Replicas by introducing row-based replication from production, so most of these issues, if not all, should have already disappeared (only anecdotal cases could reappear).

If you detect what you think is drift, report a ticket on Wikimedia's Phabricator (https://phabricator.wikimedia.org) with the SQL query, the expected results, and the obtained results, tagged with #data-services and #dba.
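
For illustration, a drift report might include a query of the following shape (a hypothetical sketch: the wiki, the page id, and the expected/obtained notes below are placeholders, not taken from a real report):

 -- Hypothetical example run against a Wiki Replica database (e.g. enwiki_p);
 -- 12345 is a placeholder page id.
 SELECT rev_id, rev_timestamp
 FROM revision
 WHERE rev_page = 12345
 ORDER BY rev_id DESC
 LIMIT 5;
 -- Expected results: the latest revision ids shown in the page history on the wiki.
 -- Obtained results: missing or stale rows, which is what would suggest drift.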

History

Replica drift was a recurring problem for the Wiki Replicas prior to the introduction of row-based replication (RBR) between the sanitarium server(s) and their upstream sources. The RBR replication used to populate the *.{analytics,web}.db.svc.eqiad1.wikimedia.cloud servers does not allow arbitrary differences in data to be synchronized: if there is a replication failure, all replication from the master server halts, which in turn raises an alert that is noticed and corrected.
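
As a rough sketch of what row-based replication means at the MariaDB level (these are generic MariaDB statements, not the exact production configuration):

 -- Statement-based replication replays the SQL text on the replica; row-based
 -- replication ships the actual changed row images, so a mismatched row stops
 -- replication instead of letting the copies silently diverge.
 SHOW GLOBAL VARIABLES LIKE 'binlog_format';   -- STATEMENT, MIXED or ROW
 SET GLOBAL binlog_format = 'ROW';             -- or binlog_format = ROW in my.cnf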

Prior to the switch to RBR, the replicated databases were not exact copies of the production databases, and their contents slowly drifted away from production. This was visible in various queries, but queries that involved recently deleted or restored pages seemed to be affected the most. The impact was kept as small as possible by regular database re-imports.

Why did this happen?

The cause of the drift was that certain data-altering MediaWiki queries, under certain circumstances, produced different results on the Wiki Replicas than on production. With statement-based replication, the replicas did not receive copies of the changed rows; they simply repeated every query sent to the master server, so non-deterministic queries made the databases drift apart.

For example, when a revision is undeleted by MediaWiki, it is done with a query something like:

INSERT INTO revision SELECT * FROM archive WHERE ...

That query can produce different output when executed on different servers: the new auto_increment id assigned to the copied row can differ, because on one server that id was already claimed (blocked) by another connection and on the other it was not. If the locks differ, the ids differ, and the replicas drift.
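
A minimal, self-contained sketch of the effect (hypothetical *_demo tables, not the real MediaWiki schema):

 CREATE TABLE archive_demo (ar_id INT, ar_text VARCHAR(32));
 CREATE TABLE revision_demo (
     rev_id INT AUTO_INCREMENT PRIMARY KEY,
     rev_text VARCHAR(32)
 );
 INSERT INTO archive_demo VALUES (1, 'restored revision');
 -- The undelete-style copy; rev_id is assigned by auto_increment:
 INSERT INTO revision_demo (rev_text)
 SELECT ar_text FROM archive_demo;
 -- Under statement-based replication the replica re-executes this statement.
 -- If concurrent inserts on the master claimed auto_increment values in a
 -- different order, the replica hands out different rev_id values and the
 -- two copies drift apart.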

Based on reports, the main offenders were probably deleting/undeleting pages and auto_increment ids. In the long term, this should be solved on the MediaWiki side. (See phab:T108255, phab:T112637)

Why doesn't this happen on production?

The solution in production is the nuclear option: if a server is detected to have a difference, we nuke it and clone it, which takes about an hour (a sketch of such a detection check follows the list below). This is not possible in Cloud Services due to several differences between production and the Wiki Replicas:

  • We cannot simply copy from production because the table contents have to be sanitized.
  • The copy cannot be a binary copy because the Wiki Replica servers use extra compression.
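
For context, one generic way such a difference can be spotted (a sketch only; this is not necessarily how production detects drift):

 -- Run on both the master and the suspect replica and compare the output;
 -- differing checksums mean the table contents have diverged.
 CHECKSUM TABLE revision;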

How did things get better?

  • Three new database servers were ordered. With these servers, we migrated back to InnoDB tables, which reduced clone time dramatically (a table-inspection sketch follows this list).
  • We switched to row-based replication between the sanitarium servers and their upstream sources.
  • A full reimport was done to bring the servers to a stable starting state.
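
As an aside, the storage engine and row format (compression) of the tables on a server can be inspected with a standard information_schema query (the schema name below is a placeholder):

 SELECT table_name, engine, row_format
 FROM information_schema.tables
 WHERE table_schema = 'enwiki'
 LIMIT 10;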

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

  • Discuss and receive general support
  • Stay aware of critical changes and plans
  • Track work tasks and report bugs: use a subproject of the #Cloud-Services Phabricator project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself
  • Read stories and WMCS blog posts: see the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)