Help:Toolforge/Database/Replica drift

From Wikitech

Replica drift was a recurring problem for the Wiki Replicas prior to the introduction of row-based replication (RBR) between the sanitarium server(s) and their upstream sources. The RBR used to populate the *.{analytics,web}.db.svc.eqiad.wmflabs servers will not allow arbitrary differences in data to be synchronized. If replication fails, all replication with the master server halts, which in turn raises an alert so that the failure is noticed and corrected.

Historic problem

Prior to the switch to RBR, the replicated databases were not exact copies of the production databases, and their contents slowly drifted from production. The drift was visible in various queries, but queries that involved recently deleted or restored pages seemed to be affected the most. Regular database re-imports kept the impact as small as possible.

Why did this happen?

The cause for the drift was that certain data-altering MediaWiki queries, under certain circumstances, produced different results on the Wiki Replicas and production. Replicas did not simply repeat every query sent to the master server, and this meant the databases would drift from each other.

For example, when MediaWiki undeletes a revision, it uses a query something like:
INSERT INTO revision SELECT * FROM archive WHERE ...

That query can produce different results when executed on different servers. The archive id can differ because, on one of the servers, that id was blocked by another connection; if the locks differ, the ids differ, and the replicas drift.
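The effect can be illustrated with a toy model. The sketch below is not MediaWiki or MariaDB code: ToyServer and replay_same_statement are hypothetical names, and the model assumes that each server picks its own auto_increment id when it executes the same insert. A connection that consumes an id on the master only is enough to make the two servers disagree.

```python
class ToyServer:
    """Hypothetical stand-in for a server with one auto_increment table."""

    def __init__(self):
        self.next_id = 1
        self.rows = {}  # id -> payload

    def reserve_id(self):
        """Consume the next id without storing a row (e.g. held by another connection)."""
        reserved = self.next_id
        self.next_id += 1
        return reserved

    def insert(self, payload):
        """Insert a row, letting this server choose the id itself."""
        new_id = self.reserve_id()
        self.rows[new_id] = payload
        return new_id


def replay_same_statement(master_has_concurrent_writer):
    """Run the same insert on a master and a replica and return both tables."""
    master, replica = ToyServer(), ToyServer()
    if master_has_concurrent_writer:
        # Assumption for illustration: another connection blocks one id,
        # and only on the master.
        master.reserve_id()
    master.insert("undeleted revision")
    replica.insert("undeleted revision")
    return master.rows, replica.rows


if __name__ == "__main__":
    for concurrent in (False, True):
        master_rows, replica_rows = replay_same_statement(concurrent)
        print(concurrent, master_rows, replica_rows)
```

Without the concurrent writer both servers store the row under the same id; with it, the master and the replica store the same payload under different ids, which is the kind of difference that accumulated as drift.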

Based on reports, the main offenders were probably deleting/undeleting pages and auto_increment ids. In the long term, this should be solved on the MediaWiki side. (See phab:T108255, phab:T112637)

Why doesn't this happen on production?

The solution in production is the nuclear option: if a server is detected to have a difference, we nuke it and clone it, which takes 1 hour. This is not possible in Cloud Services due to several differences between production and the Wiki Replicas:

  • We cannot simply copy from production because the table contents have to be sanitized.
  • The copy cannot be a binary copy because the Wiki Replica servers use extra compression.

How did things get better?

  • Three new database servers were ordered. With these servers, we migrated back to InnoDB tables, which reduced clone time dramatically.
  • We switched to row-based replication between the sanitarium servers and their upstream sources.
  • A full reimport was done to bring the servers to a stable starting state.
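Why row-based replication stops drift from spreading silently can be sketched with another toy model. This is an illustration under stated assumptions, not the actual replication code: apply_row_event, ReplicationHalted, and the (kind, id, before, after) event tuple are hypothetical. The idea is that a row event carries the concrete row rather than a statement to re-run, so a replica whose data does not match what the event expects stops instead of applying the change.

```python
class ReplicationHalted(Exception):
    """Hypothetical error: a row event could not be applied cleanly."""


def apply_row_event(replica_rows, event):
    """Apply one (kind, row_id, before, after) event to a dict of id -> payload.

    Raises ReplicationHalted when the replica's current row does not match
    what the event expects, instead of silently diverging further.
    """
    kind, row_id, before, after = event
    if kind == "insert":
        if row_id in replica_rows:
            raise ReplicationHalted(f"row {row_id} already exists")
        replica_rows[row_id] = after
    elif kind in ("update", "delete"):
        if row_id not in replica_rows or replica_rows[row_id] != before:
            raise ReplicationHalted(f"row {row_id} does not match the expected contents")
        if kind == "update":
            replica_rows[row_id] = after
        else:
            del replica_rows[row_id]
    else:
        raise ValueError(f"unknown event kind: {kind}")


if __name__ == "__main__":
    in_sync = {1: "old text"}
    apply_row_event(in_sync, ("update", 1, "old text", "new text"))
    print(in_sync)

    drifted = {1: "something else"}
    try:
        apply_row_event(drifted, ("update", 1, "old text", "new text"))
    except ReplicationHalted as error:
        print("halted:", error)
```

In this model a replica that is in sync applies the event, while a drifted one raises an error, mirroring the halt-and-alert behaviour described at the top of this page.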