Help:Toolforge/Database/Replica drift

From Wikitech

The replica databases are not exact copies of the production databases; their contents slowly drift away from the production contents. This shows up in various queries, but queries that involve recently deleted or restored pages seem to be affected the most. The impact is kept as small as possible by regular database re-imports.

If you encounter replica drift, please report it as a comment on phab:T138967.

What's happening?

In its current form, it seems impossible to keep the labs database both accurate and secure at the same time. It may never be possible to make labs 100% correct all the time, but we can hopefully make the issues rare enough, and spaced far enough apart, that they can be detected and solved quickly.

Why is this happening?

The cause of the drift is that certain MediaWiki queries, under certain circumstances, produce different results on labs and on production. Replicas do not simply repeat every query sent to the master server, and this means the databases drift apart over time.

For example, when a revision is undeleted, MediaWiki does it with a query along the lines of:
INSERT INTO revision SELECT * FROM archive WHERE ...

That query can produce different results when executed on different servers. For example, the new revision's id can differ because an auto_increment id was consumed by another connection on one server but not on another: if the locks are different, the ids are different, and the replicas drift.
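The effect can be sketched with a toy simulation. This is not MediaWiki's actual schema; the table and column names (`archive`, `revision`, `rev_id`) are simplified stand-ins, and SQLite stands in for MySQL. Two databases start identical, but on the "master" an unrelated connection has consumed an auto-increment id; replaying the same `INSERT ... SELECT` statement on both then assigns different ids:

```python
import sqlite3

def make_db():
    # Each server starts with the same schema and contents: an "archive" of
    # deleted revisions and a "revision" table with an auto-increment key.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE archive (ar_text TEXT)")
    db.execute("CREATE TABLE revision "
               "(rev_id INTEGER PRIMARY KEY AUTOINCREMENT, rev_text TEXT)")
    db.execute("INSERT INTO archive VALUES ('deleted page text')")
    return db

master = make_db()
replica = make_db()

# On the master only, another connection inserted and then deleted a row,
# consuming auto-increment id 1 (SQLite's AUTOINCREMENT never reuses ids).
master.execute("INSERT INTO revision (rev_text) VALUES ('temp')")
master.execute("DELETE FROM revision WHERE rev_text = 'temp'")

# Statement-based replication replays the exact same statement on both
# servers -- but the auto-increment counters now differ.
stmt = "INSERT INTO revision (rev_text) SELECT ar_text FROM archive"
master.execute(stmt)
replica.execute(stmt)

print(master.execute("SELECT rev_id FROM revision").fetchall())   # [(2,)]
print(replica.execute("SELECT rev_id FROM revision").fetchall())  # [(1,)]
```

The same row now has rev_id 2 on the master and rev_id 1 on the replica, and every later statement that references that id can widen the gap.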

Based on reports, the main offenders are probably deleting/undeleting pages and auto_increment ids. It would be best if this were solved on the MediaWiki side. (See phab:T108255, phab:T112637)

Why doesn't this happen on production?

The solution in production is the nuclear option: if a server is detected to have diverged, we wipe it and re-clone it from another server, which takes about an hour. This is not possible for labs due to several differences between production and labs:

  • We cannot simply copy from production because the contents have to be sanitized.
  • The copy cannot be a binary copy because the labs servers use extra compression.

How can things get better?

  • 3 new database servers have been ordered. With these servers, we can migrate back to InnoDB, which will reduce clone time dramatically. See phab:T136860.
  • Switching to row based replication in production would also help.
  • Full reimports will fix old inconsistencies and bring us to a more reasonable current state.
  • MediaWiki should move to safe queries (phab:T108255, phab:T112637).
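For reference, the row-based replication option mentioned above amounts to changing the MySQL binary log format on the masters, along these lines (a sketch only; the real production change involves more than this one setting):

```ini
[mysqld]
# Log the actual row changes instead of the SQL statements, so
# non-deterministic statements replay identically on every replica.
binlog_format = ROW
```

With row-based logging, an `INSERT ... SELECT` is replicated as the concrete rows the master produced, ids included, rather than as a statement each replica re-executes for itself.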