News/Wiki Replica c1 and c3 shutdown

The labsdb1001.eqiad.wmnet (aka c1.labsdb) and labsdb1003.eqiad.wmnet (aka c3.labsdb) servers were shut down and permanently removed from service. This page records the plan that was followed before their shutdown.

TL;DR

  • Change your tools and scripts to use:
    • *.web.db.svc.eqiad.wmflabs (real-time response needed)
    • *.analytics.db.svc.eqiad.wmflabs (batch jobs; long queries)
  • Replace * with either a shard name (e.g. s1) or a wikidb name (e.g. enwiki).
  • The new servers do not support user-created databases/tables because replication can't be guaranteed. See T156869 and below for more information.
  • Migrate your user-created tables to tools.db.svc.eqiad.wmflabs (also known as tools.labsdb) and JOIN in application code rather than inside the database.
    • Find your tool(s) in the first 2 tabs at https://tool-db-usage.toolforge.org/. labsdb1001 users can also find their usernames in a list at phab:P6184 to make it easier to correlate the pXXX/sXXX/uXXX strings with tools. (labsdb1003 will have the same next week.)
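
The naming scheme above can be sketched as a small helper. This is only an illustration: the function name `replica_host` and the `workload` argument are hypothetical, and only the hostname patterns themselves come from this page.

```python
# Hypothetical helper; only the hostname patterns are taken from this page.
DOMAIN = "db.svc.eqiad.wmflabs"
WORKLOADS = ("web", "analytics")


def replica_host(name: str, workload: str = "web") -> str:
    """Build a Wiki Replica service name.

    name     -- a shard name (e.g. "s1") or a wikidb name (e.g. "enwiki")
    workload -- "web" for real-time responses, "analytics" for batch jobs
                and long queries
    """
    if workload not in WORKLOADS:
        raise ValueError(f"workload must be one of {WORKLOADS}, got {workload!r}")
    return f"{name}.{workload}.{DOMAIN}"


print(replica_host("enwiki"))           # interactive tool
print(replica_host("s1", "analytics"))  # batch job
```

Keeping the choice between web and analytics in one place makes it easier to move a long-running query to the analytics service later without hunting for hard-coded hostnames.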


What is changing?

Monday 2017-10-30, 14:30 UTC (done)
  • Reboot labsdb1001.eqiad.wmnet (aka c1.labsdb) for kernel updates (task T168584)
  • There is a possibility of catastrophic hardware failure in this reboot. There will be no way to recover the server or the data it currently hosts if that happens.
Tuesday 2017-11-07, 14:30 UTC (not done)
  • Reboot labsdb1003.eqiad.wmnet (aka c3.labsdb) for kernel updates (task T168584)
  • Due to the failure of labsdb1001.eqiad.wmnet following its reboot, the reboot of labsdb1003.eqiad.wmnet has been cancelled.
Wednesday 2017-12-13 (done)
  • *.labsdb service names switched to point at *.web.db.svc.eqiad.wmflabs equivalents.
  • User created tables will not be allowed on the new servers.
Thursday 2017-12-14 (done)
  • DBAs will stop replication from production hosts to labsdb1003.eqiad.wmnet
  • DBAs will make databases on labsdb1003.eqiad.wmnet read-only for all users
Wednesday 2018-01-17
  • labsdb1001.eqiad.wmnet removed from service permanently.
  • labsdb1003.eqiad.wmnet removed from service permanently.
  • c1.labsdb service name will be removed from DNS.
  • c3.labsdb service name will be removed from DNS.

Why are we doing this?

There are two clusters of physical servers which provide the Wiki Replicas to Toolforge and other Cloud VPS users. The older of the two clusters consists of the physical hosts labsdb1001.eqiad.wmnet and labsdb1003.eqiad.wmnet. These hosts are also known by the aliases c1.labsdb and c3.labsdb in documentation and application code. These two hosts are among the oldest hardware still operating in our production server farm. The hardware is out of warranty and old enough that we are concerned that it could fail catastrophically at any time.

As announced on 2017-09-25, a new cluster of physical servers is ready to replace the older hardware. The move comes with one breaking change, however: the new servers will not allow users to create their own databases/tables co-located with the replicated content. Some tools used this feature of the older database servers to improve performance by building intermediate tables that could then be JOINed to other tables to produce certain results.

We looked for solutions that would allow us to replicate user-created data across the three servers, but we could not come up with one that would guarantee success. The user-created tables on the current servers are not backed up or replicated and have always carried the disclaimer that they may disappear at any time. With the improvements in our ability to fail over and rebalance traffic under load, it is more likely on the new cluster that these tables would randomly appear and disappear from the point of view of a given user. This kind of disruption would break tools if we allowed it. Disallowing the former functionality seems the safer solution for everyone.

User-created databases and tables are still supported on the tools.db.svc.eqiad.wmflabs server (also known as tools.labsdb). If you are using tables co-located on the current c1.labsdb or c3.labsdb hosts, we recommend updating your tools and scripts to keep all user-managed data on tools.db.svc.eqiad.wmflabs and to join replica data with user-created data in application space rather than with cross-database joins.
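
A minimal sketch of what joining in application space can look like follows. It rests on several assumptions: two in-memory SQLite connections stand in for a Wiki Replica host and for tools.db.svc.eqiad.wmflabs so that the example runs anywhere, the `tool_scores` table and the function name are hypothetical, and a real tool would connect with a MySQL client library instead.

```python
import sqlite3

# Stand-ins for the two separate servers (assumption: a real tool would open
# MySQL connections to a Wiki Replica host and to the tools.db host instead).
replica = sqlite3.connect(":memory:")
toolsdb = sqlite3.connect(":memory:")

# Replicated wiki content lives on one server...
replica.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
replica.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [(1, "Main_Page"), (2, "Sandbox"), (3, "Help")],
)

# ...and the tool's own (hypothetical) table lives on the other.
toolsdb.execute("CREATE TABLE tool_scores (page_id INTEGER PRIMARY KEY, score REAL)")
toolsdb.executemany("INSERT INTO tool_scores VALUES (?, ?)", [(1, 0.9), (3, 0.4)])


def join_in_application(replica_conn, tools_conn):
    """Combine rows from both servers without a cross-database JOIN."""
    # Step 1: read the user-managed rows into a dict keyed by page_id.
    scores = dict(tools_conn.execute("SELECT page_id, score FROM tool_scores"))
    if not scores:
        return []
    # Step 2: fetch only the matching replica rows, using bound parameters.
    placeholders = ",".join("?" for _ in scores)
    rows = replica_conn.execute(
        f"SELECT page_id, page_title FROM page "
        f"WHERE page_id IN ({placeholders}) ORDER BY page_id",
        list(scores),
    )
    # Step 3: merge the two result sets in Python.
    return [(title, scores[page_id]) for page_id, title in rows]


result = join_in_application(replica, toolsdb)
print(result)
```

For large key sets, the `IN (...)` list would need to be sent in batches; the batch size is a tuning choice this sketch leaves out.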