Background information and examples of database schema changes

Schema changes are usually attended in the order they are requested (assuming they can be applied right away). If in doubt, request it/notify #DBAs as early as possible. SREs follow a 3-month goal schedule, so task prioritization and support needs are discussed at the end of March, June, September and December. Those teams that communicated with SREs Managers or the DBA team before the 3-month period explicitly will have a guaranteed window and dedicated time in the following 3 months. Those that communicate late in the 3-month schedule will be queued in order and susceptible of delays due to variable workload and variable available time (e.g. vacations) time.

Alters to tables will be performed as soon as possible within the available resources. If the schema change has to be done in a particular timeframe (for example, depending on code deployment), notifying the DBAs in advance is strongly suggested (an easy change can take 1 week, but a very complex one can take months).

Database schema changes are usually quite time consuming and need to be requested early enough to allow for discussion (as needed), taking into account ongoing work and general task prioritization while supporting the overall team.

Be aware that complex schema changes can take much longer than others that are ‘easier’ to do. Some changes can take a week, other changes can take multiple months (or even years) to fully be in production. The update / alteration to the database tables will be performed as soon as possible using the available resources.

The SRE team follows a progressive three (3) month work schedule, meaning that upcoming tasking is determined near the end of each of the following months: March, June, September and December, please note:

Teams that communicate their schema changes early (before the next next 3 month planning session) will have a planned time slot within the next 3 month goal cycle to get started on the changes.
Teams that communicate the need for a schema change late in the 3 month schedule will be queued up in the order that changes are requested. These changes are subject to delays due to variable schedules, workloads, and time.
If in doubt, request all schema changes early, once a proposed feature has been scoped and is nearly ready to be started. This might mean requesting a complex change months in advance.
Be sure to notify additional teams (other than SRE), as they also might need advance notification that a complex change to a schema is requested and pending. Some of those teams might be Core Platform, Analytics, the Data Dumps team, and others.
If this db schema change is for your own database (and not affecting other teams / individuals), you probably don’t need to go through this process.

Examples of schema changes

{add easy and complex samples here with phab tickets. also can add in changes that were thought to be easy or hard and very much turned into something else}

What makes a schema change easy or hard?

In general, taking a look at https://dev.mysql.com/doc/refman/5.7/en/innodb-online-ddl-operations.html and spotting your operation can give you a good indication if the schema change can be done in a relatively fast way. If it permits concurrent DDL operations it means it most likely can be done without master failover, not in all cases and not for all tables though, see this example: https://phabricator.wikimedia.org/T210713#4967984).

If the operation that is needed does not permit online DDL operations, it means the table will be locked for the amount of time the alter tables (if it an used table, this is normally more than 5 seconds, which is a no-go). If this is the case, it means we have to do a master failover (we currently have 8 masters) to be able to do the alter, and this might require several weeks (as it requires read only time on the wikis).

Relatively easy schema change

Add/remove an index
Add a value to an enum column (at the end of the list)

Complex schema change

Add a column (remove a column is complex but it is generally a low priority ticket).
- Despite the fact that adding a column can be done online, sometimes if the table is very active and has many concurrent transactions, it can fail requiring a master failover. See this example: https://phabricator.wikimedia.org/T210713#4967984

Tables that that can hit this problem:

Change a primary key
Add an unique key
Changing a column data type.

Examples of already done schema changes

Example of an easy schema change

Adding an index to a relatively small table (change_tag): T203709
- What made this schema change easy?
  - Usage of the template provided, in a very clear way.
  - Adding an index is a relatively fast operation (depending on the table size)
  - Adding an index doesn't require a table rebuilt.
  - Table wasn't too big (biggest one was on wikidatawiki and it was 44GB there).
- How long did it take to fully finish the schema change?
  - It took around 2 months (as there were DBA holidays in between), but without those holidays it would have taken around 2-3 weeks, which is pretty fast for a schema change. Even if it is a small table and it doesn't require a table being rebuilt, it still requires us to depool/repool all the MySQL slaves to be able to deal with any possible table metadata lock. As of today depooling a slave takes time (although the SRE Team is working on a faster way to depool hosts which won't need a MediaWiki deploy each time).

Example of a difficult schema change

Changing data type on several column: https://phabricator.wikimedia.org/T187089#3966227
- What made this schema change
  - We had to change datatype on several columns, all of them in use, so given that changing a data type isn't an operation that allows concurrent DDL operations (https://dev.mysql.com/doc/refman/5.7/en/innodb-online-ddl-operations.html#online-ddl-column-operations) that requires to failover all the masters in the active datacenter. As failing over masters requires read only time for the wikis affected, and we had a full datacenter failover kind of scheduled we decided to wait for it, so we could alter the pending masters once they were passive. The schema change was also complex also because not all the wikis required the same changes, as they were some historic drifts between some old and newer wikis.

How long did it take to fully finish the schema change?
- It took around 7 months since we started to work on that schema change to be able to fully get it done.

Wikimedia project tags

These are used to request the attention of an operator and are proactively monitored by WMF DBAs:

DBA aka #Wikimedia-Database component: Tickets related to database administration of the WMF, and issues found on them. E.g.: "Please recover the table X on day Y.". "db1022 is down" "db1047 has lag"
blocked-on-schema-change tag: A schema change has been fully agreed (equivalent to +2, even if it is not yet merged), and we are waiting for the DBAs to perform the change on the WMF databases. Please note that a complex schema change could take up several months to be performed once it has started to be applied.

MediaWiki project tags

These are freely managed by MediaWiki developers and are not proactively monitored by WMF DBAs, as they may or not be merged, or may or may not be used by WMF wikis:

MediaWiki-Database component: Tickets related to programming/fixing issues of all kinds of database backends. DBAs can and will be happy to help with this, but have to be notified.
schema-change tag: a change in MediaWiki that is work in progress. Wikimedia DBAs should be (ideally) taken into consideration for possible optimizations, potential problems, etc.

Advice on schema changes

All tables must have a primary key. That is not a suggestion, it is a MediaWiki policy. Primary keys prevent duplicate rows, allow better performance for ROW-based replication, and are required to perform online schema changes by many tools. Add one to your table before it gets too big.
Do not break compatibility with existing code. Usually, that forces to perform two code deployments, one where code is compatible with new and old structure, and the final one. While it is a lot of more effort, please take into account that schema changes are not transactional, they could take hours or even days to be deployed on all servers, on all wikis. Adding new columns at the end of the table is another way to do it, as code should be compatible with that. Once data has been migrated or filled in, the old columns could be dropped.
If a table is deployed to multiple wikis/databases, any schema changes must be applied to all wikis where that table is present for consistency.
New indexes, at least on core functionality, should be documented on tables.json. This improves maintainability and allows to detect redundant indexes or perform further alter changes with better context of what each index is used for.
DBAs are here to help, not to be an obstacle. If you have any doubt about a schema change (data types, performance, etc.), just ask.
Adding new columns at the end is preferred, although not a rule

Potential dangers of schema changes

Certain ALTER operations require to stop writes for a while while they are ongoing. For the master, this means that new edits and updates to the wikis are stopped or fail, for the slaves, it means replication lag, which can lead to other kind of bad consequences, such as querying stale data, or, with the protections put in place, not being able to read at all (as master reads are avoided as much as possible).
ALTER TABLES on large tables can take days to execute (currently, ALTERing the enwiki revision table takes 4-5 days per server. We have 20 enwiki servers.)
Even if an online alter table is possible (with the built-in online ALTER TABLE functionality or pt-online-schema-change), or the table is very small to not care about online changes, in mysql there exists a failsafe called metadata locking that avoids SELECTs running with stale metadata information. This metadata-locking means that, while SELECTS are ongoing, the ALTER TABLE cannot start, and it is queued. While the alter table is queued, all subsequent transactions are also blocked. Please note that tools like pt-online-schema change, while online, requires also metadata locking on the original table due to trigger creation. This happens on tables such as the image table on Commons, and other popular tables.
Schema changes can impact the performance of the server while ongoing, making it more susceptible to lag problems or overload. In order to perform a schema change successfully it is needed to prepare the deployment by warming up the tables on the slaves to minimize lag issues, and configure the online schema change tool adequately.
All schema change, like any code deployment, is a source of potential application breakage. Sometimes, in the most strange ways: deleting an apparently unused index can lead to a server failure if it was, in-fact, used and the application starts performing full table scans, maximizing connection count. Adding a new index can cause a better index to stop being used due to confusion of the mysql query planner. Please note that testing on beta is a requirement, but not a guarantee that things will work in production due to the high amount of traffic and the large amount of rows on certain tables (query plan is dynamic and depends on the status of the table, it is not static).
Schema changes, unlike code, cannot be rolled back. Only the reverse schema change can be executed again, assuming data has not been lost in the process. What it is worse, changes done with ALTER TABLE, if failed or cancelled, start rollbacking the process, that could take the same or more than applying the changes in the first place.
Sometimes, adding new columns has a lot of overhead in space consumed, and that could lead to server failure if the schema change fails do the lack of disk space. This is particularly impacting on large tables such as revision or page.
Schema changes have to be coordinated so two incompatible changes are not running at the same time. Also, by grouping schema changes happening on the same table, usually a huge speed up is achieved and the applications is for less time in a degraded state.

For all those reasons, which can be summarized on assuring our system's reliability, schema changes should be carefully reviewed and applied by a Database administrator, or by someone at Operations/Platform engineering to avoid larger issues.

What is a database schema change/examples