Talk:Incident documentation/20200407-Wikidata's wb items per site table dropped
DB backup restoration comment
I disagree with the following statement: "What went poorly? Restore from backup options were not appealing or fast."
It was the only option we had, and in fact it is how we fixed it, but it was very late at night in UTC and it is not straightforward to do, especially when such big tables are involved. There is surely room for improvement there, and in fact we are currently working on that, but restoring such a big table in production with all the uncertainty that we had at the time is hard: restore on the master? Lag? What happens with the writes that were happening at the time? And so on. Marostegui (talk) 13:25, 7 April 2020 (UTC)
- This is still wrong. You are using 2 SRE concepts which make no sense to me in this context. Availability: (total_time - downtime) / total_time over a period of time. Availability of the backup/restore system over the outage period was 100%. Coverage: amount of objects backed up (or recoverable) / total amount of objects. There was complete coverage of the dropped table (it was available on all backups, in duplicate, in 2 different geographical locations). This is not theoretical: a backup _was_ used to restore the table successfully. We have daily backups, which is the maximum amount we can physically store, and binary logs allowing us to do point-in-time recovery (again, not theoretical: we recovered the table as of the moment (transaction) just before the DROP was executed). I wasn't involved in the initial incident response, but if the decision then was not to recover at that point (for any reason), please do not blame the setup, especially when a recovery from backups *was* done the next day. If the point of this is to say that I (as a person) wasn't available to recover, please say so explicitly (and I won't take that personally, as in that case the issue is the coverage of people not knowing how to do that/deciding against that). -- Jcrespo 05:01, 8 April 2020 (UTC)
- Thanks for rewording it. I am still not sure I agree, though. The only way we'd have been able to recover this faster would have been, essentially, for both DBAs and developers to skip rest and start working right away on the recovery itself. But that is probably not realistic, especially in this case, given the timezones and the urgency of the table.
- Thankfully, dropping a table in production isn't something that happens often, and in such a case it is better to have more people around, especially fresh ones. Time zone coverage might have helped, but I am not really sure a single DBA in the US timezone (when this happened) would have solved it faster.
- Automation will certainly help fix this faster, but given the time this happened and where the expertise is located at the moment (both DBAs and WMDE), people needed to rest, and troubleshooting started only 4 hours after we decided we were up enough that the recovery could wait a few hours.
- Obviously, we were lucky this table wasn't super urgent; if revision had been the dropped table, the issue would have had to be tackled differently and with more urgency.
- Marostegui (talk) 05:50, 8 April 2020 (UTC)
- sql.php runs LoadExtensionSchemaUpdates, which basically means that one of the production scripts, sql.php (not update.php), which is used to debug the production database (an interactive shell so you can run DB queries against production), mistakenly runs update.php for extensions behind the scenes.
LoadExtensionSchemaUpdates is not supposed to run updates; it generates a task list and returns it so the updater can execute it (via a step which sql.php of course does not call). But the way that's done is that hook handlers get a reference to the updater and need to call utility methods like dropExtensionTable, which add that table to the task list. For each of those, there is a similarly named function (dropTable...) which takes the same parameters but performs the action on the spot. This would eventually be used by the updater when it executes the task list, but these methods are also public, and if you make the very easy mistake of calling the non-extension version in your extension update hook, you won't see any difference: when you test update.php, the change is done either way. If someone uses sql.php and does not have an extension with such a mistake installed, they won't see anything strange. If someone uses sql.php with an extension with such a mistake installed, but the database is fully up to date, dropTable will realize there is nothing to do and print something like "...some_table does not exist", which the user probably won't notice or understand the significance of. But if you run sql.php AND you have an extension installed which mistakenly calls the non-extension version instead of dropExtensionTable AND that schema change does apply to the database, boom. --tgr (talk) 13:46, 7 April 2020 (UTC)
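For readers who want the failure mode spelled out, here is a minimal Python sketch of the pattern described above. It is only an illustration, not MediaWiki code: the Updater class, the two hook functions and the two run_* helpers are hypothetical stand-ins, and only the distinction between a deferred "add to task list" method and an immediate "do it now" method is taken from the comment.

```python
class Updater:
    """Toy stand-in for a schema updater (hypothetical, not MediaWiki's class)."""

    def __init__(self, tables):
        self.tables = set(tables)  # tables that currently exist
        self.tasks = []            # deferred task list built by hook handlers
        self.log = []              # messages the user would see

    def drop_extension_table(self, name):
        # Deferred variant: only records the task, touches nothing yet.
        self.tasks.append(("drop_table", name))

    def drop_table(self, name):
        # Immediate variant: same parameter, but acts on the spot.
        if name not in self.tables:
            self.log.append(f"...{name} does not exist")
            return
        self.tables.remove(name)
        self.log.append(f"dropped {name}")

    def do_updates(self):
        # Executes the task list; only the update script calls this.
        for action, name in self.tasks:
            getattr(self, action)(name)
        self.tasks.clear()


def correct_hook(updater):
    updater.drop_extension_table("some_table")  # queues the drop


def buggy_hook(updater):
    updater.drop_table("some_table")  # easy mistake: drops immediately


def run_update_script(hook, tables):
    # Models the update script: collect tasks, then execute them.
    updater = Updater(tables)
    hook(updater)
    updater.do_updates()
    return updater


def run_debug_shell(hook, tables):
    # Models the debug shell: the hook runs, the task list is never executed.
    updater = Updater(tables)
    hook(updater)
    return updater
```

Under these assumptions, both hooks look identical when tested through run_update_script, which is why the mistake goes unnoticed; only run_debug_shell with buggy_hook and a database that still has the table actually loses data.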